csv - Spark RDD External Storage
I have written a Python program sum.py that sums the numbers in each CSV file in a directory data. I am going to use Apache Spark on Amazon Web Services (AWS) to parallelize the summation process for each CSV file. I have done the following steps:
- I've created 1 master and 2 slave nodes on AWS.
- I used the bash command
$ scp -r -i my-key-pair.pem my_dir root@host_name
to upload the directory my_dir onto the master node of the AWS cluster. The folder my_dir contains 2 sub-directories, code and data, in which code contains the Python program sum.py and data contains the CSV files.
- I've logged in to the AWS master node, and there I used the bash command
$ ./spark/copy-dir /my_dir/code/
to send the code directory code, which contains sum.py, to the slave nodes.
- On the AWS master node, I've put the directory data containing the CSV files into HDFS using
$ ./ephemeral-hdfs/bin/hadoop fs -put /root/my_dir/data/
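(A hedged aside, not part of the original steps: to confirm where the uploaded files actually landed in HDFS, you can list it from the master node, e.g.
$ ./ephemeral-hdfs/bin/hadoop fs -ls /
and drill into the directories it shows until you find the uploaded data directory.)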
Now when I submit the application on the AWS master node with
$ ./spark-submit ~/my_dir/code/sum.py
it shows an error that a worker node cannot find the CSV files. However, after I also send the data directory data to the slave nodes using the copy-dir command, everything works perfectly.
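For context, here is a minimal, hypothetical sketch of what such a sum.py could look like (the actual code is not shown in the question). It assumes each CSV file contains plain numbers and is read with sc.textFile from a local path, which is exactly the kind of setup that fails until every worker has its own copy of the files:

import glob
from pyspark import SparkContext

sc = SparkContext(appName="SumCSV")

# Sum the numbers in each CSV file found on the driver's local disk.
for path in glob.glob("/root/my_dir/data/*.csv"):
    rdd = sc.textFile(path)  # a plain local path must exist on every worker
    total = (rdd.flatMap(lambda line: line.split(","))
                .filter(lambda x: x.strip() != "")
                .map(float)
                .sum())
    print(path, total)

sc.stop()

With a local path like this, each executor tries to open the file on its own disk, so the job only succeeds after copy-dir has replicated the data directory to the slaves.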
So I am confused about the problem. As far as I know, the driver program on the master node loads the CSV files, creates an RDD, and sends separate tasks together with the RDD to each of the slave nodes. That would mean the slave nodes don't need to know about the original CSV files, because they receive the RDD from the master node. If that is true, why should I have to send the CSV files to each of the slave nodes? Also, if I send the CSV files to the slave nodes, extra disk storage on the slave nodes is going to be used. Does that mean Apache Spark is a costly tool for parallel computing? I would appreciate it if someone could help me with these two questions.
Yes, you have to make the data available to all nodes. However, each node will try its best to load only the data it is concerned with (its partition), and you can tune the level of parallelism to best fit your task. There are many ways to make the data available to all nodes besides copying it onto each node's file system. Consider using a distributed file system, such as HDFS, or hosting your files in a location accessible to every node, which includes S3 or a file server.
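As an illustration of this answer (the HDFS and S3 paths below are placeholders, not taken from the question), the same summation can point at shared storage instead of node-local disks, and the minPartitions argument of textFile is one knob for tuning the level of parallelism:

from pyspark import SparkContext

sc = SparkContext(appName="SumCSVShared")

# HDFS: each worker reads its own partitions from the DataNodes,
# so nothing needs to be copied onto every node's local file system.
rdd = sc.textFile("hdfs:///user/root/data/*.csv", minPartitions=4)

# S3 works the same way, provided the Hadoop S3 connector and
# credentials are configured on the cluster:
# rdd = sc.textFile("s3a://my-bucket/data/*.csv")

total = (rdd.flatMap(lambda line: line.split(","))
            .filter(lambda x: x.strip() != "")
            .map(float)
            .sum())
print(total)

sc.stop()

The extra disk usage from copying files to every slave goes away with this approach, because each node only pulls the blocks backing its own partitions.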