Spark RDD External Storage
I have written a Python script, sum.py, that sums the numbers in each CSV file in a directory called data. I am going to use Apache Spark on Amazon Web Services (AWS) to parallelize the summation process for each CSV file. I have done the following steps:
- I've created 1 master and 2 slave nodes on AWS.
- I used the bash command
$ scp -r -i my-key-pair.pem my_dir root@host_name
to upload the directory my_dir onto the master node of the AWS cluster. The folder my_dir contains 2 sub-directories, code and data, in which code contains the Python script sum.py, and data contains the CSV files.
- I've logged in to the AWS master node, and from there used the bash command
$ ./spark/copy-dir /my_dir/code/
to send the code directory code containing sum.py to the slave nodes.
- On the AWS master node, I've put the directory data containing the CSV files into HDFS using
$ ./ephemeral-hdfs/bin/hadoop fs -put /root/my_dir/data/
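For context, the question does not show the contents of sum.py, so the sketch below is only a guess at what a per-file summing job of this kind might look like; the input path and the assumption that the CSV files contain plain numeric fields are both hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="SumCSVFiles")

# wholeTextFiles yields (filename, file_contents) pairs, one per CSV file,
# so each file can be summed independently and in parallel.
# NOTE: this is a node-local path; every executor would try to open it on
# its own disk, which is consistent with the error described below.
files = sc.wholeTextFiles("file:///root/my_dir/data")

def sum_file(pair):
    """Sum all numeric fields in one CSV file."""
    name, contents = pair
    total = 0.0
    for line in contents.splitlines():
        for field in line.split(","):
            if field.strip():
                total += float(field)
    return (name, total)

for name, total in files.map(sum_file).collect():
    print(name, total)

sc.stop()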
Now, when I submit the application on the AWS master node with
$ ./spark-submit ~/my_dir/code/sum.py
it shows an error that a worker node cannot find the CSV files. However, after I send the data directory data to the slave nodes using the copy-dir command, everything works perfectly.
So I am confused by this problem. As far as I know, the driver program on the master node loads the CSV files, creates an RDD, and sends separate tasks along with the RDD to each of the slave nodes. That would mean the slave nodes don't need to know about the original CSV files, since they receive the RDD from the master node. If that is true, why should I send the CSV files to each of the slave nodes? Also, if I send the CSV files to the slave nodes, the external disk storage on the slave nodes is going to be used. Does that mean Apache Spark is a costly tool for parallel computing? I would appreciate it if someone could help me with these two questions.
Yes, you have to make the data available to all nodes. However, each node will do its best to load only the data it is concerned with (its partition), and you can tune the level of parallelism to best fit your task. There are many ways to make the data available to all nodes besides copying it onto each node's file system. Consider using a distributed file system, such as HDFS, or hosting your files in a location accessible from every node, which includes S3 or a file server.
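As a rough illustration of that last point, the sketch below reads the data from a shared location instead of a node-local path; the HDFS URI, the S3 bucket name, and the partition count are placeholders and depend on where hadoop fs -put actually placed the data on your cluster:

from pyspark import SparkContext

sc = SparkContext(appName="SumCSVFiles")

# Point at a location every node can reach instead of a node-local path.
# The exact HDFS path (and any S3 bucket) is a placeholder here.
data = sc.textFile("hdfs:///user/root/data", minPartitions=8)
# data = sc.textFile("s3n://my-bucket/data")  # an S3 location works the same way

# Each executor reads only the partitions assigned to it, so nothing has
# to be copied onto every slave's local disk beforehand.
total = (data.flatMap(lambda line: line.split(","))
             .filter(lambda s: s.strip())
             .map(float)
             .sum())
print(total)
sc.stop()

Raising or lowering minPartitions is one way to tune the level of parallelism mentioned above.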