csv - Spark RDD External Storage


I have written a Python script, sum.py, that sums the numbers in each CSV file in a directory named data. I am now going to use Apache Spark on Amazon Web Services (AWS) to parallelize the summation over the CSV files. I have done the following steps:

  1. I've created one master and two slave nodes on AWS.
  2. I used the bash command $ scp -r -i my-key-pair.pem my_dir root@host_name to upload the directory my_dir onto the AWS cluster master node. The folder my_dir contains two sub-directories, code and data: code contains my Python script sum.py, and data contains all the CSV files.
  3. I've logged in to the AWS master node, and from there used the bash command $ ./spark/copy-dir /my_dir/code/ to send the code directory, which contains sum.py, to all the slave nodes.
  4. On the AWS master node, I've also put the directory data, containing all the CSV files, into HDFS using $ ./ephemeral-hdfs/bin/hadoop fs -put /root/my_dir/data/.

Now when I submit my application on the AWS master node with $ ./spark-submit ~/my_dir/code/sum.py, it shows an error saying that the worker nodes cannot find the CSV files. However, after I send the data directory to all the slave nodes using the copy-dir command, everything works perfectly.

So I am confused by this problem. As far as I know, the driver program on the master node loads the CSV files, creates an RDD, and sends separate tasks together with the RDD to each of the slave nodes. That would mean the slave nodes don't need to know about the original CSV files, since they just receive the RDD from the master node. If this is true, why should I send all my CSV files to each of the slave nodes? Also, if I send all my CSV files to the slave nodes, external disk storage on the slave nodes is going to be used. Does that mean Apache Spark is a costly tool for parallel computing? I would appreciate it if anyone could help me with these two questions.

Yes, you have to make the data available to all nodes. However, each node will try its best to load only the data it is concerned with (its partition), and you can tune the level of parallelism to best fit your task. There are many ways to make the data available to all nodes besides copying it onto each node's file system. Consider using a distributed file system, such as HDFS, or hosting the files in a location accessible from every node, which includes S3 or a file server.
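As a rough illustration of the point above, here is a minimal PySpark sketch of per-file summation that reads from a shared location instead of a local path. It is not the asker's sum.py: the function names, the assumption that every CSV field is numeric with no header row, and the placeholder URI in the usage note are all hypothetical. The point it tries to show is that the driver only passes a path to the executors, so the path has to resolve on every node.

```python
def sum_csv_line(line):
    """Sum the numeric fields of one CSV line; empty fields are skipped."""
    return sum(float(field) for field in line.split(",") if field.strip())


def sum_file_text(text):
    """Sum every number in the full text of one CSV file."""
    return sum(sum_csv_line(line) for line in text.splitlines())


def sum_each_csv(sc, path, min_partitions=None):
    """Return a dict mapping each file under `path` to the sum of its numbers.

    `sc` is an existing pyspark SparkContext. `path` should be a location
    that every node can read (for example an hdfs:// or s3a:// URI), not a
    directory that exists only on the master node.
    """
    # wholeTextFiles yields (file name, file content) pairs. It is lazy:
    # the driver does not load the files here. The executors read their
    # own partitions when the action (collectAsMap) runs.
    files = sc.wholeTextFiles(path, minPartitions=min_partitions)
    return files.mapValues(sum_file_text).collectAsMap()
```

Inside a script submitted with spark-submit, this could be called as sum_each_csv(sc, "hdfs://<namenode>/<path-to>/data"), where the URI is a placeholder for wherever the data directory actually landed in HDFS. The min_partitions argument is one way to tune the level of parallelism mentioned above. Because wholeTextFiles reads each file as a single record, it suits many small files; for very large files, sc.textFile with a per-line map may be a better fit.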

