Spark & Scala - NullPointerException in RDD traversal


I have a number of CSV files and need to combine them into RDDs based on part of their filenames.

For example, given the files below:

$ ls
20140101_1.csv  20140101_2.csv  20140101_3.csv  20140201_1.csv
20140201_2.csv  20140201_3.csv  20140301_1.csv  20140301_3.csv

I need to combine the files whose names match 20140101*.csv into one RDD to work on, and so on for the other dates.

I am using sc.wholeTextFiles to read the entire directory, then grouping the filenames by their pattern to form a comma-separated string of filenames. I then pass that string to sc.textFile to open the files as a single RDD.

This is the code I have:

val files = sc.wholeTextFiles("*.csv")
val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
val data = indexed_files.groupByKey

data.map { a =>
  var name = a._2.mkString(",")
  (a._1, name)
}

data.foreach { a =>
  var file = sc.textFile(a._2)
  println(file.count)
}

I get a SparkException - NullPointerException when I try to call textFile. The error stack refers to an iterator inside the RDD. I am not able to understand the error:

15/07/21 15:37:37 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 65.0 failed 4 times, most recent failure: Lost task 1.3 in stage 65.0 (TID 115, 10.132.8.10): java.lang.NullPointerException
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:32)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)

However, when I run sc.textFile(data.first._2).count in the Spark shell, I am able to form the RDD and retrieve the count.

Any help is appreciated.

Converting my comment into an answer:

var file = sc.textFile(a._2)

inside the foreach of an RDD isn't going to work. You can't nest RDDs like that: the SparkContext only exists on the driver, so when that closure is shipped to the executors, sc is null there, which is exactly the NullPointerException you are seeing.
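A minimal sketch of one way around this, assuming the same SparkContext and grouping as in the question: collect the (prefix, file-list) pairs back to the driver and call sc.textFile there, relying on the fact that textFile accepts a comma-separated list of paths.

// Sketch only: same grouping as the question, but the final loop runs on the
// driver (after collect), where calling sc.textFile is legal.
val files = sc.wholeTextFiles("*.csv")
val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
val data = indexed_files.groupByKey
                        .map { a => (a._1, a._2.mkString(",")) }

// collect() brings the small (prefix, "file1,file2,...") pairs to the driver,
// so each sc.textFile call below creates a separate RDD per date prefix.
data.collect.foreach { case (prefix, fileList) =>
  val rdd = sc.textFile(fileList)
  println(s"$prefix -> ${rdd.count}")
}

This only works because the grouped filename strings are small enough to collect; the actual file contents stay distributed in the RDDs created by sc.textFile.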

