scala - Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY -


this question duality between dataframe , rdd when comes aggregation operations. in spark sql 1 can use table generating udfs custom aggregations creating 1 of typically noticeably less user-friendly using aggregation functions available rdds, if table output not required.

is there efficient way apply pair rdd operations such aggregatebykey dataframe has been grouped using group or ordered using ordered by?

normally, 1 need explicit map step create key-value tuples, e.g., dataframe.rdd.map(row => (row.getstring(row.fieldindex("category")), row).aggregatebykey(...). can avoided?

not really. while dataframes can converted rdds , vice versa relatively complex operation , methods dataframe.groupby don't have same semantics counterparts on rdd.

the closest thing can a new dataset api introduced in spark 1.6.0. provides closer integration dataframes , groupeddataset class own set of methods including reduce, cogroup or mapgroups:

case class record(id: long, key: string, value: double)  val df = sc.parallelize(seq(     (1l, "foo", 3.0), (2l, "bar", 5.6),     (3l, "foo", -1.0), (4l, "bar", 10.0) )).todf("id", "key", "value")  val ds = df.as[record] ds.groupby($"key").reduce((x, y) => if (x.id < y.id) x else y).show  // +-----+-----------+ // |   _1|         _2| // +-----+-----------+ // |[bar]|[2,bar,5.6]| // |[foo]|[1,foo,3.0]| // +-----+-----------+ 

in specific cases possible leverage orderable semantics group , process data using structs or arrays. you'll find example in spark dataframe: select first row of each group


Comments

Popular posts from this blog

c - Bitwise operation with (signed) enum value -

xslt - Unnest parent nodes by child node -

python - Healpy: From Data to Healpix map -