scala - Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY
This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table-generating UDFs for custom aggregations, but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame that has been grouped using GROUP BY or ordered using ORDER BY?
Normally, one would need an explicit map step to create key-value tuples, e.g., dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
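For concreteness, a fleshed-out version of that explicit step might look like the following. This is a minimal sketch, assuming a DataFrame named dataFrame with a string "category" column and a numeric "value" column, and computing a per-category average rather than keeping whole rows:

val categoryAvg = dataFrame.rdd
  .map(row => (row.getString(row.fieldIndex("category")),
               row.getDouble(row.fieldIndex("value"))))
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the (sum, count) accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)     // merge partial accumulators across partitions
  )
  .mapValues { case (sum, count) => sum / count }  // average per category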
Not really. While DataFrames can be converted to RDDs and vice versa, that is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDD.
The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides a much closer integration with DataFrames and a GroupedDataset class with its own set of methods, including reduce, cogroup, and mapGroups:
case class Record(id: Long, key: String, value: Double)

val df = sc.parallelize(Seq(
  (1L, "foo", 3.0), (2L, "bar", 5.6),
  (3L, "foo", -1.0), (4L, "bar", 10.0)
)).toDF("id", "key", "value")

val ds = df.as[Record]

ds.groupBy($"key").reduce((x, y) => if (x.id < y.id) x else y).show

// +-----+-----------+
// |   _1|         _2|
// +-----+-----------+
// |[bar]|[2,bar,5.6]|
// |[foo]|[1,foo,3.0]|
// +-----+-----------+
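Since reduce is only one of the methods mentioned, here is a sketch of mapGroups on the same Dataset, assuming the typed groupBy(_.key) overload and the 1.6-era mapGroups((K, Iterator[V]) => U) signature. It sums the values per key; the output shown is what the sample data above should produce:

ds.groupBy(_.key).mapGroups {
  (key, records) => (key, records.map(_.value).sum)  // materialize each group and sum its values
}.show

// +---+----+
// | _1|  _2|
// +---+----+
// |bar|15.6|
// |foo| 2.0|
// +---+----+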
In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in Spark DataFrame: select the first row of each group.