Spark SQL equivalent of QUALIFY + ROW_NUMBER statements


Does anyone know the best way in Apache Spark SQL to achieve the same results as the standard SQL QUALIFY() + RNK or ROW_NUMBER statements?

For example:

  • I have a Spark DataFrame called statement_data with 12 monthly records for each of 100 unique account_numbers, therefore 1,200 records in total
  • Each monthly record has a field called "statement_date" that can be used for determining the most recent record

I want the final result to be a new Spark DataFrame with the 3 most recent records (as determined by statement_date, descending) for each of the 100 unique account_numbers, therefore 300 final records in total.

In standard Teradata SQL, I can do the following:

select * from statement_data qualify row_number() over (partition by acct_id order by statement_date desc) <= 3

Apache Spark SQL does not have a standalone QUALIFY function that I'm aware of; maybe I'm screwing up the syntax, or I can't find documentation that QUALIFY exists.

It is fine if I need to do this in 2 steps, as long as those 2 steps are:

  • A select query, or an alternative method, to assign a rank/row number to each account_number's records
  • A select query where I'm selecting the records with rank <= 3 (i.e. choose the 1st, 2nd, and 3rd most recent records).
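The two steps above can be sketched in plain Python, without Spark, to make the intended rank-then-filter logic concrete. The function name `top_n_per_group` and the sample records are hypothetical illustrations, not part of any Spark API:

```python
from itertools import groupby
from operator import itemgetter

def top_n_per_group(records, key, order_by, n=3):
    """Mimic ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_by DESC) <= n:
    keep the n most recent records for each distinct key value."""
    rows = sorted(records, key=itemgetter(key))        # step 1: group rows by key
    result = []
    for _, group in groupby(rows, key=itemgetter(key)):
        ranked = sorted(group, key=itemgetter(order_by), reverse=True)
        result.extend(ranked[:n])                      # step 2: keep rank <= n
    return result

# Hypothetical sample data: 12 monthly statements for each of 2 accounts.
statements = [
    {"acct_id": acct, "statement_date": "2015-%02d-01" % month}
    for acct in ("A-100", "B-200")
    for month in range(1, 13)
]

recent = top_n_per_group(statements, "acct_id", "statement_date", n=3)
print(len(recent))  # 2 accounts x 3 most recent records = 6
```

With 100 accounts and 12 records each, the same logic would reduce 1,200 input records to the desired 300.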

Edit 1 - 7/23 2:09pm: The initial solution provided by zero323 was not working for me in Spark 1.4.1 with the Spark SQL 1.4.1 dependency installed.

Edit 2 - 7/23 3:24pm: It turns out the error was related to using SQLContext objects for the query instead of a HiveContext. I was able to run the solution below correctly after adding the following code to create and use a HiveContext:

final JavaSparkContext sc2;
final HiveContext hc2;
DataFrame df;
hc2 = TestHive$.MODULE$;
sc2 = new JavaSparkContext(hc2.sparkContext());

.... // initial Spark/SQL contexts set up for the DataFrames

SparkConf conf = new SparkConf().setAppName("Statement Test");
...
DataFrame stmtSummary =
    hc2.sql("select * from (select acct_id, stmt_end_dt, stmt_curr_bal, row_number() over (partition by acct_id order by stmt_curr_bal desc) rank_num from stmt_data) tmp where rank_num <= 3");

There is no QUALIFY (it is useful to check the parser source), but you can use a subquery like this:

select * from (
    select *, row_number() over (
        partition by acct_id order by statement_date desc
    ) as rank from df
) tmp where rank <= 3

See Spark : failure: ``union'' expected but `(' found

