Data preprocessing with Apache Spark and Scala


I'm pretty new to Spark and Scala, so I have some questions concerning data preprocessing with Spark and working with RDDs. I'm working on a little project where I want to implement a machine learning system with Spark. Working with the algorithms is OK, I think, but I have problems preprocessing the data. I have a dataset with 30 columns and 1 million rows. For simplicity, let's assume I have the following dataset (CSV file):

columna, columnb, column_txt, label
1      ,        , abc       , 0
2      ,        , abc       , 0
3      , b      , abc       , 1
4      , b      , abc       , 1
5      ,        , abc       , 0
6      ,        , abc       , 0
7      , c      , abc       , 1
8      ,        , abc       , 1
9      , b      , abc       , 1
10     , c      , abc       , 0

After loading the data into Spark, I want to perform the following steps:

  1. Remove all columns whose names end with "_txt".
  2. Filter out rows where columnb is empty (this I have figured out already).
  3. Delete columns that have more than 9 levels (here, columna).

So I have problems with issues 1 and 3. I know I can't remove columns in place; I have to create a new RDD — but how do I do that without those columns? I'm loading the CSV file into Spark without the header, but for these tasks I need it. Is it recommendable to load the header into a separate RDD? And how can I interact with that RDD to find the right columns? Sorry, I know it's a lot of questions, but I'm still at the beginning and trying to learn. Thanks and best regards, Chris

Assuming the data frame is loaded with headers and the structure is flat:

val df = sqlContext.
    read.
    format("com.databricks.spark.csv").
    option("header", "true").
    load("data.csv")
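Your step 2 (which you said you already solved) fits naturally into the same DataFrame pipeline. A minimal sketch, assuming the column is literally named `columnb` and that empty cells come in as empty strings:

```scala
// Keep only rows whose columnb value is neither null nor an empty string.
// The column name "columnb" is taken from the sample data above.
val nonEmpty = df.filter(df("columnb").isNotNull && df("columnb") !== "")
```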

Something like this should work:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

def moreThan9(df: DataFrame, col: String) = {
    df.agg(countDistinct(col)).first()(0) match {
        case x: Long => x > 9L
        case _ => false
    }
}

val newDf = df.
    schema.        // extract schema
    toArray.       // convert to array
    map(_.name).   // map to column names
    foldLeft(df)((df: DataFrame, col: String) => {
        if (col.endsWith("_txt") || moreThan9(df, col)) df.drop(col) else df
    })
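The name-based part of the fold can be sanity-checked without Spark at all; in plain Scala the suffix rule on the sample header behaves like this:

```scala
// Column names from the sample CSV header above.
val names = Seq("columna", "columnb", "column_txt", "label")

// Names the "_txt" suffix rule alone would drop.
val txtCols = names.filter(_.endsWith("_txt"))
// txtCols contains only "column_txt"
```

For the sample data, columna would also be dropped by the distinct-count rule, since it has 10 distinct values.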

If the file is loaded without a header, you can do the same thing using a mapping from the automatically assigned column names to the actual ones.
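A sketch of that, assuming the Databricks CSV package's auto-assigned names (C0, C1, …) and that you know the real column order:

```scala
val raw = sqlContext.
    read.
    format("com.databricks.spark.csv").
    option("header", "false").
    load("data.csv")

// Rename the auto-assigned columns to the actual header names,
// then run the same foldLeft as above on the renamed frame.
val named = raw.toDF("columna", "columnb", "column_txt", "label")
```

This avoids keeping the header in a separate RDD: once the names are attached, `endsWith("_txt")` and `drop(col)` work exactly as in the answer above.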

