scala - How to use regex to include/exclude some input files in sc.textFile?


I have attempted to filter out specific files by date using Apache Spark, inside the file-to-RDD function sc.textFile().

I have attempted the following:

sc.textFile("/user/orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/orders/201507270010033.gz
/user/orders/201507300060052.gz

Any idea how to achieve this?

Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using a glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.

The syntax of the glob includes:

  • * (match 0 or more characters)
  • ? (match a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)
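
To get a feel for how these constructs combine, here is a minimal sketch that translates such a glob into a java.util.regex.Pattern and tests it against candidate paths. GlobSketch, globToRegex and matches are hypothetical names made up for this illustration. This is not Hadoop's implementation: it handles only the constructs listed above, ignores nested edge cases, and assumes that * and ? do not cross a / boundary.

```scala
import java.util.regex.Pattern

// Illustration only: a simplified glob-to-regex translation, NOT Hadoop's code.
object GlobSketch {
  def globToRegex(glob: String): Pattern = {
    val sb = new StringBuilder
    var depth = 0 // how many {...} groups are currently open
    var i = 0
    while (i < glob.length) {
      val c = glob.charAt(i)
      c match {
        case '*' => sb.append("[^/]*") // 0 or more characters within one path component
        case '?' => sb.append("[^/]")  // exactly one character
        case '[' =>
          // [ab], [^ab] and [a-b] use the same syntax in Java regexes: copy verbatim
          val end = glob.indexOf(']', i + 1)
          require(end > i, s"unclosed '[' in glob: $glob")
          sb.append(glob.substring(i, end + 1))
          i = end
        case '{' => depth += 1; sb.append("(?:")
        case '}' if depth > 0 => depth -= 1; sb.append(")")
        case ',' if depth > 0 => sb.append("|") // a comma inside braces separates alternatives
        case '\\' if i + 1 < glob.length =>
          i += 1 // \c: take the next character literally
          sb.append(Pattern.quote(glob.charAt(i).toString))
        case _ => sb.append(Pattern.quote(c.toString))
      }
      i += 1
    }
    Pattern.compile(sb.toString)
  }

  def matches(glob: String, path: String): Boolean =
    globToRegex(glob).matcher(path).matches()

  def main(args: Array[String]): Unit = {
    val glob = "/user/orders/201507{2[7-9],3[0-1]}*"
    Seq(
      "/user/orders/201507270010033.gz",
      "/user/orders/201507300060052.gz",
      "/user/orders/201507260010033.gz"
    ).foreach(p => println(s"$p -> ${matches(glob, p)}"))
  }
}
```

With this sketch, the two file names from the question match the alternation glob, while a file dated 20150726 does not.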

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/orders/2015072[7-9]*,/user/orders/2015073[0-1]*")

It's not clear how the alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/orders/201507{2[7-9],3[0-1]}*")
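
One plausible explanation for why no escaping is needed is that the path list is split only on commas that sit outside curly braces. As far as I can tell that is how Hadoop's FileInputFormat treats a comma-separated path string, but take it as an assumption to verify against your Hadoop version. The sketch below shows that kind of splitting; PathListSketch and splitPaths are hypothetical names, not part of Spark or Hadoop.

```scala
import scala.collection.mutable.ListBuffer

// Illustration only: split a path list on commas that are not inside {...}.
object PathListSketch {
  def splitPaths(paths: String): List[String] = {
    val out = ListBuffer.empty[String]
    val cur = new StringBuilder
    var depth = 0 // number of currently open '{'
    for (ch <- paths) ch match {
      case '{' => depth += 1; cur.append(ch)
      case '}' => depth -= 1; cur.append(ch)
      case ',' if depth == 0 =>
        // a top-level comma ends the current path
        out += cur.toString
        cur.clear()
      case _ => cur.append(ch)
    }
    out += cur.toString
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // Two separate globs, delimited by a top-level comma
    println(splitPaths("/user/orders/2015072[7-9]*,/user/orders/2015073[0-1]*"))
    // One glob: the comma belongs to the {a,b} alternation
    println(splitPaths("/user/orders/201507{2[7-9],3[0-1]}*"))
  }
}
```

Under that assumption the first string yields two paths and the second stays a single glob, so both forms select the same files.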
