scala - How to use regex to include/exclude some input files in sc.textFile?
I have attempted to filter the input down to files for specific dates in Apache Spark, inside the file-to-RDD function sc.textFile().
I have attempted the following:
sc.textFile("/user/orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/orders/201507270010033.gz
/user/orders/201507300060052.gz
Any idea how to achieve this?
Looking at the accepted answer, it seems to use some form of glob syntax rather than a regex. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using a glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The syntax of the glob includes:
*       (match 0 or more characters)
?       (match a single character)
[ab]    (character class)
[^ab]   (negated character class)
[a-b]   (character range)
{a,b}   (alternation)
\c      (escape character)
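To get a feel for these operators without a cluster, here is a minimal sketch that tries a few of them against one of the file names from the question. It uses the JDK's own glob matcher (java.nio.file.PathMatcher) as a stand-in, on the assumption that it behaves like Hadoop's glob for these particular operators. That is only an approximation: for example, the JDK writes negation as [!ab] rather than [^ab], so negation is left out. The object name GlobSyntaxDemo and the helper globMatches are made up for illustration, and the paths assume a Unix-like file system.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobSyntaxDemo {
  // Hypothetical helper: true when `path` matches `glob` under the JDK's glob rules.
  // The JDK matcher is used here as an approximation of Hadoop's glob, not as the real thing.
  def globMatches(glob: String, path: String): Boolean = {
    val matcher = FileSystems.getDefault.getPathMatcher("glob:" + glob)
    matcher.matches(Paths.get(path))
  }

  def main(args: Array[String]): Unit = {
    val file = "/user/orders/201507270010033.gz"
    println(globMatches("/user/orders/201507*", file))            // * : zero or more characters
    println(globMatches("/user/orders/20150727001003?.gz", file)) // ? : exactly one character
    println(globMatches("/user/orders/2015072[7-9]*", file))      // [a-b] : character range
    println(globMatches("/user/orders/201507{27,30}*", file))     // {a,b} : alternation
    println(globMatches("/user/orders/2015073[0-1]*", file))      // range 30-31 only, so day 27 should not match
  }
}
```

Note that none of these patterns use regex constructs such as (a|b) or {1}, which is why the original attempt did not behave as expected.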
Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/orders/2015072[7-9]*,/user/orders/2015073[0-1]*")
It's not clear how the alternation syntax can be used here, since a comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/orders/201507{2[7-9],3[0-1]}*")
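One plausible explanation for why no escaping is needed is that the path list is split only on commas that sit outside curly braces. As far as I understand, Hadoop's FileInputFormat does something along these lines, but treat that as an assumption rather than a confirmed detail. The sketch below is a hypothetical reimplementation of that idea (splitPaths and matchesAny are invented names, not Hadoop or Spark APIs), with the JDK glob matcher again standing in for Hadoop's.

```scala
import java.nio.file.{FileSystems, Paths}
import scala.collection.mutable.ListBuffer

object PathListDemo {
  // Hypothetical helper: split a path list on commas that are NOT inside {...}.
  // This mimics the assumed behaviour; it is not Hadoop's actual code.
  def splitPaths(paths: String): List[String] = {
    val parts = ListBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0 // current nesting level of curly braces
    for (ch <- paths) ch match {
      case '{' =>
        depth += 1
        current.append(ch)
      case '}' =>
        depth = math.max(0, depth - 1)
        current.append(ch)
      case ',' if depth == 0 =>
        // A top-level comma ends the current path.
        parts += current.toString
        current.clear()
      case _ =>
        current.append(ch)
    }
    parts += current.toString
    parts.toList
  }

  // True when `file` matches at least one glob in the comma-separated list.
  def matchesAny(paths: String, file: String): Boolean =
    splitPaths(paths).exists { glob =>
      FileSystems.getDefault.getPathMatcher("glob:" + glob).matches(Paths.get(file))
    }

  def main(args: Array[String]): Unit = {
    val twoPaths = "/user/orders/2015072[7-9]*,/user/orders/2015073[0-1]*"
    val oneGlob  = "/user/orders/201507{2[7-9],3[0-1]}*"
    println(splitPaths(twoPaths)) // two separate globs
    println(splitPaths(oneGlob))  // one glob, the comma stays inside the braces
    println(matchesAny(oneGlob, "/user/orders/201507300060052.gz"))
  }
}
```

Under that assumption, both forms select the same files (days 27 to 31 of July 2015); the brace form is simply more compact.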