apache pig - How to check COUNT of filtered elements in PIG
I have the following data set and need to perform some steps based on each car's company name.
(23,nissan,12.43)
(23,nissan car,16.43)
(23,honda car,13.23)
(23,toyota car,17.0)
(24,honda,45.0)
(24,toyota,12.43)
(24,nissan car,12.43)

a = load 'data.txt' as (code:int, name:chararray, rating:double);
g = group a by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(car)*$',1));
dump g;
I am grouping the cars by code and base company name, so the 'nissan' and 'nissan car' records should end up in one group, and similarly for the others.
/* grouped data based on code and the company's first name */
((23,nissan),{(23,nissan,12.43),(23,nissan car,16.43)})
((23,honda),{(23,honda car,13.23)})
((23,toyota),{(23,toyota car,17.0)})
((24,nissan),{(24,nissan car,12.43)})
((24,honda),{(24,honda,45.0)})
((24,toyota),{(24,toyota,12.43)})
Now I want to filter each group based on whether it contains a tuple whose name equals the group's name. If it does, take that tuple from the group and ignore the others; if no such tuple exists, take all the tuples in the group.
The output should be:
((23,nissan),{(23,nissan,12.43)}) // since the group contains a row with the group's name, i.e. nissan
((23,honda),{(23,honda car,13.23)})
((23,toyota),{(23,toyota car,17.0)})
((24,nissan),{(24,nissan car,12.43)})
((24,honda),{(24,honda,45.0)})
((24,toyota),{(24,toyota,12.43)})

What I have so far (incomplete):

r = foreach g { ow = filter a by name==group.$1; if COUNT(ow) > 0}
Could you please tell me how I can do this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK, let's consider the records below as input.
23,nissan,12.43
23,nissan car,16.43
23,honda car,13.23
23,toyota car,17.0
24,honda,45.0
24,toyota,12.43
25,toyato car,23.8
25,toyato car,17.2
24,nissan car,12.43
For the above input, take the intermediate output below:
((23,honda),{(23,honda,honda car,13.23)})
((23,nissan),{(23,nissan,nissan,12.43),(23,nissan,nissan car,16.43)})
((23,toyota),{(23,toyota,toyota car,17.0)})
((24,honda),{(24,honda,honda,45.0)})
((24,nissan),{(24,nissan,nissan car,12.43)})
((24,toyota),{(24,toyota,toyota,12.43)})
((25,toyato),{(25,toyato,toyato car,23.8),(25,toyato,toyato car,17.2)})
Given the above intermediate output, I assume you are looking for the output below, as per your requirement:
(23,honda,1)
(23,nissan,1)
(23,toyota,1)
(24,honda,1)
(24,nissan,1)
(24,toyota,1)
(25,toyato,2)
Below is the code:
nissan_load = load '/user/cloudera/inputfiles/nissan.txt' using PigStorage(',') as (code:int, name:chararray, rating:double);

-- add the extracted brand name to every tuple
nissan_each = foreach nissan_load generate code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(car)*$',1)) as brand_name, name, rating;

nissan_grp = group nissan_each by (code, brand_name);

nissan_final_each = foreach nissan_grp {
    -- 1 for each tuple whose name is exactly the brand name
    a = foreach nissan_each generate (brand_name == TRIM(name) ? 1 : 0) as cnt;
    b = (int)SUM(a);
    -- 1 for each tuple whose name differs from the brand name
    c = foreach nissan_each generate (brand_name != TRIM(name) ? 1 : 0) as extra_cnt;
    d = SUM(c);
    -- use the matching count when there is one, otherwise the count of the rest
    generate flatten(group) as (code,brand_name), (SUM(a.cnt) != 0 ? b : d) as final_cnt;
};

dump nissan_final_each;
Try the code with different inputs as well.
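If the filtered tuples themselves are needed rather than only their count (as in the expected output in the question), the same idea can probably be extended. The following is an untested sketch, not the code from the answer: it reuses the relation names, input path and comma delimiter from the answer, and the aliases nissan_kept, own, own_cnt and kept are made up for illustration. It relies on brand_name being stored inside every tuple before grouping, so the nested filter only compares fields of the tuple it is looking at.

```pig
-- Untested sketch: keep the tuple(s) whose name equals the brand when any exist,
-- otherwise keep every tuple in the group. The aliases own, own_cnt and kept are illustrative.
nissan_load = load '/user/cloudera/inputfiles/nissan.txt' using PigStorage(',')
              as (code:int, name:chararray, rating:double);

-- Store the extracted brand in each tuple so the nested filter can compare against it.
nissan_each = foreach nissan_load generate code,
              TRIM(REGEX_EXTRACT(name, '(?i)(^.+?\\b)\\s*(car)*$', 1)) as brand_name,
              name, rating;

nissan_grp = group nissan_each by (code, brand_name);

nissan_kept = foreach nissan_grp {
    -- Tuples whose full name is exactly the brand name.
    own = filter nissan_each by brand_name == TRIM(name);
    -- COUNT(own) is the number of filtered tuples; it decides which bag is emitted.
    generate group, COUNT(own) as own_cnt,
             (COUNT(own) > 0 ? own : nissan_each) as kept;
};

dump nissan_kept;
```

For the (23,nissan) group this should leave only the plain nissan row in kept, while groups without such a row keep all their tuples. Note that the kept tuples still carry the extra brand_name field; project it away in a later foreach if the original three-field layout is required.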