apache pig - How to check COUNT of filtered elements in Pig


I have the following data set, on which I need to perform some steps based on the car's company name.

    (23,nissan,12.43)
    (23,nissan car,16.43)
    (23,honda car,13.23)
    (23,toyota car,17.0)
    (24,honda,45.0)
    (24,toyota,12.43)
    (24,nissan car,12.43)

    a = load 'data.txt' as (code:int, name:chararray, rating:double);
    g = group a by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(car)*$',1));
    dump g;

I am grouping the cars by code and base company name, so that the 'nissan' and 'nissan car' records come in one group, and similarly for the others.

    /* grouped data based on code and company's first name */
    ((23,nissan),{(23,nissan,12.43),(23,nissan car,16.43)})
    ((23,honda),{(23,honda car,13.23)})
    ((23,toyota),{(23,toyota car,17.0)})
    ((24,nissan),{(24,nissan car,12.43)})
    ((24,honda),{(24,honda,45.0)})
    ((24,toyota),{(24,toyota,12.43)})

Now I want to filter each group based on whether it contains a tuple whose name is exactly the group's name. If it does, take only that tuple from the group and ignore the others; if no such tuple exists, take all the tuples in the group.

The output should be:

    ((23,nissan),{(23,nissan,12.43)})  // since the group contains a row with the group's name, i.e. nissan
    ((23,honda),{(23,honda car,13.23)})
    ((23,toyota),{(23,toyota car,17.0)})
    ((24,nissan),{(24,nissan car,12.43)})
    ((24,honda),{(24,honda,45.0)})
    ((24,toyota),{(24,toyota,12.43)})

    r = foreach g { ow = filter a by name==group.$1; if count(ow) > 0}

Could you please tell me how I can do this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?

OK. Let's consider the records below as input.

    23,nissan,12.43
    23,nissan car,16.43
    23,honda car,13.23
    23,toyota car,17.0
    24,honda,45.0
    24,toyota,12.43
    25,toyato car,23.8
    25,toyato car,17.2
    24,nissan car,12.43

For the above input, suppose the following is the intermediate output:

    ((23,honda),{(23,honda,honda car,13.23)})
    ((23,nissan),{(23,nissan,nissan,12.43),(23,nissan,nissan car,16.43)})
    ((23,toyota),{(23,toyota,toyota car,17.0)})
    ((24,honda),{(24,honda,honda,45.0)})
    ((24,nissan),{(24,nissan,nissan car,12.43)})
    ((24,toyota),{(24,toyota,toyota,12.43)})
    ((25,toyato),{(25,toyato,toyato car,23.8),(25,toyato,toyato car,17.2)})

Given the above intermediate output, I assume you are looking for the output below, as per your requirement.

    (23,honda,1)
    (23,nissan,1)
    (23,toyota,1)
    (24,honda,1)
    (24,nissan,1)
    (24,toyota,1)
    (25,toyato,2)

Below is the code.

    nissan_load = load '/user/cloudera/inputfiles/nissan.txt' using PigStorage(',') as (code:int, name:chararray, rating:double);

    -- derive the brand name (the name without a trailing "car") for every row
    nissan_each = foreach nissan_load generate code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(car)*$',1)) as brand_name, name, rating;

    nissan_grp = group nissan_each by (code, brand_name);

    nissan_final_each = foreach nissan_grp {
        -- 1 for each row whose name is exactly the brand name
        a = foreach nissan_each generate (brand_name == TRIM(name) ? 1 : 0) as cnt;
        b = (int)SUM(a);

        -- 1 for each row whose name differs from the brand name
        c = foreach nissan_each generate (brand_name != TRIM(name) ? 1 : 0) as extra_cnt;
        d = SUM(c);

        -- if an exact-name row exists, count those rows; otherwise count all the others
        generate flatten(group) as (code, brand_name), (SUM(a.cnt) != 0 ? b : d) as final_cnt;
    };

    dump nissan_final_each;

Try the code with different inputs as well.
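The code above returns only the counts. If the tuples themselves are also needed, as in the expected output of the question, something like the following might work. It is an untested sketch that reuses the `nissan_each` and `nissan_grp` relations from the code above; the aliases `nissan_pick`, `exact` and `kept` are made up for illustration, and it assumes that a bincond can return either of two bags with the same schema.

```pig
-- Sketch only: assumes nissan_each and nissan_grp are defined as in the code above.
nissan_pick = foreach nissan_grp {
    -- rows whose full name is exactly the brand name, e.g. (23,nissan,nissan,12.43)
    exact = filter nissan_each by brand_name == TRIM(name);
    -- keep only the exact rows when there are any, otherwise keep the whole bag
    generate group, (IsEmpty(exact) ? nissan_each : exact) as kept;
};

dump nissan_pick;
```

Comparing `brand_name` with `name` inside each tuple avoids having to refer to `group.$1` from within the nested `filter`, which is where the attempt in the question got stuck. `COUNT(kept)` on the result should then give the same numbers as `final_cnt`.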

