detect sequence in hive column with lead function -
i'm trying detect sequence in column of hive table. have 3 columns (id, label, index). each id has sequence of labels , index ordering of labels, like
id label index x 1 y 2 x 3 y 4 b x 1 b y 2 b y 3 b y 4 b x 5 b y 6
i want identify if label sequence of x,y,x,y occurs.
thinking of trying lead function accomplish like:
select id, index, label, lead( label, 1) on (partition id order index) l1_fac, lead( label, 2) on (partition id order index) l2_fac, lead( label, 3) on (partition id order index) l3_fac mytable
yields:
id index label l1_fac l2_fac l3_fac 1 x y x y 2 y x y null 3 x y null null 4 y null null null b 1 x y y y b 2 y y y x b 3 y y x y b 4 y x y null b 5 x y null null
where l1(2,3) next label values. check pattern
where label = l2_fac , l1_fac = l3_fac
this work id = a, not id = b label sequence is: x, y, y, y, y, x. don't care 3 y's in row interested went x y x y.
i'm not sure if possible, trying combination of group , partition, not successful.
i answered this question op wanted collect items list , remove repeating items. think want do. extract actual xyxy
sequences , account second example xyxy
occurs, clouded 2 y
s. need collect label
column array using this udaf -- preserve order -- use udf referenced, can use concat_ws
make contents of array string, , lastly, check string occurrence of desired sequence. function instr
spit out location of first occurrence , 0 if never finds string.
query:
add jar /path/to/jars/brickhouse-0.7.1.jar; add jar /path/to/other/jar/duplicates.jar; create temporary function remove_seq_dups 'com.something.removesequentialduplicates'; create temporary function collect 'brickhouse.udf.collect.collectudaf'; select id, label_string, instr('xyxy', label_string) str_flg ( select id, concat_ws('', no_dups) label_string ( select id, remove_seq_dups(label_array) no_dups ( select id, collect(label) label_array db.table group id ) x ) y ) z
output:
id label_string str_flg ============================ xyxy 1 b xyxy 1
a better alternative might collect label
udf, make string, , regex out sequence xyxy
i'm pretty terrible @ regex possibly else can comment intelligently on this.
Comments
Post a Comment