detect sequence in hive column with lead function -


i'm trying detect sequence in column of hive table. have 3 columns (id, label, index). each id has sequence of labels , index ordering of labels, like

id  label   index   x   1   y   2   x   3   y   4 b   x   1 b   y   2 b   y   3 b   y   4 b   x   5 b   y   6 

i want identify if label sequence of x,y,x,y occurs.
thinking of trying lead function accomplish like:

select id, index, label, lead( label, 1) on (partition id order index) l1_fac, lead( label, 2) on (partition id order index) l2_fac, lead( label, 3) on (partition id order index) l3_fac mytable 

yields:

id  index  label  l1_fac  l2_fac  l3_fac  1  x  y  x  y  2  y  x  y  null  3  x  y  null  null  4  y  null  null  null b  1  x  y  y  y b  2  y  y  y  x b  3  y  y  x  y b  4  y  x  y  null b  5  x  y  null  null 

where l1(2,3) next label values. check pattern

where label = l2_fac , l1_fac = l3_fac 

this work id = a, not id = b label sequence is: x, y, y, y, y, x. don't care 3 y's in row interested went x y x y.

i'm not sure if possible, trying combination of group , partition, not successful.

i answered this question op wanted collect items list , remove repeating items. think want do. extract actual xyxy sequences , account second example xyxy occurs, clouded 2 ys. need collect label column array using this udaf -- preserve order -- use udf referenced, can use concat_ws make contents of array string, , lastly, check string occurrence of desired sequence. function instr spit out location of first occurrence , 0 if never finds string.

query:

add jar /path/to/jars/brickhouse-0.7.1.jar; add jar /path/to/other/jar/duplicates.jar;  create temporary function remove_seq_dups 'com.something.removesequentialduplicates'; create temporary function collect 'brickhouse.udf.collect.collectudaf';  select id, label_string, instr('xyxy', label_string) str_flg (   select id, concat_ws('', no_dups) label_string   (     select id, remove_seq_dups(label_array) no_dups     (         select id, collect(label) label_array         db.table         group id ) x         ) y      ) z 

output:

id   label_string    str_flg ============================    xyxy            1 b    xyxy            1 

a better alternative might collect label udf, make string, , regex out sequence xyxy i'm pretty terrible @ regex possibly else can comment intelligently on this.


Comments

Popular posts from this blog

c - Bitwise operation with (signed) enum value -

xslt - Unnest parent nodes by child node -

YouTubePlayerFragment cannot be cast to android.support.v4.app.Fragment -