python - Grouping Data into Clusters Based on DataFrame Columns

- July 15, 2010

I have a DataFrame (df) that resembles the following:

    a    b
    1    2
    1    3
    1    4
    2    5
    4    6
    4    7
    8    9
    9    8

I would like to add a column that identifies the related cluster based upon the values in columns a and b:

    a    b    c
    1    2    A
    1    3    A
    1    4    A
    2    5    A
    3    1    A
    3    2    A
    4    6    A
    4    7    A
    8    9    B
    9    8    B

Note that since 1 (in a) is related to 2 (in b), and 2 (in a) is related to 5 (in b), these are all placed in the same cluster. 8 (in a) is only related to 9 (in b), and is therefore placed in another cluster.

To sum up, how do I define clusters based upon pairwise connections, where the pairs are defined by two columns in a DataFrame?

You can view this as a set consolidation problem (with each row describing a set) or as a connected component problem (with each row describing an edge between two nodes). AFAIK there's no native support for this, although I've considered submitting a PR adding some utility tools.
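To make the connected component view concrete, here is a minimal sketch that treats each (a, b) pair as an edge and merges nodes with a small union-find structure. The function name `cluster_ids` and the numbering of clusters by first appearance are my own assumptions for illustration, not part of pandas or of the question.

```python
def cluster_ids(pairs):
    """Map each node to a cluster number using union-find (disjoint sets)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each pair is an edge: merge the two components it touches.
    for u, v in pairs:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    # Number the clusters in order of first appearance (an arbitrary choice).
    ids = {}
    labels = {}
    for node in list(parent):
        root = find(node)
        labels[node] = ids.setdefault(root, len(ids))
    return labels


# The (a, b) rows from the question, written out as plain tuples.
pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
labels = cluster_ids(pairs)
```

Here the values 1 through 7 end up sharing one cluster number and 8 and 9 share another. With a DataFrame, the pairs could come from `zip(df.a, df.b)`.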

Anyway, you could do something like:

    def consolidate(sets):
        # http://rosettacode.org/wiki/set_consolidation#python:_iterative
        setlist = [s for s in sets if s]
        for i, s1 in enumerate(setlist):
            if s1:
                for s2 in setlist[i+1:]:
                    intersection = s1.intersection(s2)
                    if intersection:
                        # Merge s1 into the later set and keep going with it.
                        s2.update(s1)
                        s1.clear()
                        s1 = s2
        # Emptied sets were merged elsewhere; drop them.
        return [s for s in setlist if s]

    def group_ids(pairs):
        groups = consolidate(map(set, pairs))
        d = {}
        # Sets have no total order, so sort by smallest member for stable ids.
        for i, group in enumerate(sorted(groups, key=min)):
            for elem in group:
                d[elem] = i
        return d

after which we have

    >>> df["c"] = df["a"].replace(group_ids(zip(df.a, df.b)))
    >>> df
       a  b  c
    0  1  2  0
    1  1  3  0
    2  1  4  0
    3  2  5  0
    4  4  6  0
    5  4  7  0
    6  8  9  1
    7  9  8  1

and you can replace the 0s and 1s with whatever you want.
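As one way to do that relabelling, the sketch below turns integer cluster ids into letters like the ones in the question. The dictionary `group_of` is a hand-written stand-in for the clustering result, and `letter_of`, `a_column` and `c_column` are hypothetical names; the letter scheme assumes at most 26 clusters.

```python
from string import ascii_uppercase

# Stand-in for the clustering result: value -> integer cluster id.
group_of = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}

# Map 0 -> "A", 1 -> "B", and so on (assumes at most 26 clusters).
letter_of = {node: ascii_uppercase[g] for node, g in group_of.items()}

# Column a from the question, as a plain list.
a_column = [1, 1, 1, 2, 4, 4, 8, 9]
c_column = [letter_of[a] for a in a_column]
```

With pandas, the same lookup could be applied as `df["c"] = df["a"].map(letter_of)`, which looks up each value once and so avoids any confusion between the original values and the new labels.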
