python - Grouping Data into Clusters Based on DataFrame Columns
I have a DataFrame (df) that resembles the following:
    a  b
    1  2
    1  3
    1  4
    2  5
    4  6
    4  7
    8  9
    9  8
I'd like to add a column that determines the related cluster based upon the values in columns a and b:

    a  b  c
    1  2  a
    1  3  a
    1  4  a
    2  5  a
    3  1  a
    3  2  a
    4  6  a
    4  7  a
    8  9  b
    9  8  b
Note that since 1 (in a) is related to 2 (in b), and 2 (in a) is related to 5 (in b), these are all placed in the same cluster. 8 (in a) is only related to 9 (in b), and is therefore placed in another cluster.
To sum up, how can I define clusters based upon pairwise connections, where the pairs are defined by two columns in a DataFrame?
You can view this as a set consolidation problem (with each row describing a set) or as a connected component problem (with each row describing an edge between two nodes). AFAIK there's no native support for this, although I've considered submitting a PR adding some utility tools.
Anyway, you could do something like:
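    def consolidate(sets):
        # http://rosettacode.org/wiki/set_consolidation#python:_iterative
        # Drop empty sets, then repeatedly merge any two sets that overlap.
        setlist = [s for s in sets if s]
        for i, s1 in enumerate(setlist):
            if s1:
                for s2 in setlist[i+1:]:
                    intersection = s1.intersection(s2)
                    if intersection:
                        # Fold s1 into s2 and carry on with the merged set.
                        s2.update(s1)
                        s1.clear()
                        s1 = s2
        return [s for s in setlist if s]

    def group_ids(pairs):
        groups = consolidate(map(set, pairs))
        d = {}
        # Sets have no total order, so sort groups by their smallest element.
        for i, group in enumerate(sorted(groups, key=min)):
            for elem in group:
                d[elem] = i
        return d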
After which we have

    >>> df["c"] = df["a"].replace(group_ids(zip(df.a, df.b)))
    >>> df
       a  b  c
    0  1  2  0
    1  1  3  0
    2  1  4  0
    3  2  5  0
    4  4  6  0
    5  4  7  0
    6  8  9  1
    7  9  8  1
and you can replace the 0s and 1s with whatever you want.
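If you'd rather take the connected-component view mentioned earlier, here is a minimal sketch using a small union-find in plain Python. The name `connected_component_ids` is made up for illustration (it is not a pandas function), and the sketch assumes each pair is an edge between two hashable node labels.

```python
def connected_component_ids(pairs):
    """Map every node that appears in `pairs` to a component id (0, 1, ...)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in pairs:
        # Each pair is an edge: merge the two components it touches.
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u

    # Number the components in the order their first node was seen.
    root_to_id = {}
    labels = {}
    for node in list(parent):
        root = find(node)
        labels[node] = root_to_id.setdefault(root, len(root_to_id))
    return labels


pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
ids = connected_component_ids(pairs)
print(ids)  # {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1}
```

The resulting dict can be used the same way as the one from `group_ids`, for example `df["c"] = df["a"].map(connected_component_ids(zip(df.a, df.b)))`.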