python - How to select DataFrame columns based on partial matching?


I was struggling this afternoon to find a way of selecting a few columns of my pandas DataFrame by checking for the occurrence of a certain pattern in their name (label?).

I had been looking for something like contains or isin for nd.arrays / pd.Series, but got no luck.

This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

    hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
    df_cln = df[hp]
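As a side note, the two contains calls in that row filter can be folded into a single regular expression with alternation. A minimal sketch, assuming a small made-up DataFrame (the data is invented for illustration; only target_column, hp and df_cln come from the snippet above):

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({
    "target_column": ["some_text here", "keep me", "has other_text", "also kept"],
    "value": [1, 2, 3, 4],
})

# One regex with alternation replaces the two separate contains() calls.
hp = ~df["target_column"].str.contains("some_text|other_text")
df_cln = df[hp]
```

This relies on str.contains treating its pattern as a regex by default, so the | inside the string means "either of these".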

However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (which is an Index), nor to the one returned by df.columns.values (which is an ndarray). It works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a helper list:

    ll = []
    for a in df.columns:
        if a.startswith('start_exp1') or a.startswith('start_exp2'):
            ll.append(a)
    df[ll]

(one can apply any of the str functions, of course)
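The same loop can be written more compactly as a list comprehension. A minimal sketch, assuming a made-up frame whose column names are invented for illustration; it uses the fact that str.startswith accepts a tuple of prefixes:

```python
import pandas as pd

# Hypothetical frame; the column names are invented for illustration.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["start_exp1_a", "start_exp2_b", "other", "end_exp1"],
)

# str.startswith accepts a tuple of prefixes, so one call covers both cases.
prefixes = ("start_exp1", "start_exp2")
ll = [col for col in df.columns if col.startswith(prefixes)]
selected = df[ll]
```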

Then, I found the map function and got it to work with the following code:

    import re
    sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
    df[df.columns[sel]]

Of course, in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.

I am very new to Python and have never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using map - is potentially faster, besides looking more elegant to my untrained eye.
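One way to check the speed question rather than guess is to time both approaches with timeit. A rough sketch, assuming a hypothetical frame with 1000 made-up column names and an invented pattern; the helper names with_loop and with_map are mine, not part of pandas:

```python
import re
import timeit

import pandas as pd

# Hypothetical wide frame: 1000 columns, half of them matching the pattern.
cols = [f"start_exp{i % 2}_{i}" for i in range(1000)]
df = pd.DataFrame([list(range(1000))], columns=cols)
pattern = re.compile(r"^start_exp1")


def with_loop():
    # Plain Python iteration over the column labels.
    return [c for c in df.columns if pattern.search(c)]


def with_map():
    # Index.map builds a boolean mask, which then indexes the columns.
    mask = df.columns.map(lambda c: bool(pattern.search(c)))
    return df.columns[mask]


# Both approaches pick the same columns.
assert list(with_map()) == with_loop()

# Timings depend on the machine and the pandas version, so none are quoted.
print("loop:", timeit.timeit(with_loop, number=100))
print("map: ", timeit.timeit(with_map, number=100))
```

Both versions call the regex once per column from Python code, so I would not expect a dramatic difference either way; for a handful of columns, readability is probably the better criterion.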

I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would greatly appreciate it if you could correct any mistakes I have made in the code and point me in the right direction.

Thanks, Michele

Edit: I just found the Index method Index.to_series(), which returns - ehm - a Series to which I can apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

    In [1]: df
    Out[1]:
       x  y  z
    0  0  0  0
    1  1  1  1
    2  2  2  2
    3  3  3  3
    4  4  4  4
    5  5  5  5
    6  6  6  6
    7  7  7  7
    8  8  8  8
    9  9  9  9

    In [2]: df.columns.to_series().str.contains('x')
    Out[2]:
    x     True
    y    False
    z    False
    dtype: bool

    In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
    Out[3]:
       x
    0  0
    1  1
    2  2
    3  3
    4  4
    5  5
    6  6
    7  7
    8  8
    9  9

Update: I just read your last paragraph. According to the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex')).
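For readers on a more recent pandas, two shorter routes may be worth trying: the Index itself exposes a .str accessor, and DataFrame.filter takes a regex argument. A minimal sketch, assuming a reasonably recent pandas version and a made-up frame whose column names and pattern are invented for illustration:

```python
import pandas as pd

# Hypothetical frame; the column names are invented for illustration.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["start_exp1_a", "start_exp2_b", "other", "end_exp1"],
)

# Index.str mirrors Series.str, so no to_series() conversion is needed.
mask = df.columns.str.contains(r"^start_exp[12]")
by_mask = df.loc[:, mask]

# DataFrame.filter keeps the columns whose label matches the regex.
by_filter = df.filter(regex=r"^start_exp[12]")
```

Both select the same two columns here; filter is the more compact of the two, while the boolean mask is handy when you want to combine several conditions.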

