python - How to select DataFrame columns based on partial matching?
I have been struggling all afternoon to find a way of selecting a few columns of my pandas DataFrame by checking for the occurrence of a pattern in their name (label?).
I had been looking for something like contains or isin for nd.arrays / pd.Series, but got no luck.
This frustrated me quite a bit, since I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:

    hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
    df_cln = df[hp]
However, no matter how much I banged my head, I could not apply .str.contains() to the object returned by df.columns (an Index), nor to the one returned by df.columns.values (an ndarray). It works fine on what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
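For readers following along, here is a minimal sketch of how to check which kind of object each of these expressions returns. The frame and its column names are hypothetical, made up only for illustration; they are not from my real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are made up for illustration.
df = pd.DataFrame({"start_exp1_a": [1, 2], "start_exp2_b": [3, 4], "other": ["x", "y"]})

# df.columns is a pandas Index, df.columns.values a NumPy ndarray,
# and selecting a single column gives a pandas Series.
print(isinstance(df.columns, pd.Index))
print(isinstance(df.columns.values, np.ndarray))
print(isinstance(df["other"], pd.Series))

# A Series of strings exposes the vectorized .str accessor.
mask = df["other"].str.contains("x")
print(mask.tolist())
```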
My first solution involved a for loop and the creation of a helper list:

    ll = []
    for a in df.columns:
        if a.startswith('start_exp1') or a.startswith('start_exp2'):
            ll.append(a)
    df[ll]

(One can apply any of the str functions, of course.)
Then I found the map function and got it to work with the following code:

    import re
    sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
    df[df.columns[sel]]
Of course, in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
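A minimal sketch of that idea, the loop-based approach combined with a regex, might look like the following. The frame, the column names and the pattern are hypothetical placeholders, and the loop is written as a list comprehension purely for brevity.

```python
import re
import pandas as pd

# Hypothetical frame and pattern, for illustration only.
df = pd.DataFrame({"start_exp1_a": [1], "start_exp2_b": [2], "other": [3]})
pattern = re.compile(r"^start_exp[12]")

# Each label yielded by iterating df.columns is a plain str here,
# so the compiled regex can be applied to it directly.
selected = [name for name in df.columns if pattern.search(name)]
df_sel = df[selected]
print(list(df_sel.columns))
```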
I am very new to Python and have never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method, using map, is potentially faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives there would be. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.
Thanks, Michele
EDIT: I just found the Index method Index.to_series(), which returns, ehm, a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.
Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

    In [1]: df
    Out[1]:
       x  y  z
    0  0  0  0
    1  1  1  1
    2  2  2  2
    3  3  3  3
    4  4  4  4
    5  5  5  5
    6  6  6  6
    7  7  7  7
    8  8  8  8
    9  9  9  9

    In [2]: df.columns.to_series().str.contains('x')
    Out[2]:
    x     True
    y    False
    z    False
    dtype: bool

    In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
    Out[3]:
       x
    0  0
    1  1
    2  2
    3  3
    4  4
    5  5
    6  6
    7  7
    8  8
    9  9

UPDATE: I just read your last paragraph. From the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex')).
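As a rough sketch of what that could look like for column selection, assuming a hypothetical frame and pattern (neither comes from the question):

```python
import pandas as pd

# Hypothetical frame and pattern, for illustration only.
df = pd.DataFrame({"start_exp1_a": [1], "start_exp2_b": [2], "other_start": [3]})

# str.contains treats the pattern as a regex by default (regex=True),
# so anchors such as ^ and character classes such as [12] work.
mask = df.columns.to_series().str.contains(r"^start_exp[12]")

# Convert the boolean Series to a plain array before indexing the Index.
df_sel = df[df.columns[mask.to_numpy()]]
print(list(df_sel.columns))
```

The anchor keeps other_start out of the selection, which a plain substring match on 'start' would not do.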