bioinformatics - Python: remove highly similar strings from a dataset


I have a genomic dataset containing base information, like this:

position samp1 samp2 samp2 samp3 samp4 samp5 samp6 ...
posa t t t t t t t ...
posb g g g ...
posc g g g g g g g ...
...

This file has 100000+ lines, and each line contains 200 bases from 200 samples.
I want to remove positions that have highly similar bases across all samples. For example, the positions in the picture below are 100% the same, so one of them should be removed.
[image: similar positions]

We define the similarity ratio as (number of matching bases) / (sequence length):

posh c c c c c c c c
posi a c c c a c c c

The similarity of posh and posi is 6 / 8 = 75%. A similarity ratio above 99% is regarded as highly similar, and one of the similar positions should be removed.

How can I do this efficiently in Python? Thank you.

Given the similarity of 6/8 between posh and posi, it looks like you want the inverse of the normalized Hamming distance (i.e. 1 - d, where d is the fraction of positions that differ).

You can compute the inverse normalized Hamming distance between two sequences using:

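    def inverse_hamming_distance(a, b):
        # Pair up the bases position by position (zip stops at the shorter input).
        z = list(zip(a, b))
        # Fraction of pairs whose two bases match.
        return sum(e[0] == e[1] for e in z) / len(z)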

This gives:

    >>> inverse_hamming_distance('cccccccc', 'acccaccc')
    0.75

However, you can save CPU cycles by detecting early that two lines are not similar. Given a minimum similarity threshold t, once you have observed more than int(0.5+(1-t)*len(z)) dissimilar items, you don't need to go to the end: you can already tell that the items are not similar.

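    def similar(a, b, t=0.99):
        l = min(len(a), len(b))
        # Largest number of mismatches still compatible with similarity t.
        t = int(0.5 + l * (1 - t))
        n = 0
        for a1, b1 in zip(a, b):
            if a1 != b1:
                n += 1
            if n > t:
                # Too many mismatches already: stop early.
                return False
        return True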

Test:

    >>> similar('cccccccc', 'acccaccc', 0.75)
    True
    >>> similar('cccccccc', 'acccaccc', 0.9)
    False
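To remove one of each pair of similar positions, the early-exit test can be applied across the whole file. The following is only a minimal sketch, not a definitive implementation. The names `drop_similar_positions` and `read_rows` are hypothetical, and the reader assumes whitespace-separated columns with a single header line, which may not match the real file. The filter is greedy: it keeps a row only if it is not similar to any row kept so far, so the first position of each similar group survives.

```python
def similar(a, b, t=0.99):
    """Early-exit similarity test, same logic as the function above."""
    l = min(len(a), len(b))
    budget = int(0.5 + l * (1 - t))  # mismatches tolerated before giving up
    n = 0
    for a1, b1 in zip(a, b):
        if a1 != b1:
            n += 1
            if n > budget:
                return False
    return True


def drop_similar_positions(rows, t=0.99):
    """Greedy filter (hypothetical helper).

    rows is an iterable of (position_name, bases) pairs. A row is kept
    only if it is not similar to any row that was kept before it.
    """
    kept = []
    for name, bases in rows:
        if not any(similar(bases, kept_bases, t) for _, kept_bases in kept):
            kept.append((name, bases))
    return kept


def read_rows(path):
    """Hypothetical reader: assumes whitespace-separated columns and one header line."""
    with open(path) as handle:
        next(handle)  # skip the header line
        for line in handle:
            fields = line.split()
            if fields:
                yield fields[0], fields[1:]


rows = [
    ("posh", list("cccccccc")),
    ("posi", list("acccaccc")),
    ("posj", list("cccccccc")),
]
result = drop_similar_positions(rows, t=0.99)
names = [name for name, _ in result]  # posj duplicates posh and is dropped
```

With 200 samples and t=0.99 the mismatch budget is int(0.5 + 200*0.01) = 2, so most comparisons between unrelated positions stop at the third mismatch. In the worst case each row is still compared against every kept row, so the number of comparisons grows quadratically if few positions are removed.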
