bioinformatics - Python: remove highly similar strings from a dataset
I have a genomic dataset containing base information, like this:
position samp1 samp2 samp2 samp3 samp4 samp5 samp6 ...
posa t t t t t t t ...
posb g g g ...
posc g g g g g g g ...
...
This file has 100,000+ lines; each line contains 200 bases, one for each of 200 samples.
I want to remove positions that have highly similar bases across all samples. For example, when two positions are 100% the same, remove one of them.
We define the similarity ratio as (number of matching bases) / (sequence length):
posh c c c c c c c c
posi a c c c a c c c
The similarity of posh and posi is 6 / 8 = 75%. A similarity ratio above 99% is regarded as highly similar, and one of the similar positions should be removed.
How can I do this efficiently in Python? Thank you.
With a similarity of 6/8 between posh and posi, it looks like you want the inverse of the normalized Hamming distance (i.e. 1 - d, where d is the fraction of positions that differ).
You can compute the inverse normalized Hamming distance between two sequences using:
    def inverse_hamming_distance(a, b):
        # Pair up the bases position by position
        z = list(zip(a, b))
        # Fraction of positions where both sequences agree
        return sum(e[0] == e[1] for e in z) / len(z)
This gives:
    >>> inverse_hamming_distance('cccccccc', 'acccaccc')
    0.75
However, you can save CPU cycles by detecting early that two lines are not similar. Given a minimum similarity threshold t, once you observe more than int(0.5+(1-t)*len(z)) dissimilar items, you don't need to go to the end: you can already tell that the items are not similar.
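    def similar(a, b, t=0.99):
        l = min(len(a), len(b))
        # Largest number of mismatches still allowed at threshold t
        max_diff = int(0.5 + l*(1 - t))
        n = 0
        for a1, b1 in zip(a, b):
            if a1 != b1:
                n += 1
                # Stop early: too many mismatches to be similar
                if n > max_diff:
                    return False
        return True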
Test:
    >>> similar('cccccccc', 'acccaccc', 0.75)
    True
    >>> similar('cccccccc', 'acccaccc', 0.9)
    False
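To actually remove positions from the dataset, one possible approach is a greedy filter: keep a position only if it is not similar to any position already kept. The sketch below is an assumption about how you might wire this up, not a tuned solution. The helper name drop_similar_positions and the toy rows are made up for illustration, and similar is repeated so the block runs on its own. In the worst case this still compares every pair of lines, so the early exit in similar is what keeps it bearable on 100,000+ lines.

```python
def similar(a, b, t=0.99):
    """Return True unless a and b differ in more positions than threshold t allows."""
    max_diff = int(0.5 + min(len(a), len(b)) * (1 - t))
    n = 0
    for a1, b1 in zip(a, b):
        if a1 != b1:
            n += 1
            if n > max_diff:
                return False  # early exit: too many mismatches
    return True


def drop_similar_positions(rows, t=0.99):
    """Hypothetical helper: greedy filter over (name, bases) pairs.

    A row is kept only if it is not similar to any row kept so far,
    so the first member of each group of similar rows survives.
    """
    kept = []
    for name, bases in rows:
        if not any(similar(bases, kept_bases, t) for _, kept_bases in kept):
            kept.append((name, bases))
    return kept


# Made-up toy rows (8 samples each), not taken from the real file
rows = [
    ("p1", "tttttttt"),
    ("p2", "cccccccc"),
    ("p3", "tttttttt"),  # identical to p1, so it is dropped
    ("p4", "acccaccc"),  # only 75% similar to p2, so it is kept
]
result = drop_similar_positions(rows)
print([name for name, _ in result])
```

Which row of a similar group survives depends on the input order, so sort the rows first if you care which one is kept.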