Bash: limit text files to a certain word length, but keep complete sentences
I have a corpus of text files that I need to copy, limiting each file to the same word length while maintaining complete sentences. Treating any punctuation within the set {.?!} as a sentence boundary is acceptable. I could do this in Python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot the target word length by a few words and then trim the result to the last sentence boundary.

I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to use word counts, and the man page for wc does not indicate a way to split a file.

Context: I am working on a text classification task with machine learning (using Weka, for the record). I want to make sure that text length (which varies in my data) is not influencing the outcomes too much. To do this, I am trying to normalize text lengths before I perform feature extraction.
Let's consider this test file:

$ cat file
exist? program. therefore, am!

Suppose that we want to truncate the file to complete sentences of 20 characters or fewer:

$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
exist?

If we want 30 characters or fewer:

$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
exist? program.

How it works
-v n=20
This sets the awk variable n to the maximum length that we want (not counting the file's final newline character).

-v RS='[.?!]'
This sets the awk record separator, RS, to any of the three characters that you mentioned.

if (length(s $0 RT)>n) exit; else s=s $0 RT
For each record in the file (a record being a sentence), this tests whether adding the record to s would make the output too long. If it would, awk exits. If not, the record is added to s. In awk, $0 represents the complete record and RT is the record separator that awk found at the end of this record.

END{print s;}
Before awk exits, this prints the string s.
Alternative 1: Truncating based on number of words

Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:

$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
exist? program. therefore,

The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.
Alternative 2: Whole sentences limited to a number of words

$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
exist? program.

Mac OSX
The above sets the record separator, RS, to a regular expression, and it also uses RT, which is a gawk extension. This may require GNU awk (gawk). The OSX man page for awk does not say whether this feature is supported or not. @bebop, however, reports that the above code can be run on OSX after installing gawk from MacPorts.
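If installing gawk is not an option, one possible workaround is to avoid RS and RT altogether and find sentence boundaries with match(), substr(), and split(), which are standard awk functions. The following is only a sketch under that assumption and has not been checked against the awk shipped with OSX. The function names and the directory arguments are hypothetical. Unlike the gawk version, it drops any trailing text that does not end in one of .?!.

```shell
#!/usr/bin/env bash
# Hypothetical helper: truncate_words_portable MAX [FILE...]
# Whole sentences limited to MAX words, without a regex RS or RT.
truncate_words_portable() {
    local n=$1
    shift
    awk -v n="$n" '
        { text = text $0 "\n" }     # collect the whole input first
        END {
            # Peel off one sentence at a time, up to the next . ? or !
            while (match(text, /[.?!]/)) {
                sentence = substr(text, 1, RSTART)
                # Count words in the sentence, excluding its terminator.
                c += split(substr(text, 1, RSTART - 1), words)
                if (c > n)
                    break
                out = out sentence
                text = substr(text, RSTART + 1)
            }
            print out
        }
    ' "$@"
}

# Hypothetical helper: copy every .txt file from SRC to DST, truncated.
truncate_corpus() {
    local max=$1 src=$2 dst=$3 f
    mkdir -p "$dst"
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue     # skip if the glob matched nothing
        truncate_words_portable "$max" "$f" > "$dst/$(basename "$f")"
    done
}

# Made-up sample input.
printf 'One two. Three four five? Six!\n' | truncate_words_portable 5   # prints: One two. Three four five?
```

A call such as truncate_corpus 200 corpus trimmed would then write a truncated copy of each file, and wc -w on the files in the destination directory gives a quick check that every copy is within the limit.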