bash - Limit text files to a certain word length, but keep complete sentences


I have a corpus of text files that I need to copy, limiting each file to the same word length while maintaining complete sentences. Treating any punctuation within {.?!} as a sentence boundary is acceptable. I could do this in Python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot the target word length by a few words, and then trim the result to the last sentence boundary.

I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to use word counts, and the man page for wc does not indicate a way to split a file.

Context: I am working on a text classification task with machine learning (using Weka, for the record). I want to make sure that text length (which varies in my data) is not influencing the outcomes too much. To do this, I am trying to normalize the text lengths before I perform feature extraction.

Let's consider this test file:

$ cat file
Do I exist? I program. Therefore, I am!

Suppose that we want to truncate that file to complete sentences of 20 characters or fewer:

$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist?

If we want 30 characters or fewer:

$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.

How it works

  • -v n=20

    This sets the awk variable n to the maximum length that we want (not counting the file's final newline character).

  • -v RS='[.?!]'

    This sets the awk record separator, RS, to any one of the three characters mentioned.

  • if (length(s $0 RT)>n) exit; else s=s $0 RT

    For each record in the file (a record being a sentence), test whether adding it to s would make the output too long. If it would, exit. If not, add it to s.

    In awk, $0 represents the complete record, and RT (a GNU awk variable) is the record separator that awk found at the end of this record.

  • END{print s;}

    Before we exit, this prints the string s.
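To watch the splitting described above, here is a minimal sketch that prints each record next to the separator that ended it. It assumes GNU awk is installed as gawk, and show_records is a hypothetical helper name, not something from the answer.

```shell
# Hypothetical helper: print every record together with the separator (RT)
# that terminated it. Assumes GNU awk is available as `gawk`, since RT and
# a regular-expression RS are gawk features.
show_records() {
    gawk -v RS='[.?!]' '{ printf "record %d: [%s] RT=[%s]\n", NR, $0, RT }'
}

# Example call on a sample input, only attempted when gawk is present.
if command -v gawk >/dev/null 2>&1; then
    printf 'Do I exist? I program.' | show_records
fi
```

Each sentence should show up as a record without its closing punctuation, with that punctuation mark in RT, which is why the one-liner appends both $0 and RT to s.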

Alternative 1: Truncating based on number of words

Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:

$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
Do I exist? I program. Therefore,

The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.

Alternative 2: Whole sentences limited by number of words

$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
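Since the question is about a whole corpus, the command above could be wrapped in a loop. This is only a sketch under stated assumptions: trim_corpus is a hypothetical helper name, the files are assumed to end in .txt, the source and destination directories are passed as arguments, and GNU awk is assumed to be installed as gawk.

```shell
# Hypothetical helper: copy every .txt file from a source directory into a
# destination directory, keeping whole sentences up to a word limit.
# Usage: trim_corpus SRC_DIR DST_DIR WORD_LIMIT
# Assumes GNU awk is available as `gawk` (regex RS and RT are used).
trim_corpus() {
    src=$1
    dst=$2
    limit=$3
    mkdir -p "$dst"
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue    # skip the unexpanded glob if no files match
        # c accumulates the word count (NF) of each sentence; stop before
        # the sentence that would push the total past the limit.
        gawk -v n="$limit" -v RS='[.?!]' \
            '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' \
            "$f" > "$dst/$(basename "$f")"
    done
}
```

A call such as trim_corpus corpus trimmed 200 would leave the originals untouched and write the shortened copies under trimmed/. Both directory names and the limit are placeholders.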

Mac OSX

The above sets the record separator, RS, to a regular expression. This may require GNU awk (gawk). The OSX man page for awk does not say whether this feature is supported or not. @bebop, however, reports that the above code can be run on OSX after installing gawk from MacPorts.
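If installing gawk is not an option, the same idea can be approximated with features that POSIX awk provides (match, substr, split) instead of a regular-expression RS and RT. The following is a sketch rather than a drop-in replacement: truncate_sentences is a hypothetical helper name, it reads the whole input into memory, and any trailing text without a closing . ? or ! is dropped.

```shell
# Hypothetical helper: keep whole sentences from stdin until adding the next
# one would exceed the word limit given as $1.
# Uses only POSIX awk features, so no regex RS and no RT.
truncate_sentences() {
    awk -v n="$1" '
        # Collect the whole input into one string, restoring line breaks.
        { text = text (NR > 1 ? "\n" : "") $0 }
        END {
            out = ""
            count = 0
            # Find the next sentence terminator, one sentence per iteration.
            while (match(text, /[.?!]/)) {
                # Count words in the sentence body, without its terminator.
                count += split(substr(text, 1, RSTART - 1), words, " ")
                if (count > n) break
                # Keep the sentence including its terminator.
                out = out substr(text, 1, RSTART)
                text = substr(text, RSTART + 1)
            }
            print out
        }'
}
```

It would be called as truncate_sentences 6 < file, mirroring the word-limited variant of the gawk command.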

