bash - Limit text files to a certain word length, but keep complete sentences
I have a corpus of text files that I need to copy, limiting each file to the same word length while maintaining complete sentences. Treating any punctuation within {.?!} as a sentence boundary is acceptable. I could use Python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot the target word length by a few words and then trim the result to the last sentence boundary.
I am familiar with head and wc, but I can't come up with a way to combine the two. The man file for head does not indicate a way to use word counts, and the man file for wc does not indicate a way to split a file.
Context: I am working on a text classification task with machine learning (using Weka, for the record). I want to make sure that text length (which varies in my data) is not influencing the outcomes too much. To do this, I am trying to normalize text lengths before I perform feature extraction.
Let's consider this test file:
$ cat file
exist? program. therefore, am!
Suppose that we want to truncate the file to complete sentences of 20 characters or fewer:
$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
exist?
If we want 30 characters or fewer:
$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
exist? program.
How it works
-v n=20
This sets the awk variable n to the maximum length that we want (not counting the file's final newline character).
-v RS='[.?!]'
This sets the awk record separator, RS, to any of the three characters mentioned.
if (length(s $0 RT)>n) exit; else s=s $0 RT
For each record in the file (a record being a sentence), this tests whether adding the record to s would make the output too long. If it would, awk exits. If not, the record is added to s. In awk, $0 represents the complete record and RT is the record separator that awk found at the end of the record.
END{print s;}
Before exiting, awk prints the string s.
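The one-liner may be easier to follow when spread over several lines. The following is only a sketch of the same logic wrapped in a shell function. It assumes GNU awk is installed as gawk, and the function name truncate_chars is made up for illustration.

```shell
# Keep whole sentences while the accumulated text stays within n characters.
# Assumes GNU awk (gawk): RT is a gawk extension.
truncate_chars() {  # usage: truncate_chars MAX_CHARS [FILE...]
  local n=$1
  shift
  gawk -v n="$n" -v RS='[.?!]' '
    {
      candidate = s $0 RT              # text so far + this sentence + its punctuation
      if (length(candidate) > n) exit  # too long: stop reading (END still runs)
      s = candidate                    # still fits: keep it
    }
    END { print s }                    # print whatever was kept
  ' "$@"
}

# Example (hypothetical file name): truncate_chars 20 sample.txt
```

With a file containing the single line One two. Three four? Five!, a limit of 20 should keep One two. Three four? and a limit of 10 should keep only One two.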
Alternative 1: Truncating based on number of words
Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:
$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
exist? program. therefore,
The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.
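If gawk is not at hand, a plain word cut can probably be approximated with standard tools, including head. This is a sketch, not a drop-in replacement: the function name first_words is made up, and unlike the awk version it collapses every run of whitespace into a single space instead of preserving the original spacing.

```shell
# Print the first n whitespace-separated words of the input on one line.
first_words() {  # usage: first_words MAX_WORDS [FILE...]
  local n=$1
  shift
  cat -- "$@" |
    tr -s '[:space:]' '\n' |   # one word per line
    head -n "$n" |             # keep the first n words
    paste -sd ' ' -            # join them back with single spaces
}

# Example (hypothetical file name): first_words 6 sample.txt
```

If the input starts with whitespace, tr emits an empty first line that head counts as a word, so leading blanks may need to be stripped first.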
Alternative 2: Whole sentences limited by number of words
$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
exist? program.
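In Alternative 2, c accumulates NF, the number of whitespace-separated fields (words) in each sentence, and awk exits as soon as the running total would exceed n. Below is a sketch of how this might be applied to a whole corpus. It assumes gawk is installed. The function names and the .txt file pattern are hypothetical.

```shell
# Keep whole sentences while the running word count stays within n.
# Assumes GNU awk (gawk): RT is a gawk extension.
limit_sentences() {  # usage: limit_sentences MAX_WORDS [FILE...]
  local n=$1
  shift
  gawk -v n="$n" -v RS='[.?!]' '
    {
      c += NF            # NF = number of words in this sentence
      if (c > n) exit    # this sentence would exceed the word budget
      s = s $0 RT        # keep the sentence and its punctuation
    }
    END { print s }
  ' "$@"
}

# Write a trimmed copy of every .txt file in SRC_DIR to DEST_DIR.
trim_corpus() {  # usage: trim_corpus MAX_WORDS SRC_DIR DEST_DIR
  local f
  mkdir -p "$3"
  for f in "$2"/*.txt; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    limit_sentences "$1" "$f" > "$3/$(basename "$f")"
  done
}

# Example (hypothetical directories): trim_corpus 200 corpus trimmed
```

A file whose first sentence already exceeds the limit ends up as an empty line, so it may be worth checking for such files afterwards.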
Mac OSX
The above sets the record separator, RS, to a regular expression. This may require GNU awk (gawk). The OSX man page for awk does not say whether this feature is supported or not. @bebop, however, reports that the above code can be run on OSX after installing gawk via MacPorts.
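If installing gawk is not an option, the sentence-limited version can probably be written without RT or a regular-expression RS. This sketch reads the whole file into one string and walks it with match(). It is meant to use only POSIX awk features but has not been verified against the OSX awk, and the function name is made up. Unlike the gawk version, it drops any trailing text that does not end in one of the three punctuation characters.

```shell
# Keep whole sentences while the running word count stays within n,
# using only POSIX awk features (no RT, no regex RS).
limit_sentences_posix() {  # usage: limit_sentences_posix MAX_WORDS [FILE...]
  local n=$1
  shift
  awk -v n="$n" '
    { text = text $0 "\n" }                       # slurp the input into one string
    END {
      while (match(text, /[.?!]/)) {              # find the next sentence end
        c += split(substr(text, 1, RSTART - 1), words)  # count words before it
        if (c > n) break                          # over budget: stop here
        out = out substr(text, 1, RSTART)         # keep sentence + punctuation
        text = substr(text, RSTART + 1)           # continue after the punctuation
      }
      print out
    }
  ' "$@"
}

# Example (hypothetical file name): limit_sentences_posix 6 sample.txt
```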