Let me go back and revisit this; I seem to have confused myself!
More soon,
Ted
On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <ngram@yahoogroups.com> wrote:
>
>
> Did I misread the documentation then?
>
> "huge-count.pl doesn't consider bigrams at file boundaries
Did I misread the documentation then?
"huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the result of count.pl and huge-count.pl on the same data file will
differ if --newLine is not used, in that, huge-count.pl runs count.pl
on multiple files separately and thus loo
Hi Catherine,
There was one thing I wanted to mention about huge-count.pl. When you give
it a list of files as input, it treats those files as one single big file.
So if your goal is to maintain file boundaries (that is, not letting bigrams
cross file boundaries while still letting them cross newlines within a single file) the
With a really large number of files, one can use find and xargs:
find . -name '*.txt' | xargs cat
Serge
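
A possible variation on the find/xargs pipeline above, offered only as a sketch: if file names may contain spaces, NUL-terminated names (-print0 with xargs -0) keep each name intact, and sort -z gives a repeatable order, since find's own order is unspecified. The scratch directory, the file names, and the combined.out output file are hypothetical and exist only for the demonstration. It also assumes a find, sort and xargs that support -print0, -z and -0 (the GNU and BSD versions do).

```shell
#!/bin/sh
# Sketch: concatenate every .txt file under a directory into one stream,
# coping with many files and with spaces in file names.
set -eu

workdir=$(mktemp -d)   # hypothetical scratch directory for the demo
mkdir "$workdir/corpus"
printf 'first file\n'  > "$workdir/corpus/a one.txt"
printf 'second file\n' > "$workdir/corpus/b.txt"

# -print0 and -0 pass NUL-terminated names, so "a one.txt" stays one argument.
# sort -z fixes the order of the names before they reach cat.
# The output is written outside the searched directory so it is not re-read.
find "$workdir/corpus" -name '*.txt' -print0 \
  | sort -z \
  | xargs -0 cat > "$workdir/combined.out"

cat "$workdir/combined.out"
```

Note that the concatenated result is a single stream, so anything counting bigrams over it would see no file boundaries at all.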
From: ngram@yahoogroups.com on behalf of Ted Pedersen
tpede...@d.umn.edu [ngram]
Sent: 15 April 2018 23:41:36
To: ngram@yahoogroups.com
Subject: Re: [ngram]