Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Let me go back and revisit this again; I seem to have confused myself!

More soon,
Ted

On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram] <ngram@yahoogroups.com> wrote:
> Did I misread the documentation then?
>
> "huge-count.pl doesn't consider bigrams at file boundaries

[ngram] Re: Using huge-count.pl with lots of files

2018-04-16 Thread catherine.dejage...@gmail.com [ngram]
Did I misread the documentation then? "huge-count.pl doesn't consider bigrams at file boundaries. In other words, the result of count.pl and huge-count.pl on the same data file will differ if --newLine is not used, in that, huge-count.pl runs count.pl on multiple files separately and thus loo

Re: [ngram] Using huge-count.pl with lots of files

2018-04-16 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
Hi Catherine,

There was one thing I wanted to mention about huge-count.pl. When you give it a list of files as input, it treats those files as one single big file. So if your goal is to maintain file boundaries (not letting bigrams cross them, while still letting them cross newlines within a single file) the
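To make the file-boundary question concrete, here is a small shell sketch of what it means for a bigram to cross a file boundary. It does not reproduce huge-count.pl or count.pl in any way: count_bigrams is a hypothetical helper that splits on whitespace only, and a.txt and b.txt are made-up file names.

```shell
# count_bigrams (hypothetical helper, not part of NSP): read text on stdin
# and print "count word1 word2" for every pair of adjacent tokens.
# Tokens are split on whitespace only, so bigrams do cross newlines.
count_bigrams() {
  tr -s '[:space:]' '\n' |
    awk 'NF { if (prev != "") c[prev " " $0]++; prev = $0 }
         END { for (b in c) print c[b], b }' |
    sort
}

# Illustrative usage with two made-up files:
#   cat a.txt b.txt | count_bigrams                            # one stream, boundary ignored
#   count_bigrams < a.txt; count_bigrams < b.txt               # each file counted on its own
```

If a.txt contains "the cat" and b.txt contains "sat down", the first form also counts "cat sat", which spans the two files; the second form never sees that pair.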

Re: [ngram] Using huge-count.pl with lots of files

2018-04-16 Thread Serge Sharoff s.shar...@leeds.ac.uk [ngram]
With a really large number of files one can use find and xargs:

    find . -name '*.txt' | xargs cat

Serge

From: ngram@yahoogroups.com on behalf of Ted Pedersen tpede...@d.umn.edu [ngram]
Sent: 15 April 2018 23:41:36
To: ngram@yahoogroups.com
Subject: Re: [ngram]
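The find and xargs suggestion above can be wrapped so that it also copes with file names containing spaces. This is only a sketch: merge_txt is a hypothetical helper name, and it assumes a find and xargs that support -print0 and -0 (GNU and BSD both do).

```shell
# merge_txt DIR OUT (hypothetical helper): concatenate every *.txt file
# found under DIR into the single file OUT.
# -print0 / -0 pass file names NUL-separated, so spaces and newlines in
# names are safe, and xargs batches the names so the shell's argument
# length limit is never hit. Files are joined in whatever order find
# reports them. Keep OUT outside DIR (or give it another extension) so
# the merged file is not picked up as one of its own inputs.
merge_txt() {
  dir=$1
  out=$2
  find "$dir" -type f -name '*.txt' -print0 | xargs -0 cat > "$out"
}

# Illustrative usage (paths are made up):
#   merge_txt ./corpus /tmp/corpus-merged.out
```

Note that merging discards the file boundaries entirely, so it only suits the case where bigrams are allowed to span files.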