Re: [ngram] Re: Using huge-count.pl with lots of files

2018-04-17 Thread Ted Pedersen tpede...@d.umn.edu [ngram]
There isn't a way to make huge-count.pl (or count.pl) case insensitive. It
will take the input pretty much "as is" and use that. So I think you'd
need to lowercase your files before they make it to huge-count.pl. You can
use --token to specify how you tokenize words (for example, whether you
treat don't as three tokens (don ' t) or as one (don't)). --stop lets you
exclude words from being counted, but there isn't anything that lets you
ignore case.
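That lower-casing step is easy to do outside the package. A rough sketch (the file name input1 here is just a made-up example):

```shell
# Make a lower-cased copy of each input file before huge-count.pl
# ever sees it ("input1" is a made-up sample file created here).
printf "Don't Stop Believing\n" > input1
mkdir -p lowercased
for f in input1; do
  tr '[:upper:]' '[:lower:]' < "$f" > "lowercased/$f"
done
cat lowercased/input1
```

You would then point huge-count.pl at the copies in lowercased/ instead of the originals.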

On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen  wrote:

> Hi Catherine,
>
> Here are a few answers to your questions, hopefully.
>
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
>
> The number of files that your system allows is pretty dependent on your
> system and operating system. On Linux you can actually adjust that (if you
> have sudo access) by running
>
> ulimit -s 10
>
> or
>
> ulimit -s unlimited
>
> This raises the stack size limit, and on Linux the kernel derives the
> maximum total length of command-line arguments from that limit (roughly
> a quarter of the stack limit), so a bigger stack allows a longer argument
> list. Raising the limit past the hard limit needs root, though, so
> without sudo access this may not get you very far.
>
> As far as taking multiple outputs from huge-count.pl and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of huge-count.pl output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> huge-count.pl output before merging.
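Removing that first line is a one-liner with tail. A sketch (the file names and counts are made up; the real huge-count.pl output format is a total on line 1 followed by bigram counts):

```shell
# Fake a small huge-count.pl-style output: total on line 1, counts after.
printf '12345\nthe<>end<>3 5 7\n' > part1.output
# Drop the leading total line so huge-merge.pl will accept the file.
tail -n +2 part1.output > part1.stripped
cat part1.stripped
```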
>
> The commands below break down what is happening within
> huge-count.pl. If you run them you can get an idea of the input and
> output expected at each stage...
>
> count.pl --tokenlist input1.out input1
>
> count.pl --tokenlist input2.out input2
>
> huge-sort.pl --keep input1.out
>
> huge-sort.pl --keep input2.out
>
> mkdir output-directory
>
> mv input1.out-sorted output-directory
>
> mv input2.out-sorted output-directory
>
> huge-merge.pl --keep output-directory
>
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same. I'll go through your notes again and see if there are
> other issues to address...and of course if you try something and it does or
> doesn't work I'm very interested in hearing about that...
>
> Cordially,
> Ted
>
>
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen  wrote:
>
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately, so bigrams do not cross file
>> boundaries. Having verified that, I'll get back to your original question.
>> Sorry about the diversion and any confusion that might have caused.
>>
>> More soon,
>> Ted
>>
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen  wrote:
>>
>>> Let me go back and revisit this again, I seem to have confused myself!
>>>
>>> More soon,
>>> Ted
>>>
>>> On Mon, Apr 16, 2018 at 12:55 PM, catherine.dejage...@gmail.com [ngram]
>>>  wrote:
>>>


 Did I misread the documentation then?

 "huge-count.pl doesn't consider bigrams at file boundaries. In other
 words,
 the result of count.pl and huge-count.pl on the same data file will
 differ if --newLine is not used, in that, huge-count.pl runs count.pl
 on multiple files separately and thus looses the track of the bigrams
 on file boundaries. With --window not specified, there will be loss
 of one bigram at each file boundary while its W bigrams with --window
 W."

 I thought that means bigrams won't cross from one file to the next?

 If bigrams don't cross from one file to the next, then I just need to
 run huge-count.pl on smaller inputs, then combine. So if I break
 @filenames into smaller subsets, then call huge-count.pl on the
 subsets, then call huge-merge.pl to combine the counts, I think that
 should work.
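That splitting step could be sketched roughly like this (the file names and chunk size are made up, and the huge-count.pl invocation is left commented out since its exact arguments depend on the options used):

```shell
# Split a long list of file names into fixed-size chunks on disk.
printf '%s\n' a.txt b.txt c.txt d.txt e.txt > all-files.txt
split -l 2 all-files.txt chunk.        # 2 names per chunk file
ls chunk.*
# Hypothetical per-chunk run, one output per chunk, merged afterwards:
#   for c in chunk.*; do huge-count.pl "out-$c" $(cat "$c"); done
```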

 I have a few more questions related to usage:

   - Do you know how many arguments are allowed for huge-count.pl? It
     would be good to know what size chunks I need to split my data into.
     Or if not, then how would I do a try/catch block to catch the error
     "Argument list too long" from the IPC::System::Simple::system call?
   - Is there a case-insensitive way to count bigrams, or would I need
     to convert all the text to lowercase before calling huge-count.pl?
   - Would you consider modifying huge-count.pl so that the user can
     specify the final output filename, instead of it automatically being
     called complete-huge-count.output?

 Thank you,
 Catherine

 

>>>
>>>
>>
>

