date:20090701

[ngram] Ngrams without line break

2009-07-01 Thread mercevg

Dear all, I would like to know if it's possible to get only those ngrams from the corpus that do not contain line breaks. I'll try to explain clearly: if the input text file is first line of text second line And a third line of text Then, with count.pl, we'll get two bigrams containing line breaks:

Re: [ngram] Ngrams without line break

2009-07-01 Thread Ted Pedersen

Greetings Merce, To make sure I understand correctly, it sounds like you *only* want to see those ngrams that contain a line break. For example, if you run count.pl as follows on your test file first line of text second line And a third line of text count.pl test.out test talisker(8): more

[ngram] Re: Ngrams without line break

2009-07-01 Thread mercevg

Dear Ted, In my case, I would like to get all the ngrams except those that cross over the end of line. In your example: the cat is my friend the cat is my friend I don't want to get as ngrams is my and the cat, those having a new line in the middle of it. As you said, by default count.pl

Re: [ngram] Re: Ngrams without line break

2009-07-01 Thread Ted Pedersen

Hi Merce, Ah, now I understand. Fortunately there is a simple answer, I think. count.pl cattest.out cattest --newLine will cause the end of line markers to be respected, so ngrams will NOT cross over them. talisker(56): more cattest.out 7 cat<>is<>2 2 2 my<>friend<>2 2 2 friend<>the<>1 1 1 is<>my<>1 1 1
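To illustrate the difference between letting ngrams cross line ends and keeping them inside a line, here is a minimal Python sketch. It is not count.pl and does not reproduce its tokenization or output format; it splits on whitespace only. The function name `count_ngrams`, the `respect_newlines` parameter, and the three-line layout of the sample text are assumptions made for illustration, since the archive shows the sample flattened onto one line.

```python
from collections import Counter


def count_ngrams(text, n=2, respect_newlines=False):
    """Count whitespace-delimited n-grams in text.

    With respect_newlines=True each line is processed on its own,
    so no n-gram spans a line break. (Hypothetical helper, not part
    of the Ngram Statistics Package.)
    """
    # Either treat every line as a separate segment, or the whole text as one.
    segments = text.splitlines() if respect_newlines else [text]
    counts = Counter()
    for segment in segments:
        tokens = segment.split()
        # Slide a window of n tokens across the segment.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Assumed layout of the sample text from the first message.
sample = "first line of text\nsecond line\nAnd a third line of text\n"

crossing = count_ngrams(sample, respect_newlines=False)
bounded = count_ngrams(sample, respect_newlines=True)

# Bigrams that exist only because they straddle a line break.
print(sorted(set(crossing) - set(bounded)))
```

Under the assumed layout, the two bigrams that disappear in the line-bounded count are exactly the ones that straddle a line end, which is the behaviour the --newLine option is described as providing.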