Dear all,
I would like to know whether it's possible to get only those ngrams that do not
contain line breaks from the corpus. To explain with an example: if the input text file is
first line of text
second line
And a third line of text
then count.pl will give us two bigrams containing line breaks:
Greetings Merce,
To make sure I understand correctly, it sounds like you *only* want to
see those ngrams that contain a line break. For example, if you run
count.pl as follows on your test file
first line of text
second line
And a third line of text
count.pl test.out test
talisker(8): more
Dear Ted,
In my case, I would like to get all the ngrams except those that cross over the
end of line. In your example:
the cat is
my friend the
cat is my friend
I don't want to get "is my" and "the cat" as ngrams where they have a newline
in the middle.
As you said, by default count.pl
Hi Merce,
Ah, now I understand. Fortunately there is a simple answer, I think.
count.pl cattest.out cattest --newLine
will cause the end of line markers to be respected, so ngrams will NOT
cross over them.
talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
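
For anyone who wants to see the idea outside of count.pl, here is a minimal Python sketch of bigram counting that does or does not cross line breaks. It is only an illustration, not how count.pl is implemented: the function name count_bigrams and its respect_newlines parameter are made up for this example, and tokens are split on whitespace, which is an assumption and may differ from count.pl's own tokenization.

```python
from collections import Counter


def count_bigrams(text, respect_newlines=True):
    """Count adjacent token pairs (hypothetical helper, not part of count.pl).

    When respect_newlines is True, each line is treated as its own unit,
    so no bigram pairs the last token of one line with the first of the next.
    """
    # Assumption: plain whitespace tokenization.
    if respect_newlines:
        units = [line.split() for line in text.splitlines()]
    else:
        units = [text.split()]

    counts = Counter()
    for tokens in units:
        # zip pairs each token with its successor inside the same unit only.
        counts.update(zip(tokens, tokens[1:]))
    return counts


text = "the cat is\nmy friend the\ncat is my friend\n"

within = count_bigrams(text, respect_newlines=True)
across = count_bigrams(text, respect_newlines=False)

print("bigrams within lines:", sum(within.values()))
print("bigrams ignoring line breaks:", sum(across.values()))
for (first, second), n in within.items():
    print(f"{first}<>{second}<>{n}")
```

On the cat example, the line-respecting version finds 7 bigrams in total, the same total that heads the cattest.out listing, while the version that ignores line breaks finds 9, because it also counts the "is my" and "the cat" pairs that straddle a newline.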