Hi Merce,
Ah, now I understand. Fortunately there is a simple answer, I think.
count.pl cattest.out cattest --newLine
will cause the end of line markers to be respected, so ngrams will NOT
cross over them.
talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
the<>cat<>1 1 1
So, I believe the --newLine option will do exactly as you require!
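To see the idea outside of count.pl, here is a minimal Python sketch of what respecting line breaks amounts to for bigrams. It is not how count.pl is implemented; the function name bigrams_within_lines and the plain whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter


def bigrams_within_lines(text):
    """Count bigrams one line at a time, so no bigram spans a line break."""
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()  # assumption: plain whitespace tokenization
        # Pair each token with its successor on the same line only.
        counts.update(zip(tokens, tokens[1:]))
    return counts


sample = "the cat is\nmy friend the\ncat is my friend\n"
line_counts = bigrams_within_lines(sample)

print(sum(line_counts.values()))  # total number of bigrams
for (first, second), n in line_counts.most_common():
    print(first, second, n)
```

On the three-line example this gives the same 7 bigrams as the cattest.out listing, with "is my" and "the cat" each counted once because only their within-line occurrences remain.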
Please let me know if there are any other questions or concerns.
Thanks!
Ted
On Wed, Jul 1, 2009 at 1:04 PM, mercevg <merc...@yahoo.es> wrote:
Dear Ted,
In my case, I would like to get all the ngrams except those that cross over
the end of line. In your example:
the cat is
my friend the
cat is my friend
I don't want to get "is my" and "the cat" as ngrams where they have a new
line in the middle.
As you said, by default count.pl simply ignores end of line markers. But
is it possible not to ignore end of line markers?
Thanks a lot!
Mercè
--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
Greetings Merce,
To make sure I understand correctly, it sounds like you *only* want to
see those ngrams that contain a line break. For example, if you run
count.pl as follows on your test file
first line of text
second line
And a third line of text
count.pl test.out test
talisker(8): more test.out
11
line<>of<>2 3 2
of<>text<>2 2 2
line<>And<>1 3 1
And<>a<>1 1 1
a<>third<>1 1 1
second<>line<>1 1 3
third<>line<>1 1 3
first<>line<>1 1 3
text<>second<>1 1 1
You will get the bigrams that cross over the end of line ("text second"
and "line And"), but you also get all the other ngrams too... and so
it sounds to me like you only want the ones that cross over the new
line markers, and nothing else. Is that accurate?
By default count.pl simply ignores end of line markers (the behavior
you see above). So, it's not so much that the ngram includes the new
line; count.pl simply ignores it. So with a file like
the cat is
my friend the
cat is my friend
the 2 occurrences of "the cat" would be considered identical, even
though the second could be thought of as having a new line in the
middle of it (but we essentially ignore that).
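As a rough illustration of that default, here is a small Python sketch that treats the whole file as one stream of tokens. It is only a stand-in for the behavior described above, not count.pl itself; the function name bigrams_ignoring_lines and the whitespace tokenization are assumptions.

```python
from collections import Counter


def bigrams_ignoring_lines(text):
    """Count bigrams over the whole text, treating newlines as plain whitespace."""
    tokens = text.split()  # split() makes no distinction between spaces and newlines
    return Counter(zip(tokens, tokens[1:]))


sample = "the cat is\nmy friend the\ncat is my friend\n"
stream_counts = bigrams_ignoring_lines(sample)

# Both occurrences of "the cat" are counted, including the one split across lines.
print(stream_counts[("the", "cat")])  # 2
```

Here the occurrence of "the cat" that straddles the second and third lines is indistinguishable from the one on the first line.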
So... at the moment at least I'm not sure how to limit the output to
only those ngrams that are made by crossing over a new line
marker. But let me make sure I am understanding things correctly
(so do let me know if I'm wrong) and I'll give this a little more
thought too.
Cordially,
Ted
On Wed, Jul 1, 2009 at 12:15 PM, mercevg <merc...@...> wrote:
Dear all,
I would like to know if it's possible to get ngrams from the corpus that
do not contain line breaks. I'll try to explain clearly: if the input text
file is
first line of text
second line
And a third line of text
Then with count.pl we'll get two bigrams containing line breaks:
text second
line And
Or trigrams:
of text second
text second line
second line And
And so on.
Taking into account these outputs, and after reading the help text, I don't
know if I can change the default count.pl options to get all ngrams from the
corpus except the ngrams containing words placed at the end of one sentence
and words that are at the beginning of the next sentence. That is, ngrams
that do not contain line breaks.
Best wishes,
Mercè
--
Ted Pedersen
http://www.d.umn.edu/~tpederse