[ngram] Re: Ngrams without line break

2009-07-01 Thread mercevg
Dear Ted,

In my case, I would like to get all the ngrams except those that cross over the 
end of line. In your example:

the cat is
my friend the
cat is my friend

I don't want to get "is my" and "the cat" as ngrams when they have a new line
in the middle.

As you said, by default count.pl simply ignores end of line markers. But is it
possible not to ignore end of line markers?

Thanks a lot!
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:

 Greetings Merce,
 
 To make sure I understand correctly, it sounds like you *only* want to
 see those ngrams that contain a line break. For example, if you run
 count.pl as follows on your test file
 
 first line of text
 second line
 And a third line of text
 
 count.pl test.out test
 
 talisker(8): more test.out
 11
 line<>of<>2 3 2
 of<>text<>2 2 2
 line<>And<>1 3 1
 And<>a<>1 1 1
 a<>third<>1 1 1
 second<>line<>1 1 3
 third<>line<>1 1 3
 first<>line<>1 1 3
 text<>second<>1 1 1
 
 You will get the bigrams that cross over the end of line (text second,
 line And), but you also get all the other ngrams too...and so
 it sounds to me like you only want the ones that cross over the new
 line markers, and nothing else. Is that accurate?
 
 By default count.pl simply ignores end of line markers (the behavior
 you see above). So, it's not so much that the ngram includes the new
 line, it simply ignores it. So with a file like
 
 the cat is
 my friend the
 cat is my friend
 
 the 2 occurrences of the cat would be considered identical, even
 though the second could be thought of as having a new line in the
 middle of it (but we essentially ignore that).
 
 So...at the moment at least I'm not sure how to limit the output to
 only those ngrams that are made by crossing over a new line
 marker. But let me make sure I am understanding things correctly
 (so do let me know if I'm wrong) and I'll give this a little more
 thought too.
 
 Cordially,
 Ted
 
 
 On Wed, Jul 1, 2009 at 12:15 PM, mercevg <merc...@...> wrote:
 
 
  Dear all,
 
  I would like to know if it's possible to get ngrams that do not contain line
  breaks from the corpus. I'll try to explain clearly: if the input text file
  is
 
  first line of text
  second line
  And a third line of text
 
  Then, with count.pl we'll get two bigrams containing line breaks:
 
  text second
  line And
 
  Or trigrams:
  of text second
  text second line
  second line And
 
  And so on.
 
  Taking into account these outputs, and after reading the help text, I don't know
  if I can change the default count.pl options to get all ngrams from the corpus
  except the ngrams containing words placed at the end of one sentence and
  words that are at the beginning of the next sentence. That is, ngrams that do
  not contain line breaks.
 
  Best wishes,
  Mercè
 
  
 
 
 
 -- 
 Ted Pedersen
 http://www.d.umn.edu/~tpederse





Re: [ngram] Re: Ngrams without line break

2009-07-01 Thread Ted Pedersen
Hi Merce,

Ah, now I understand. Fortunately there is a simple answer, I think.

count.pl cattest.out cattest --newLine

will cause the end of line markers to be respected, so ngrams will NOT
cross over them.

talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
the<>cat<>1 1 1

So, I believe the --newLine option will do exactly as you require!
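
To make the difference between the two modes concrete, here is a small Python
sketch (not count.pl itself, and not how count.pl is implemented) that counts
bigrams either across the whole text or line by line. The function names
bigram_counts and format_counts are made up for this illustration, and
splitting on whitespace is an assumption that only roughly approximates
count.pl's tokenization. The three numbers printed per bigram are the bigram
count, the number of bigrams with the first word on the left, and the number
of bigrams with the second word on the right.

```python
from collections import Counter


def bigram_counts(text, respect_newlines=False):
    """Count bigrams over whitespace-separated tokens.

    respect_newlines=False: the text is one token stream, so a bigram may
    span a line break (the default behavior described in this thread).
    respect_newlines=True: each line is processed separately, so no bigram
    crosses a line break (the behavior described for --newLine).
    """
    units = text.splitlines() if respect_newlines else [text]
    counts = Counter()
    for unit in units:
        tokens = unit.split()
        # Pair each token with its successor inside the same unit.
        counts.update(zip(tokens, tokens[1:]))
    return counts


def format_counts(counts):
    """Render counts as 'w1<>w2<>n n1 n2' lines, preceded by the total."""
    left, right = Counter(), Counter()
    for (w1, w2), n in counts.items():
        left[w1] += n   # bigrams with w1 in first position
        right[w2] += n  # bigrams with w2 in second position
    lines = [str(sum(counts.values()))]
    # Highest count first; ties are broken alphabetically (an arbitrary choice).
    for (w1, w2), n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{w1}<>{w2}<>{n} {left[w1]} {right[w2]}")
    return lines


if __name__ == "__main__":
    cattest = "the cat is\nmy friend the\ncat is my friend\n"
    print("\n".join(format_counts(bigram_counts(cattest, respect_newlines=True))))
```

On the cattest file this gives the same counts as the listing above, although
the order of tied entries may differ. Subtracting the per-line counter from
the default one (bigram_counts(text) - bigram_counts(text, True)) leaves only
the bigrams that span a line break.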

Please let me know if there are any other questions or concerns.

Thanks!
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse