Re: Word co-occurrences counts

2004-12-28 Thread Andrew Cunningham
Thanks Doug,
This appears to work like a charm.

Doug Cutting wrote:
> Doug Cutting wrote:
>> You could use a custom Similarity implementation for this query,
>> where tf() is the identity function, idf() returns 1.0, etc., so that
>> the final score is the occurrence count.  You'll need to divide by
>> Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to
>> get rid of the lengthNorm() and field boost (if any).
>
> Much simpler would be to build a SpanNearQuery, call getSpans(), then
> loop, counting how many times Spans.next() returns true.
>
> Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Word co-occurrences counts

2004-12-23 Thread Daniel Naber
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:

> 1. To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader)

If you're referring to docFreq(Term t), that will only return the number 
of documents that contain the term, ignoring how often the term occurs in 
these documents.

Regards
 Daniel

-- 
http://www.danielnaber.de
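
As an illustration of the distinction Daniel draws above, here is a minimal plain-Java sketch. It does not call Lucene; the class and method names (TermCountSketch, docFreq, totalOccurrences) and the toy documents are made up for this example. It only shows why a document-frequency count and a total-occurrence count can differ. In Lucene itself, the per-document counts should be reachable by iterating IndexReader.termDocs(term) and summing TermDocs.freq(), but treat that as a pointer to check against your version rather than tested code.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; it does not call Lucene.
// All names here are hypothetical stand-ins.
class TermCountSketch {

    // Number of documents that contain the term at least once
    // (the kind of quantity docFreq(Term) reports).
    static int docFreq(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Total number of occurrences of the term across all documents.
    static int totalOccurrences(List<List<String>> docs, String term) {
        int n = 0;
        for (List<String> doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("dog", "bites", "dog"),
            Arrays.asList("computer", "dog"),
            Arrays.asList("computer", "keyboard"));
        System.out.println(docFreq(docs, "dog"));          // prints 2
        System.out.println(totalOccurrences(docs, "dog")); // prints 3
    }
}
```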




Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Ah, so is it possible to return the number of times a term appears?

Daniel Naber wrote:
> On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
>> 1. To be able to return the number of times the word appears in all
>> the documents (which it looks like lucene can do through IndexReader)
>
> If you're referring to docFreq(Term t), that will only return the number
> of documents that contain the term, ignoring how often the term occurs in
> these documents.
>
> Regards
> Daniel



Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
"computer dog"~50 looks like what I'm after - now is there some way I can
call this and pull out the total number of occurrences, not just the
number of document hits? (say, if computer and dog occur near each other
several times in the same document).

Paul Elschot wrote:
> On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
>> Hi all,
>> I have a curious problem, and initial poking around with Lucene looks
>> like it may only be able to half-handle the problem.
>>
>> The problem requires two abilities:
>> 1. To be able to return the number of times the word appears in all
>> the documents (which it looks like lucene can do through IndexReader)
>> 2. To be able to return the number of word co-occurrences within
>> the document set (ie. How many times does "computer" appear within 50
>> words of "dog")
>>
>> Is the second point possible?
>
> You can use the standard query parser with a query like this:
> "dog computer"~50
> This query is not completely symmetric in the distance computation:
> when computer occurs before dog, the allowed distance is 49, iirc.
>
> There is also a SpanNearQuery for more generalized and flexible
> distance queries, but this is not supported by the query parser,
> so you'll have to construct these queries in your own program code.
>
> In case you have non standard retrieval requirements, eg. you only
> need the number of hits and no further information from the matching
> documents, you may consider using your own HitCollector on the
> lower level search methods.
>
> Regards,
> Paul Elschot


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Andrew Cunningham wrote:
> "computer dog"~50 looks like what I'm after - now is there some way I can
> call this and pull out the total number of occurrences, not just the
> number of document hits? (say, if computer and dog occur near each other
> several times in the same document).

You could use a custom Similarity implementation for this query, where 
tf() is the identity function, idf() returns 1.0, etc., so that the 
final score is the occurrence count.  You'll need to divide by 
Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get 
rid of the lengthNorm() and field boost (if any).

Doug
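
The arithmetic behind that last step can be sketched in plain Java. This is only an illustration under a stated assumption: that with tf() as the identity and the other factors fixed at 1, the score is the occurrence count multiplied by the document's norm. The class name, method name and sample values are hypothetical, and nothing here calls Lucene.

```java
// Plain-Java arithmetic sketch; names and values are hypothetical.
class NormDivisionSketch {

    // Assumes score == occurrenceCount * decodedNorm, so dividing by the
    // decoded norm gives the count back. Rounding guards against small
    // floating-point error in the division.
    static long recoverCount(float score, float decodedNorm) {
        return Math.round((double) score / decodedNorm);
    }

    public static void main(String[] args) {
        float norm = 0.25f;      // stand-in for lengthNorm * field boost
        float score = 3 * norm;  // three occurrences under the assumption above
        System.out.println(recoverCount(score, norm)); // prints 3
    }
}
```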


Re: Word co-occurrences counts

2004-12-23 Thread Doug Cutting
Doug Cutting wrote:
> You could use a custom Similarity implementation for this query, where
> tf() is the identity function, idf() returns 1.0, etc., so that the
> final score is the occurrence count.  You'll need to divide by
> Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to get
> rid of the lengthNorm() and field boost (if any).

Much simpler would be to build a SpanNearQuery, call getSpans(), then 
loop, counting how many times Spans.next() returns true.

Doug
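
To make concrete what "counting co-occurrences within a distance" means, here is a small plain-Java sketch. It is not the Lucene span API: the class CooccurrenceSketch and the method countWithin are hypothetical, and the sketch simply counts every pair of positions of two different words that lie within a maximum distance of each other. The set of spans Lucene actually enumerates for a SpanNearQuery may be defined differently, so counts from the two approaches need not agree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; it does not call Lucene.
class CooccurrenceSketch {

    // Counts pairs (position of a, position of b) whose distance in tokens
    // is at most maxDistance. Assumes a and b are different words.
    static int countWithin(List<String> tokens, String a, String b, int maxDistance) {
        List<Integer> posA = new ArrayList<>();
        List<Integer> posB = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(a)) {
                posA.add(i);
            }
            if (tokens.get(i).equals(b)) {
                posB.add(i);
            }
        }
        int count = 0;
        for (int pa : posA) {
            for (int pb : posB) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("computer", "x", "dog", "y", "computer");
        System.out.println(countWithin(tokens, "computer", "dog", 50)); // prints 2
        System.out.println(countWithin(tokens, "computer", "dog", 1));  // prints 0
    }
}
```

The quadratic pair loop is chosen for clarity, not speed; for a large corpus one would want to merge the two sorted position lists instead.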


Re: Word co-occurrences counts

2004-12-23 Thread Andrew Cunningham
Thanks Doug and all,
I'm intending to use Lucene to grab a lot of word co-occurrence
statistics out of a large corpus to perform word disambiguation.
Lucene's looking like a great option, but I appear to have hit
a snag. Here's my understanding:

1) Create a Similarity implementation, where:
   tf() returns freq
   sloppyFreq, idf, coord return 1 (because we only need freq to score)
2) Perform the query
3) and then:
   word in document count =
     hits.score(k)/Similarity.decodeNorm(reader.norms(contents)[k])
4) A query call such as
   "computer dog"~50
   will return a count of 2 (I assume because the match occurs
   backwards and forwards).

My problem occurs when I have the following in a text file:
   computer ...(some words)... dog ...(some words)... computer
and I duplicate the text file several times over. Performing the above
query will return different phrase counts per document?

Note: I'm just working with some modified demo code at the moment.
Thanks again,
Andrew

Doug Cutting wrote:
> Andrew Cunningham wrote:
>> "computer dog"~50 looks like what I'm after - now is there some way I
>> can call this and pull out the total number of occurrences, not just
>> the number of document hits? (say, if computer and dog occur near each
>> other several times in the same document).
>
> You could use a custom Similarity implementation for this query, where
> tf() is the identity function, idf() returns 1.0, etc., so that the
> final score is the occurrence count.  You'll need to divide by
> Similarity.decodeNorm(indexReader.norms(field)[doc]) at the end to
> get rid of the lengthNorm() and field boost (if any).
>
> Doug


Re: Word co-occurrences counts

2004-12-23 Thread Erik Hatcher
On Dec 24, 2004, at 12:40 AM, Andrew Cunningham wrote:
> 3) and then:
>    word in document count =
>      hits.score(k)/Similarity.decodeNorm(reader.norms(contents)[k])

You should use hits.id(k), not k, as the index to 
reader.norms(contents).

Erik
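
The difference between a hit's rank and its document number can be shown with a tiny plain-Java sketch. The arrays below are hypothetical stand-ins, not Lucene objects: docIds[k] plays the role of hits.id(k), and norms is indexed by document number, as the array returned by reader.norms(...) is.

```java
// Plain-Java illustration only; the arrays stand in for Lucene structures.
class HitIndexSketch {

    // Looks up the norm of the k-th ranked hit by first mapping the rank k
    // to its document number, which is the step the original formula skipped.
    static float normForHit(int[] docIds, float[] norms, int k) {
        return norms[docIds[k]];
    }

    public static void main(String[] args) {
        int[] docIds = {2, 0};               // best hit is document 2, then document 0
        float[] norms = {0.5f, 1.0f, 0.25f}; // indexed by document number
        System.out.println(normForHit(docIds, norms, 0)); // prints 0.25
        System.out.println(norms[0]);                     // prints 0.5, the norm of document 0
    }
}
```

Indexing norms directly with the rank k picks up the norm of whichever document happens to have number k, which would make the recovered counts vary from hit to hit.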


Word co-occurrences counts

2004-12-22 Thread Andrew.Cunningham
Hi all,

I have a curious problem, and initial poking around with Lucene looks
like it may only be able to half-handle the problem.

The problem requires two abilities:

1.  To be able to return the number of times the word appears in all
    the documents (which it looks like Lucene can do through IndexReader)
2.  To be able to return the number of word co-occurrences within
    the document set (i.e., how many times does "computer" appear within
    50 words of "dog")

Is the second point possible?

Thanks all, and happy holidays,

Andrew


Re: Word co-occurrences counts

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
> Hi all,
>
> I have a curious problem, and initial poking around with Lucene looks
> like it may only be able to half-handle the problem.
>
> The problem requires two abilities:
>
> 1. To be able to return the number of times the word appears in all
> the documents (which it looks like lucene can do through IndexReader)
> 2. To be able to return the number of word co-occurrences within
> the document set (ie. How many times does "computer" appear within 50
> words of "dog")
>
> Is the second point possible?

You can use the standard query parser with a query like this:
"dog computer"~50
This query is not completely symmetric in the distance computation:
when computer occurs before dog, the allowed distance is 49, iirc.

There is also a SpanNearQuery for more generalized and flexible
distance queries, but this is not supported by the query parser,
so you'll have to construct these queries in your own program code.

In case you have non-standard retrieval requirements, e.g. you only
need the number of hits and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.

Regards,
Paul Elschot

