Re: Calculate Term Co-occurrence Matrix

2010-08-22 Thread ahmed algohary
Thanks! It is exactly what I need. But isn't there a way to get the
matching score?

For example, that "damaged" co-occurs with "shipment" with a probability of 0.4?
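
For what it's worth, one simple reading of that "probability" is the fraction
of documents that contain both words. Below is a minimal sketch of that
definition in plain Java. It does not use Lucene or the LUCENE-474 package; the
class name, the method name, the crude tokenizer, and the sample documents are
all made up for illustration, and the collocations package may well score pairs
differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of Lucene or the LUCENE-474 patch.
class CooccurrenceSketch {

    // For each unordered pair of distinct terms, returns the fraction of
    // documents that contain both terms: P(a, b) = df(a, b) / N.
    // Keys have the form "termA termB" with termA < termB alphabetically.
    static Map<String, Double> pairProbabilities(List<String> docs) {
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (String doc : docs) {
            // Crude tokenizer: lower-case, split on anything that is not a-z.
            // The sorted set removes repeats within a document and gives
            // every pair a fixed order.
            TreeSet<String> terms = new TreeSet<>();
            for (String tok : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!tok.isEmpty()) {
                    terms.add(tok);
                }
            }
            List<String> sorted = new ArrayList<>(terms);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairCounts.merge(sorted.get(i) + " " + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        // Turn document counts into fractions of the corpus size.
        Map<String, Double> probs = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) docs.size());
        }
        return probs;
    }

    public static void main(String[] args) {
        // Made-up sample corpus.
        List<String> docs = Arrays.asList(
            "damaged shipment arrived",
            "shipment delayed",
            "damaged shipment returned",
            "invoice sent",
            "order shipped");
        System.out.println(pairProbabilities(docs).get("damaged shipment"));
    }
}
```

With the five sample documents, "damaged" and "shipment" appear together in two
of them, so the pair gets 2/5 = 0.4. A conditional probability such as
P(damaged | shipment) would instead divide by the number of documents that
contain "shipment".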


On Sun, Aug 22, 2010 at 5:35 AM, Ivan Provalov  wrote:

> Ahmed,
>
> FYI, I updated the term collocations package I mentioned earlier with a few
> fixes and changes which will make it work for Lucene 3.0.2.  This may help
> your task.
>
> See:
> https://issues.apache.org/jira/browse/LUCENE-474
>
> Thanks,
>
> Ivan Provalov
>
>
> --- On Sat, 8/21/10, Otis Gospodnetic  wrote:
>
> > From: Otis Gospodnetic 
> > Subject: Re: Calculate Term Co-occurrence Matrix
> > To: java-user@lucene.apache.org
> > Date: Saturday, August 21, 2010, 8:05 AM
> > Ahmed,
> >
> > That's what that KPE (link in my previous email, below) will do for you.
> > It's not open source at this time, but that is exactly one of the things
> > it does.  I think the Mahout collocations stuff might work for you, too.
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original Message 
> > > From: ahmed algohary 
> > > To: java-user@lucene.apache.org
> > > Sent: Sat, August 21, 2010 7:20:03 AM
> > > Subject: Re: Calculate Term Co-occurrence Matrix
> > >
> > > > Thanks for all your answers!
> > > >
> > > > It seems I did not make my question clear. I have a text corpus and I
> > > > need to determine the pairs of words that occur together in many
> > > > documents. I need to do that to be able to measure the semantic
> > > > proximity between words. This method is expanded here.
> > > > I hope to find some code that, given a text corpus, generates all the
> > > > word pairs with their probability of occurring together.
> > >
> > >
> > > > On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic <
> > > > otis_gospodne...@yahoo.com> wrote:
> > > >
> > > > > There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs,
> > > > > and a few other things:
> > > > > http://sematext.com/products/key-phrase-extractor/index.html
> > > > >
> > > > > One of the demos that uses news data is at
> > > > > http://sematext.com/demo/kpe/index.html
> > > > >
> > > > > Otis
> > > > >
> > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > > Lucene ecosystem search :: http://search-lucene.com/
> > > >
> > > >
> > > >
> > > > - Original  Message 
> > > > > From: Grant Ingersoll
> > > > > To: java-user@lucene.apache.org
> > > > > Sent: Fri, August 20, 2010 8:52:17 AM
> > > > > Subject: Re: Calculate Term Co-occurrence Matrix
> > > > >
> > > > > You might also be interested in Mahout's collocations package:
> > > > > http://cwiki.apache.org/confluence/display/MAHOUT/Collocations
> > > > >
> > > > > -Grant
> > > > >
> > > > > On Aug 19, 2010, at 11:39 AM, ahmed algohary wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I need to know if there is a Lucene plug-in or a Lucene-based API for
> > > > > > calculating the term co-occurrence matrix for a given text corpus.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > --
> > > > > > Ahmed
> > > > >
> > > > > --
> > > > > Grant Ingersoll
> > > > > http://www.lucidimagination.com/
> > > > >
> > > > > Search the Lucene ecosystem using Solr/Lucene:
> > > > > http://www.lucidimagination.com/search
> > > > >
> > > > >
> > > >  >
> > -
> > > >  > To  unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >  > For  additional commands, e-mail:
> > java-user-h...@lucene.apache.org


Re: Calculate Term Co-occurrence Matrix

2010-08-22 Thread ahmed algohary
I think I got it.

In the CollocationIndexer class, I have added the co-occurrence score to the
index document:

 doc.add(new Field("score", collocation.getScore() + "",
                   Field.Store.YES, Field.Index.NOT_ANALYZED));

Then, in the CollocationSearcher, the scores can be retrieved:

 d.get("score")

Is that correct?


Re: Calculate Term Co-occurrence Matrix

2010-08-22 Thread Ivan Provalov
Ahmed,

Instead, I would use the score coming out of the CollocationSearcher class. I 
changed it a bit to return the LinkedHashMap of collocated terms and their 
scores relative to the term used in the query.  I have attached the new version.

Thanks,

IP
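
The attachment is not reproduced in this thread. Purely as an illustration of
the kind of result described above (a LinkedHashMap of terms and scores in
ranked order), here is a small sketch in plain Java. Everything in it is
hypothetical: the class and method names, the Float score type, and the sample
scores are assumptions, not the CollocationSearcher code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the attached CollocationSearcher code.
class RankedTermsSketch {

    // Copies term scores into a LinkedHashMap whose iteration order is
    // highest score first (ties broken alphabetically by term).
    static LinkedHashMap<String, Float> rankByScore(Map<String, Float> scores) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Float.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap remembers insertion order, so the ranking survives.
        LinkedHashMap<String, Float> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores for terms collocated with "shipment".
        Map<String, Float> scores = new HashMap<>();
        scores.put("damaged", 0.4f);
        scores.put("delayed", 0.1f);
        scores.put("returned", 0.25f);
        for (Map.Entry<String, Float> e : rankByScore(scores).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

A caller can then iterate the map and read off the best-scoring collocated
terms first, without sorting again.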




IndexWriter.deleteDocuments(Query[]) not deleting

2010-08-22 Thread Paul J. Lucas
Hi -

Using Lucene 2.9.3, I'm indexing the metadata in image files.  For each image 
("document" in Lucene), I have 2 additional special fields: "FILE-PATH" 
(containing the full path of the file) and "DIR-PATH" (containing the full path 
of the directory the file is in).

The FILE-PATH Field is created only once like:

    private final Field m_fieldFilePath = new Field(
        "FILE-PATH", "INIT", Field.Store.YES, Field.Index.NOT_ANALYZED
    );

and reused; the DIR-PATH Field is created once per document like:

    new Field(
        "DIR-PATH", file.getParentFile().getAbsolutePath(),
        Field.Store.NO, Field.Index.NOT_ANALYZED
    )

(The DIR-PATH Field is created once per document because it's part of
indexing the rest of the image metadata and isn't a special case like
FILE-PATH.  I don't believe this is relevant to the problem at hand, however.)

If an image file (or an entire directory of image files) gets deleted, I need
to delete it (them) from the index.  When deleting a single image, I could do:

    Term fileTerm = new Term( "FILE-PATH", file.getAbsolutePath() );
    writer.deleteDocuments( new TermQuery( fileTerm ) );

When deleting an entire directory of images, I could do:

    Term dirTerm = new Term( "DIR-PATH", file.getAbsolutePath() );
    writer.deleteDocuments( new TermQuery( dirTerm ) );

However, at the time of deletion, I don't know whether "file" refers to a
single image file or to a directory of image files.  I can't call file.isFile()
or file.isDirectory() because "file" no longer exists (it was deleted).  So to
cover both cases, I do:

    Query[] queries = new Query[]{
        new TermQuery( fileTerm ),
        new TermQuery( dirTerm )
    };
    writer.deleteDocuments( queries );

I have non-Lucene code that monitors the filesystem for changes.  On Mac OS X,
I can only get directory-level change notifications.  So if a file is deleted
from a directory, I get a notification that the directory has changed, and I
delete all the documents in that directory and then re-add them.

However (and here's the problem), the deletes never happen.  If I delete a file
from a directory, the directory appears to be unindexed and reindexed, but a
query for that image file still returns a result.  So it's as if the delete
never happened.

Why not?

Additional information: I create/close a new IndexWriter for the delete.  Even 
if I quit the application, relaunch, and run the query, the result still shows 
up (hence it's not that the current reader isn't seeing the deletion change).

- Paul





Re: IndexWriter.deleteDocuments(Query[]) not deleting

2010-08-22 Thread Erick Erickson
Did you issue a commit on (or close) the IndexWriter after you deleted the
documents?  And I'm assuming that something really weird, like a case change,
didn't happen.  Your NOT_ANALYZED should take care of that at index time, but
are you sure your cases match when you submit your term queries?

An interesting test would be to write out the file names you create your terms
from, and see what happens if you search on those fields, etc.

HTH
Erick



Re: IndexWriter.deleteDocuments(Query[]) not deleting

2010-08-22 Thread Paul J. Lucas
On Aug 22, 2010, at 1:47 PM, Erick Erickson wrote:

> Did you issue a commit (or close) the IndexWriter after you deleted the
> documents?

I originally wrote:

> I create/close a new IndexWriter for the delete.

So the answer is "yes."

> ... are you sure your cases match when you submit your term queries?

Yes.

> An interesting test would be to write out the file names you create your
> terms from, and see what happens if you search on those fields etc

Never mind.  I figured it out.  (Don't you hate it when you can't figure
something out, you write up a detailed question, post it, then go off and figure
it out afterwards?)

The problem was that the directory was being stored in the index like:

    /path/to/file/

(with the trailing slash).  The delete query, however, didn't have the trailing
slash since File.getAbsolutePath() doesn't return trailing file separator
characters.  D'oh!

- Paul
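
A sketch of one way to guard against this kind of mismatch: run every path
through the same normalization before it goes into the index and before it is
used in a delete term. This is plain Java with no Lucene calls; the class and
method names are made up, and it assumes Unix-style paths (a Windows drive root
such as C:\ would need extra care).

```java
import java.io.File;

// Hypothetical helper; the names here are not from the original post.
class PathKeySketch {

    // Strips trailing separators so that "/path/to/file/" and "/path/to/file"
    // produce the same key. A lone root "/" is left untouched.
    static String normalize(String path) {
        int end = path.length();
        while (end > 1
                && (path.charAt(end - 1) == '/' || path.charAt(end - 1) == File.separatorChar)) {
            end--;
        }
        return path.substring(0, end);
    }

    public static void main(String[] args) {
        String indexed = normalize("/path/to/file/"); // value stored at index time
        String deleted = normalize("/path/to/file");  // value used in the delete term
        System.out.println(indexed.equals(deleted));  // prints true
    }
}
```

Because both sides go through normalize(), the indexed DIR-PATH value and the
term built for the delete can no longer differ by a trailing slash.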





Re: IndexWriter.deleteDocuments(Query[]) not deleting

2010-08-22 Thread Erick Erickson
Yep, sure hate it when that happens, which doesn't prevent it
happening to me more often than I'd like :).

Glad you figured it out.

Erick
