Document Clustering

2003-11-11 Thread marc
Hi, does anyone have any sample code/documentation available for doing document-based clustering using Lucene? Thanks, Marc

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi Marc, I'm working on it. Classification and clustering as well. I was planning to do it for nutch.org, but some guys there actually broke some important basic work I had already done, so maybe I will not contribute it there. However, it will be open source and I can notify you if something

Re: Document Clustering

2003-11-11 Thread Eric Jain
I'm working on it. Classification and Clustering as well. Very interesting... if you get something working, please don't forget to notify this list :-) -- Eric Jain - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional

Re: Document Clustering

2003-11-11 Thread Marcel Stör
Hi, as everybody seems to be so excited about it, would someone please be so kind as to explain what document-based clustering is? Regards, Marcel

Re: Document Clustering

2003-11-11 Thread Leo Galambos
Marcel Stör wrote: Hi, as everybody seems to be so excited about it, would someone please be so kind as to explain what document-based clustering is? Hi, they are trying to implement what you can see in the right panel here: http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein They may also

Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic
--- Leo Galambos [EMAIL PROTECTED] wrote: Marcel Stör wrote: Hi, as everybody seems to be so excited about it, would someone please be so kind as to explain what document-based clustering is? AFAIK, document clustering consists of detection of documents with similar content (similar

Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:05, Marcel Stör wrote: As everybody seems to be so excited about it, would someone please be so kind as to explain what document-based clustering is? This mostly means finding documents which are similar in some way(s). The similarity is mostly in the eye of the beholder. In

RE: Document Clustering

2003-11-11 Thread Tate Avery
Categorization typically assigns documents to a node in a pre-defined taxonomy. For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand. -Original

Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:58, Tate Avery wrote: Categorization typically assigns documents to a node in a pre-defined taxonomy. For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi, How is document clustering different/related to text categorization? Clustering: tries to find its own categories and puts matching documents into them. You group all documents with minimal distance together. Classification: you already have categories and samples for them, which help you to match
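The distinction drawn above can be illustrated with a minimal sketch. It is not code from the thread or from any of the projects mentioned: the Jaccard word-set distance, the function names, and the greedy grouping rule are all assumptions picked for brevity. Classification starts from labelled samples, while clustering starts with nothing and lets groups emerge from distances alone.

```python
def jaccard_distance(a, b):
    """Distance between two texts seen as word sets: 0.0 = same words, 1.0 = no overlap."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def classify(doc, labeled_samples):
    """Classification: categories already exist; return the label of the nearest sample."""
    label, _sample = min(labeled_samples, key=lambda pair: jaccard_distance(doc, pair[1]))
    return label


def cluster(docs, max_distance):
    """Clustering: no labels given; greedily join the first group whose seed is close enough.

    This greedy rule is a stand-in for a real clustering algorithm, chosen only
    because it fits in a few lines.
    """
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard_distance(doc, group[0]) <= max_distance:
                group.append(doc)
                break
        else:
            # No existing group is close enough, so this document seeds a new one.
            groups.append([doc])
    return groups
```

With samples such as `[("sports", "football match goal"), ("cooking", "recipe oven bake")]`, `classify` can only ever answer "sports" or "cooking", whereas `cluster` creates as many groups as the distance threshold demands.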

Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic
Thanks for the clarification, Stefan. I should have known that... :) Otis --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, How is document clustering different/related to text categorization? Clustering: tries to find its own categories and puts matching documents into them. You group all

Reopen IndexWriter after delete?

2003-11-11 Thread Wilton, Reece
Hi, A couple questions... 1). If I delete a term using an IndexReader, can I use an existing IndexWriter to write to the index? Or do I need to close and reopen the IndexWriter? 2). Is it safe to call IndexReader.delete(term) while an IndexWriter is writing? Or should I be synchronizing

Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
Classes for indexing PDF and Word files in Lucene. Ernesto. - Original Message - From: Ernesto De Santis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, October 29, 2003 12:04 PM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. Hello all, Thanks very much

RE: Document Clustering

2003-11-11 Thread Marcel Stor
Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: tries to find its own categories and puts matching documents into them. You group all documents with minimal distance together. Would I be correct to say that you have to define a distance
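As a hedged illustration of what "defining a distance" between documents can look like, the sketch below uses cosine distance over raw term-frequency vectors. This is one common choice, not necessarily the measure anyone in the thread had in mind, and the helper names `term_vector` and `cosine_distance` are made up for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 0 means same direction, 1 means no shared terms."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty document shares nothing with anything; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Because the vectors are normalised by length, a long and a short document about the same topic come out close together, which is usually what is wanted when grouping by content.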

Re: Document Clustering

2003-11-11 Thread Joshua O'Madadhain
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote: Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: tries to find its own categories and puts matching documents into them. You group all documents with minimal distance together.

RE: Index pdf files with your content in lucene.

2003-11-11 Thread Wilton, Reece
Some of us have corporate firewalls that are stripping out attachments. If possible, put these on a web site somewhere so we can download them. Thanks! -Original Message- From: Ernesto De Santis [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 11:07 AM To: Lucene Users List

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Marcel Stor wrote: Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: tries to find its own categories and puts matching documents into them. You group all documents with minimal distance together. Would I be correct to say

fuzzy searches

2003-11-11 Thread Thomas Krämer
Hello, now that the topic is clustering methods: has there been any effort in implementing latent semantic indexing in Lucene? Google only indicates someone else asking this in February. Is there an overview of the structure of the Lucene index besides the Javadoc, or any other fast

Re: fuzzy searches

2003-11-11 Thread Gerret Apelt
Thomas Krämer wrote: Is there an overview of the structure of the Lucene index besides the Javadoc, or any other fast access to understanding what happens inside Lucene? You mean something like this?: http://jakarta.apache.org/lucene/docs/fileformats.html cheers, Gerret

Re: fuzzy searches

2003-11-11 Thread Bruce Ritchie
Thomas Krämer wrote: now that the topic is clustering methods: has there been any effort in implementing latent semantic indexing in Lucene? Google only indicates someone else asking this in February. Just a note that LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to

Re: fuzzy searches

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 02:37 PM, Thomas Krämer wrote: Is there an overview of the structure of the Lucene index besides the Javadoc, or any other fast access to understanding what happens inside Lucene? Here is what is inside a Lucene index:

Re: Document Clustering

2003-11-11 Thread maurits van wijland
Hi All and Marc, There is the Carrot project: http://www.cs.put.poznan.pl/dweiss/carrot/ The Carrot system consists of web services that can easily be fed by a Lucene result list. You simply have to create a JSP that creates this XML file and create a custom process and input component. The input

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Really cool stuff!!! maurits van wijland wrote: Hi All and Marc, There is the Carrot project: http://www.cs.put.poznan.pl/dweiss/carrot/ The Carrot system consists of web services that can easily be fed by a Lucene result list. You simply have to create a JSP that creates this XML file and

Re: Reopen IndexWriter after delete?

2003-11-11 Thread Otis Gospodnetic
1). If I delete a term using an IndexReader, can I use an existing IndexWriter to write to the index? Or do I need to close and reopen the IndexWriter? No. You should close IndexWriter first, then open IndexReader, then call delete, then close IndexReader, and then open a new IndexWriter.

Re: Index pdf files with your content in lucene.

2003-11-11 Thread Otis Gospodnetic
Ernesto, it looks like something got stripped. A ZIP file should make it to the list. If not, maybe you can post it somewhere. Could you also tell us a bit about this code? Is it better than existing PDF/Word parsing solutions? Pure Java? Uses POI? Thanks, Otis --- Ernesto De Santis

Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 21:32, maurits van wijland wrote: There is the carrot project : http://www.cs.put.poznan.pl/dweiss/carrot/ Leo Galambos, author of the Egothor project, constantly supports us with fresh ideas and includes Carrot components in his own project!

Re: Document Clustering

2003-11-11 Thread Alex Aw Seat Kiong
Hi! I'm also interested in it. Kindly CC me on the latest progress of your clustering project. Regards, AlexAw - Original Message - From: Eric Jain [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 10:07 PM Subject: Re: Document Clustering I'm

Re: Document Clustering

2003-11-11 Thread marc
Thanks everyone for the responses and links to resources. I was basically thinking of using Lucene to generate document vectors, and writing my custom similarity algorithms for measuring distance. I could then run this data through k-means or SOM algorithms for calculating clusters. Does this
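The pipeline described here (document vectors, a pluggable distance, then k-means) can be sketched roughly as follows. This is an assumption-laden illustration rather than Marc's code: the vectors are built by whitespace tokenisation instead of being read from a Lucene index, `vectorize`, `squared_euclidean`, and `kmeans` are hypothetical names, and the seeds are simply the first `k` vectors so that runs are repeatable.

```python
from collections import Counter


def vectorize(docs):
    """Turn raw texts into dense term-frequency vectors over a shared, sorted vocabulary."""
    vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocabulary])
    return vectors


def squared_euclidean(u, v):
    """Default distance; any function taking two vectors can be passed to kmeans instead."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, distance=squared_euclidean, max_iterations=20):
    """Plain k-means returning one cluster index per input vector."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k vectors as seeds
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centroids would not change either
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignment
```

A call such as `kmeans(vectorize(docs), k=2)` returns a list of cluster indices in document order. Passing a different `distance` function is where a custom similarity measure would plug in, although the mean-based update step is only strictly justified for Euclidean-style distances.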

Can use Lucene be used for this

2003-11-11 Thread Kumar Mettu
Hi, I have a huge data file with 4 GB of data. The data in the file never changes. The format of the file is as follows: Col1,col2,col3,Value abababc,xyzza,c,100 ababadx,xyz,adfdfd,101 I need to retrieve the value with simple queries on the data like:

Re: Can use Lucene be used for this

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 10:00 PM, Kumar Mettu wrote: The format of the file is as follows: Col1,col2,col3,Value abababc,xyzza,c,100 ababadx,xyz,adfdfd,101 I need to retrieve the value with simple queries on the data like: select value where col1

Re: Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
I'll try again, zipping the files. Afterwards I'll post the files on the web site. Could you also tell us a bit about this code? Is it better than existing PDF/Word parsing solutions? Pure Java? Uses POI? This code uses existing parsing solutions. The intent is to make a Lucene Document for indexing PDF and