Document Clustering
Hi, does anyone have any sample code/documentation available for doing document based clustering using lucene? Thanks, Marc
Re: Document Clustering
Hi Marc, I'm working on it, classification and clustering as well. I was planning to do it for nutch.org, but some people there broke up some important basic work I had already done, so I may not contribute it there. However, it will be open source, and I can notify you when something useful is ready. It will take some weeks, though. I am currently working on radically minimizing feature selection. Cheers Stefan marc wrote: Hi, does anyone have any sample code/documentation available for doing document based clustering using lucene? Thanks, Marc -- day time: www.media-style.com spare time: www.text-mining.org | www.weta-group.net - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document Clustering
I'm working on it. Classification and Clustering as well. Very interesting... if you get something working, please don't forget to notify this list :-) -- Eric Jain
Re: Document Clustering
Hi, As everybody seems to be so excited about it, would someone please be so kind as to explain what document based clustering is? Regards, Marcel
Re: Document Clustering
Marcel Stör wrote: Hi, As everybody seems to be so excited about it, would someone please be so kind as to explain what document based clustering is? Hi, they are trying to implement what you can see in the right panel here: http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein They may also analyze identical pages (hits #9 and #10) - this could also be taken as clustering, AFAIK. For instance, Doug wrote some papers about clustering (if I remember correctly) - see his bibliography. Leo
Re: Document Clustering
--- Leo Galambos [EMAIL PROTECTED] wrote: Marcel Stör wrote: Hi, As everybody seems to be so excited about it, would someone please be so kind as to explain what document based clustering is? AFAIK, document clustering consists of detecting documents with similar content (similar subjects/topics). Hi, they are trying to implement what you can see in the right panel here: http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein They may also analyze identical pages (hits #9 and #10) - this could also be taken as clustering, AFAIK. Interesting. For instance, Doug wrote some papers about clustering (if I remember correctly) - see his bibliography. How is document clustering different from/related to text categorization? Thanks, Otis __ Do you Yahoo!? Protect your identity with Yahoo! Mail AddressGuard http://antispam.yahoo.com/whatsnewfree
Re: Document Clustering
On Nov 11, 2003, at 16:05, Marcel Stör wrote: As everybody seems to be so excited about it, would someone please be so kind as to explain what document based clustering is? This mostly means finding documents which are similar in some way(s). The similarity is mostly in the eye of the beholder. In such a world, a cluster would be a pile of documents sharing something. As far as Lucene goes, a straightforward way of approaching this could be to use an entire document's content to query an index. Lucene's result set could then be construed as a document cluster. Admittedly, this is ground zero of document clustering, but here you go anyway :) Here is an illustration: Patterns in Unstructured Data Discovery, Aggregation, and Visualization http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm Cheers, PA.
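As a rough, Lucene-independent sketch of the idea above (treat a whole document's text as the "query" and rank other documents by similarity), here is a hedged example. The class and method names (SimilaritySketch, tf, cosine) are invented for illustration, and plain term-frequency cosine is only one possible scoring choice; Lucene's own scoring would differ in detail, but the shape of the computation is the same.

```java
import java.util.HashMap;
import java.util.Map;

public class SimilaritySketch {

    // Build a term-frequency vector from lowercased, crudely tokenized text.
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) v.merge(t, 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer w = b.get(e.getKey());
            if (w != null) dot += (double) e.getValue() * w;
        }
        for (int w : b.values()) nb += (double) w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Score candidate documents against one "query" document.
        String query = "protein folding structure";
        String[] docs = {
            "protein structure prediction and folding",
            "java web server configuration",
        };
        for (String d : docs) {
            System.out.printf("%.2f %s%n", cosine(tf(query), tf(d)), d);
        }
    }
}
```

Documents whose score exceeds some threshold against the query document would then form one (very naive) cluster.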
RE: Document Clustering
Categorization typically assigns documents to a node in a pre-defined taxonomy. For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand. -Original Message- From: petite_abeille [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 10:50 AM To: Lucene Users List Subject: Re: Document Clustering Hi Otis, On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote: How is document clustering different/related to text categorization? Not that I'm an expert in any of this, but clustering is a much more holistic approach than categorization. Usually, categorization is understood as a more precise endeavor (e.g. dmoz.org), while clustering is much more fuzzy and non-deterministic. Both try to achieve the same goal, though. So perhaps this is just a question of jargon. I'm confident that the owner of this site could help shed some light on the finer points of clustering vs. categorization: http://www.lissus.com/resources/index.htm Cheers, PA.
Re: Document Clustering
On Nov 11, 2003, at 16:58, Tate Avery wrote: Categorization typically assigns documents to a node in a pre-defined taxonomy. For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand. Another way to look at it is this: An attempt to apply the Dewey Decimal system to an orgy. [1] Without a Dewey Decimal system, that is. Cheers, PA. [1] http://www.eod.com/devil/archive/semantic_web.html
Re: Document Clustering
Hi, How is document clustering different/related to text categorization? Clustering: you try to find your own categories and put documents that match into them. You group all documents with minimal distance together. Classification: you already have categories and samples for them, which help you to match other documents. You calculate document distances to the existing categories and put each document in the category with the smallest distance. Cheers Stefan -- day time: www.media-style.com spare time: www.text-mining.org | www.weta-group.net
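Stefan's distinction can be made concrete with a small sketch of the classification half: assign each document to the pre-existing category whose centroid is closest. Everything below (class name, toy 2-d vectors) is invented for illustration, assuming documents have already been reduced to fixed-length feature vectors; clustering would instead discover the centroids from the data.

```java
public class NearestCentroid {

    // Euclidean distance between two equal-length feature vectors.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Classification as described above: assign the document to the
    // known category whose centroid has the smallest distance.
    static int classify(double[] doc, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(doc, centroids[c]) < distance(doc, centroids[best])) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = { {1.0, 0.0}, {0.0, 1.0} }; // two pre-defined categories
        System.out.println(classify(new double[]{0.9, 0.2}, centroids)); // prints 0
    }
}
```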
Re: Document Clustering
Thanks for the clarification, Stefan. I should have known that... :) Otis --- Stefan Groschupf [EMAIL PROTECTED] wrote: [...]
Reopen IndexWriter after delete?
Hi, A couple of questions... 1). If I delete a term using an IndexReader, can I use an existing IndexWriter to write to the index? Or do I need to close and reopen the IndexWriter? 2). Is it safe to call IndexReader.delete(term) while an IndexWriter is writing? Or should I be synchronizing these two tasks so only one occurs at a time? Any help is appreciated! -Reece
Index pdf files with your content in lucene.
Classes for indexing PDF and Word files in lucene. Ernesto. - Original Message - From: Ernesto De Santis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, October 29, 2003 12:04 PM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. Hello all, Thanks very much, Stephan, for your valuable help. Attached you will find the PDFDocument and WordDocument class source code. Ernesto. - Original Message - From: Hartmann, Waehrisch Feykes GmbH [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, October 28, 2003 11:10 AM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. Hi Ernesto, the IndexManager retrieves the list of files in a folder by calling the getFilesInFolder method of CmsObject. This method returns only empty files, i.e. files with empty content. To get the content of a PDF file you have to reread the file: f = cms.readFile(f.getAbsolutePath()); Bye, Stephan On Monday, 27 October 2003 19:18, you wrote: Hello, Thanks for the previous reply. Now I use: - version 1.4 of the lucene search module (the version attached on this list) - the new version of the registry.xml format for the module (as you described to me) - the pdf files stored with the binary type. But I have the next problem: I can't make an InputStream for the cmsfile content. For this I wrote this code in the Document method of my class PDFDocument:

InputStream in = new ByteArrayInputStream(f.getContents()); // f is the CmsFile parameter of the Document method
PDFExtractor extractor = new PDFExtractor(); // PDFExtractor is the lib I use; on the file system it works fine
bodyText = extractor.extractText(in);

Is it correct to use ByteArrayInputStream to make an InputStream for a CmsFile? The error occurs in the third line, in the PDFParcer. The error message in tomcat is: java.io.IOException: Error: Header is corrupt '' at PDFParcer.parse at PDFExtractor.extractText at PDFDocument.Document (my class) at. Bye, and thanks. Ernesto.
- Original Message - From: Hartmann, Waehrisch Feykes GmbH To: [EMAIL PROTECTED] Sent: Friday, October 24, 2003 4:45 AM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. Hello Ernesto, I assume you are using the unpatched version 1.3 of the search module. As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of type plain, not of type binary. PDF files are stored as binary. I suggest using the version I posted yesterday. Then your registry.xml would have to look like this:

<docFactories>
  ...
  <docFactory type="plain" enabled="true">
    ...
  </docFactory>
  <docFactory type="binary" enabled="true">
    <fileType name="pdftext">
      <extension>.pdf</extension>
      <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
    </fileType>
  </docFactory>
  ...
</docFactories>

Important: The type attribute must match the file types of OpenCms (also defined in the registry.xml). Bye, Stephan - Original Message - From: Ernesto De Santis To: Lucene Users List Cc: [EMAIL PROTECTED] Sent: Thursday, October 23, 2003 4:16 PM Subject: [opencms-dev] Index pdf files with your content in lucene. Hello, I am new to opencms and lucene technology. I want to index pdf files, and index the content of these files. I work in this way: make a PDFDocument class like the JspDocument class, use the org.textmining.text.extraction.PDFExtractor class (this class works fine outside the vfs), and write my registry.xml entry for pdf documents in the plainDocFactory tag:

<fileType name="pdftext">
  <extension>.pdf</extension>
  <!-- This will strip tags before processing -->
  <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
</fileType>

My PDFDocument contains this code: I think the problem is how to take the content from the CmsFile - what InputStream should I use? PDFExtractor works with the extractText(InputStream) method.
public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

    public PDFDocument() {
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile) throws CmsException {
        return Document(cmsobject, cmsfile, null);
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap) throws CmsException {
        Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);
        // put the content of the pdf file
        String contenido = new String(cmsfile.getContents());
        StringBufferInputStream in = new StringBufferInputStream(contenido);
        // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());
RE: Document Clustering
Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: you try to find your own categories and put documents that match into them. You group all documents with minimal distance together. Would I be correct to say that you have to define a distance threshold parameter in order to decide when to build a new category for a certain group? Classification: you already have categories and samples for them, which help you to match other documents. You calculate document distances to the existing categories and put each document in the category with the smallest distance. Regards, Marcel
Re: Document Clustering
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote: Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: try to find own categories and put documents that match in it. You group all documents with minimal distance together. Would I be correct to say that you have to define a distance threshold parameter in order to define when to build a new category for a certain group? Depends on the type of clustering algorithm. Some clustering algorithms take the number of clusters as a parameter (in this case the algorithm may be run several times with different values, to determine the best value). Other types of algorithms, such as hierarchical agglomerative clustering algorithms, work more as you suggest. Regards, Joshua O'Madadhain [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightful to be those of any organization.
RE: Index pdf files with your content in lucene.
Some of us have corporate firewalls that are stripping out attachments. If possible, put these on a web site somewhere so we can download them. Thanks! -Original Message- From: Ernesto De Santis [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 11:07 AM To: Lucene Users List Subject: Index pdf files with your content in lucene. Classes for index Pdf and word files in lucene. Ernesto. [...]
Re: Document Clustering
Marcel Stor wrote: Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: you try to find your own categories and put documents that match into them. You group all documents with minimal distance together. Would I be correct to say that you have to define a distance threshold parameter in order to decide when to build a new category for a certain group? I'm not sure. There are different data mining algorithms that could be used; it depends on the algorithm. I prefer support vector machines (SVM). There you calculate distances of multi-dimensional vectors in a multi-dimensional space. One vector represents one document. Stefan
fuzzy searches
Hello, now that the topic is clustering methods: has there been any effort to implement Latent Semantic Indexing in Lucene? Google only turns up someone else asking this in February. Is there an overview of the structure of the Lucene index besides the javadoc, or any other fast way to understand what happens inside Lucene? regards thomas
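For readers unfamiliar with it, Latent Semantic Indexing is usually described as a truncated singular value decomposition of the term-document matrix (this is general background on the technique, not something Lucene provides):

```latex
% A is the m x n term-document matrix (m terms, n documents).
% LSI keeps only the k largest singular values:
A \approx A_k = U_k \Sigma_k V_k^{T}
% Each document j is then represented by the j-th column of
% \Sigma_k V_k^{T}, and documents (or queries folded into the same
% space) are compared by cosine similarity in this k-dimensional
% "latent" space rather than in the raw term space.
```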
Re: fuzzy searches
Thomas Krämer wrote: Is there an overview of the structure of the Lucene index besides the javadoc, or any other fast way to understand what happens inside Lucene? You mean something like this?: http://jakarta.apache.org/lucene/docs/fileformats.html cheers, Gerret
Re: fuzzy searches
Thomas Krämer wrote: now that the topic is clustering methods: has there been any effort to implement Latent Semantic Indexing in Lucene? Just a note: LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make sure that any implementation is either blessed by the patent holders or does not infringe on the patents. Regards, Bruce Ritchie
Re: fuzzy searches
On Tuesday, November 11, 2003, at 02:37 PM, Thomas Krämer wrote: Is there an overview of the structure of the Lucene index besides the javadoc, or any other fast way to understand what happens inside Lucene? Here is what is inside a Lucene index: http://jakarta.apache.org/lucene/docs/fileformats.html
Re: Document Clustering
Hi All and Marc, There is the carrot project: http://www.cs.put.poznan.pl/dweiss/carrot/ The carrot system consists of webservices that can easily be fed by a lucene result list. You simply have to create a JSP that creates this XML file, and create a custom process and input component. The input component for lucene could look like:

<?xml version="1.0" encoding="UTF-8"?>
<service xmlns="http://www.dawidweiss.com/projects/carrot/componentDescriptor"
         framework="Carrot2">
  <component id="carrot2.input.lucene" type="input"
             serviceURL="http://localhost/weblucene/c2.jsp"
             infoURL="http://localhost/weblucene/" />
</service>

The c2.jsp file simply has to translate a result list into an XML file such as:

<searchresult>
  <document id="1">
    <title>...</title>
    <weight>1.0</weight>
    <url>http://...</url>
    <summary>sum 1</summary>
    <snippet>snip 2</snippet>
  </document>
  <document id="2">
    <title>...</title>
    <weight>1.0</weight>
    <url>http://...</url>
    <summary>sum 2</summary>
    <snippet>snip 2</snippet>
  </document>
</searchresult>

Feed this into the carrot system, and you will get a nicely clustered result list. The amazing part of this clustering mechanism is that the cluster labels are incredible; they're great! Then there is an open source project called Classifier4J that can be used for classification, the opposite of clustering. These other open source projects are a great addition to the Lucene system. I hope this helps... Marc, what are you building? Maybe we can help! Kind regards, Maurits - Original Message - From: marc [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 5:15 PM Subject: Document Clustering Hi, does anyone have any sample code/documentation available for doing document based clustering using lucene? Thanks, Marc
Re: Document Clustering
really cool Stuff!!! maurits van wijland wrote: Hi All and Marc, There is the carrot project: http://www.cs.put.poznan.pl/dweiss/carrot/ [...] Stefan -- day time: www.media-style.com spare time: www.text-mining.org | www.weta-group.net
Re: Reopen IndexWriter after delete?
1). If I delete a term using an IndexReader, can I use an existing IndexWriter to write to the index? Or do I need to close and reopen the IndexWriter? No. You should close the IndexWriter first, then open an IndexReader, call delete, close the IndexReader, and then open a new IndexWriter. 2). Is it safe to call IndexReader.delete(term) while an IndexWriter is writing? Or should I be synchronizing these two tasks so only one occurs at a time? No, it is not safe. You should close the IndexWriter, then delete the document and close the IndexReader, and then get a new IndexWriter and continue writing. Incidentally, I just wrote a section about concurrency issues and about locking in Lucene for the upcoming Lucene book. Otis
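The sequence Otis describes, in outline (pseudocode against the single-writer API discussed in this thread; variable names are illustrative only):

```
writer.close();                                  // 1. release the write lock
reader = IndexReader.open(dir);                  // 2. open a reader on the same directory
reader.delete(term);                             // 3. mark matching documents deleted
reader.close();                                  // 4. flush the deletions to disk
writer = new IndexWriter(dir, analyzer, false);  // 5. reopen (no create) and continue adding
```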
Re: Index pdf files with your content in lucene.
Ernesto, it looks like something got stripped. A ZIP file should make it to the list. If not, maybe you can post it somewhere. Could you also tell us a bit about this code? Is it better than existing PDF/Word parsing solutions? Pure Java? Uses POI? Thanks, Otis --- Ernesto De Santis [EMAIL PROTECTED] wrote: Classes for index Pdf and word files in lucene. Ernesto. [...]
Re: Document Clustering
On Nov 11, 2003, at 21:32, maurits van wijland wrote: There is the carrot project : http://www.cs.put.poznan.pl/dweiss/carrot/ Leo Galambos, author of the Egothor project, constantly supports us with fresh ideas and includes Carrot components in his own project! http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en Small world :) PA.
Re: Document Clustering
Hi! I'm also interested in it. Kindly CC me the latest progress of your clustering project. Regards, AlexAw - Original Message - From: Eric Jain [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 10:07 PM Subject: Re: Document Clustering I'm working on it. Classification and Clustering as well. Very interesting... if you get something working, please don't forget to notify this list :-) -- Eric Jain
Re: Document Clustering
Thanks, everyone, for the responses and links to resources. I was basically thinking of using lucene to generate document vectors and writing my own custom similarity algorithms for measuring distance. I could then run this data through k-means or SOM algorithms for calculating clusters. Does this sound like I'm on the right track... I'm still just in the *thinking* stage. Marc - Original Message - From: Alex Aw Seat Kiong [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, November 11, 2003 5:47 PM Subject: Re: Document Clustering [...]
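Marc's plan (document vectors plus a distance measure, fed into k-means) can be sketched without any Lucene dependency. Everything here is illustrative only: the class name, the fixed iteration count, the toy 2-d vectors, and the choice of squared Euclidean distance, which is exactly the piece a custom similarity algorithm would replace.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance; a custom similarity measure could be
    // substituted here, as suggested above.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Plain k-means over document vectors: assign each vector to its
    // nearest centroid, recompute centroids, repeat for a fixed number
    // of iterations. Returns the cluster index of each document.
    static int[] kmeans(double[][] docs, double[][] centroids, int iters) {
        int k = centroids.length, dim = docs[0].length;
        int[] assign = new int[docs.length];
        for (int it = 0; it < iters; it++) {
            for (int d = 0; d < docs.length; d++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(docs[d], centroids[c]) < dist(docs[d], centroids[best])) best = c;
                assign[d] = best;
            }
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int d = 0; d < docs.length; d++) {
                count[assign[d]]++;
                for (int i = 0; i < dim; i++) sum[assign[d]][i] += docs[d][i];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int i = 0; i < dim; i++) centroids[c][i] = sum[c][i] / count[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = { {0.1, 0.0}, {0.2, 0.1}, {0.9, 1.0}, {1.0, 0.9} };
        double[][] centroids = { {0.0, 0.0}, {1.0, 1.0} }; // initial guesses
        System.out.println(Arrays.toString(kmeans(docs, centroids, 10))); // prints [0, 0, 1, 1]
    }
}
```

In practice the document vectors would come from the index's term statistics (e.g. tf-idf weights), and the initial centroids would be chosen randomly or by a seeding heuristic rather than hard-coded.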
Can Lucene be used for this
Hi,

I have a huge data file with 4 GB of data. The data in the file never changes. The format of the file is as follows:

Col1,col2,col3,Value
abababc,xyzza,c,100
ababadx,xyz,adfdfd,101

I need to retrieve the value with simple queries on the data like:

select value where col1 like %ab, col2 like %aa% and col3 sounds like ;

Is Lucene suitable for this kind of task? I am currently using a DB for this and am wondering whether Lucene could be used instead.

Thanks,
Kumar.
Re: Can Lucene be used for this
On Tuesday, November 11, 2003, at 10:00 PM, Kumar Mettu wrote:
> Is Lucene suitable for doing this kind of task? I am using a DB currently
> for this. Wondering whether Lucene can be used for this.

Emulating that type of query is not a straightforward use of Lucene. The trickiest part is the sounds like: Lucene's FuzzyQuery is close, but not quite a sounds like. You could use WildcardQuery for the like clauses, but they might be better served by more sophisticated analysis that puts all combinations (a, ab, aba, abab) into the index as terms. There are certainly tricks that could be played at indexing-analysis or query-analysis time to do what you want. Would it be faster than a fast database on a dataset that large? I'm not sure.

Erik
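Erik's "all combinations (a, ab, aba, abab) as terms" idea can be sketched as a plain indexing-time expansion. This is not a Lucene Analyzer — the `SubstringTerms` class below is a hypothetical helper — but it shows how a substring pattern like %aa% becomes an exact term lookup once every substring of a token has been indexed as its own term.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical indexing-time helper: expand each token into all of its
// substrings, so that a "like %aa%" query becomes an exact term lookup
// on "aa" instead of a wildcard scan. Wiring this into a TokenStream is
// left as an exercise.
public class SubstringTerms {

    // Return every distinct substring of the token, shortest first.
    static List<String> expand(String token) {
        Set<String> seen = new LinkedHashSet<String>();
        for (int len = 1; len <= token.length(); len++)
            for (int start = 0; start + len <= token.length(); start++)
                seen.add(token.substring(start, start + len));
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        System.out.println(expand("abab"));
        // [a, b, ab, ba, aba, bab, abab]
    }
}
```

The trade-off is index size: a token of length n produces on the order of n² substrings, so real implementations usually cap the gram length (or index only prefixes/suffixes when the patterns are anchored, as in %ab).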
Re: Index pdf files with your content in lucene.
Try again zipping the files, after I post the files on the web site.

> Could you also tell us a bit about this code? Is it better than existing
> PDF/Word parsing solutions? Pure Java? Uses POI?
> Thanks, Otis

This code uses existing parsing solutions. The intent is to make a Lucene Document for indexing PDF and Word files, with their content. It is pure Java. It uses the TextExtraction library (tm-extractors-0.2.jar), which uses POI and PDFBox.

Ernesto
Sorry for my bad English.

--- Ernesto De Santis [EMAIL PROTECTED] wrote: Classes for indexing PDF and Word files in Lucene. Ernesto.

----- Original Message ----- From: Ernesto De Santis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, October 29, 2003 12:04 PM Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello all, thanks very much Stephan for your valuable help. Attached you will find the PDFDocument and WordDocument class source code. Ernesto.

----- Original Message ----- From: Hartmann, Waehrisch & Feykes GmbH [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, October 28, 2003 11:10 AM Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hi Ernesto, the IndexManager retrieves the list of files in a folder by calling the getFilesInFolder method of CmsObject. This method returns only empty files, i.e. with empty content. To get the content of a PDF file you have to re-read the file:

    f = cms.readFile(f.getAbsolutePath());

Bye, Stephan

On Monday, 27 October 2003 19:18, you wrote:

Hello, thanks for the previous reply. Now I use:
- version 1.4 of the Lucene search module (the version attached in this list)
- the new version of the registry.xml format for the module (as you wrote me)
- the PDF files stored with the binary type.

But I have the next problem: I can't make an InputStream for the CmsFile content.
For this I wrote this code in the Document method of my PDFDocument class:

    // f is the CmsFile parameter of the Document method
    InputStream in = new ByteArrayInputStream(f.getContents());
    // PDFExtractor is the library I use; on the file system it works fine
    PDFExtractor extractor = new PDFExtractor();
    bodyText = extractor.extractText(in);

Is it correct to use a ByteArrayInputStream to make an InputStream for a CmsFile? The error occurs in the third line, in the PDFParcer. The error message in Tomcat is:

    java.io.IOException: Error: Header is corrupt ''
        at PDFParcer.parse
        at PDFExtractor.extractText
        at PDFDocument.Document (my class)
        at.

Bye, and thanks. Ernesto.

----- Original Message ----- From: Hartmann, Waehrisch & Feykes GmbH To: [EMAIL PROTECTED] Sent: Friday, October 24, 2003 4:45 AM Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello Ernesto, I assume you are using the unpatched version 1.3 of the search module. As I mentioned yesterday, the plainDocFactory only indexes CmsFiles of type plain, not of type binary, and PDF files are stored as binary. I suggest using the version I posted yesterday. Then your registry.xml would have to look like this:

    ...
    <docFactories>
      ...
      <docFactory type="plain" enabled="true">
        ...
      </docFactory>
      <docFactory type="binary" enabled="true">
        <fileType name="pdftext">
          <extension>.pdf</extension>
          <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
        </fileType>
      </docFactory>
      ...
    </docFactories>

Important: the type attribute must match the file types of OpenCms (also defined in the registry.xml).

Bye, Stephan

----- Original Message ----- From: Ernesto De Santis To: Lucene Users List Cc: [EMAIL PROTECTED] Sent: Thursday, October 23, 2003 4:16 PM Subject: [opencms-dev] Index pdf files with your content in lucene.

Hello, I am new to OpenCms and Lucene technology. I want to index PDF files, and index the content of these files. I work this way: make a PDFDocument class like the JspDocument class, and use the org.textmining.text.extraction.PDFExtractor class; this class works fine outside the VFS.
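Stephan's diagnosis fits the stack trace: getFilesInFolder returns files with empty content, so the ByteArrayInputStream wraps zero bytes and the parser finds no "%PDF" marker at the start of the stream (hence "Header is corrupt ''"). Wrapping f.getContents() in a ByteArrayInputStream is itself fine. A defensive check before parsing — with a hypothetical `PdfHeaderCheck` class, written here for illustration — might look like:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Sanity check before handing bytes to a PDF parser: a valid PDF file
// begins with the magic bytes "%PDF". An empty byte array (what an
// unread CmsFile yields) fails this check immediately, which is cheaper
// and clearer than an IOException from deep inside the parser.
public class PdfHeaderCheck {

    // True if the buffer begins with the standard PDF magic bytes "%PDF".
    static boolean looksLikePdf(byte[] content) {
        byte[] magic = { '%', 'P', 'D', 'F' };
        if (content == null || content.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++)
            if (content[i] != magic[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];               // what an unread CmsFile yields
        byte[] pdf = "%PDF-1.3 ...".getBytes();
        System.out.println(looksLikePdf(empty));  // false
        System.out.println(looksLikePdf(pdf));    // true
        // Only wrap and hand over the stream once the check passes:
        InputStream in = new ByteArrayInputStream(pdf);
    }
}
```

If looksLikePdf returns false on a file you expect to be a PDF, re-read it first (f = cms.readFile(f.getAbsolutePath()), as Stephan suggests) before passing the stream to the extractor.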
and write my registry.xml for the PDF document, in the plainDocFactory tag:

    <fileType name="pdftext">
      <extension>.pdf</extension>
      <!-- This will strip tags before processing -->
      <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
    </fileType>

My PDFDocument contains this code: I think that the problem is how to take the content from the CmsFile, what