Re: Index pdf files with your content in lucene.
Hello,

Well, zipping the files did not work. If anybody wants the files, I can send them by personal email. And if somebody can post them on a web site, that would be great; I cannot post them on a web site myself.

Ernesto.
Index pdf files with your content in lucene.
Classes for indexing PDF and Word files in Lucene.

Ernesto.

----- Original Message -----
From: Ernesto De Santis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello all,

Thanks very much, Stephan, for your valuable help. Attached you will find the source code of the PDFDocument and WordDocument classes.

Ernesto.

----- Original Message -----
From: Hartmann, Waehrisch Feykes GmbH [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hi Ernesto,

the IndexManager retrieves the list of files in a folder by calling the getFilesInFolder method of CmsObject. This method returns only empty files, i.e. files with empty content. To get the content of a PDF file you have to reread the file:

    f = cms.readFile(f.getAbsolutePath());

Bye,
Stephan

On Monday, October 27, 2003 at 19:18 you wrote:

Hello,

Thanks for the previous reply. I now use:

- version 1.4 of the Lucene search module (the version attached on this list);
- the new registry.xml format for the module (as you wrote to me);
- PDF files stored with the binary type.

But I have the following problem: I cannot create an InputStream for the CmsFile content. For this I wrote the following code in the Document method of my PDFDocument class:

    // f is the CmsFile parameter of the Document method
    InputStream in = new ByteArrayInputStream(f.getContents());
    // PDFExtractor is the library I use; on the file system it works fine
    PDFExtractor extractor = new PDFExtractor();
    bodyText = extractor.extractText(in);

Is it correct to use a ByteArrayInputStream to create an InputStream for a CmsFile? The error occurs on the third line, in the PDFParser. The error message in Tomcat is:

    java.io.IOException: Error: Header is corrupt ''
        at PDFParser.parse
        at PDFExtractor.extractText
        at PDFDocument.Document (my class)
        at ...

Bye, and thanks.

Ernesto.
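The error quoted above, "Header is corrupt ''", is what one would expect if an empty byte array reached the parser, which matches Stephan's remark that getFilesInFolder returns files without content. Below is a minimal sketch of a guard that could run on the bytes before they are handed to an extractor. It uses only the Java standard library. PdfContentGuard, looksLikePdf and openPdfStream are hypothetical names invented for this illustration; they are not part of OpenCms, Lucene or the extraction library, and the check only looks for the "%PDF-" signature at the very start of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of OpenCms, Lucene, or PDFBox.
final class PdfContentGuard {

    // Bytes a PDF file is expected to start with.
    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfContentGuard() {
    }

    // Returns true only if the content begins with the "%PDF-" signature.
    static boolean looksLikePdf(byte[] content) {
        if (content == null || content.length < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (content[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    // Wraps the raw bytes in a stream, or fails early with a readable message.
    static InputStream openPdfStream(byte[] content) {
        if (!looksLikePdf(content)) {
            int length = (content == null) ? 0 : content.length;
            throw new IllegalArgumentException(
                    "Content does not start with %PDF- (" + length
                            + " bytes); was the file read with its content?");
        }
        return new ByteArrayInputStream(content);
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("empty content accepted: " + looksLikePdf(empty));
        System.out.println("pdf content accepted:   " + looksLikePdf(pdf));
    }
}
```

In the PDFDocument class, such a check would presumably sit between f.getContents() and extractor.extractText(in), so that an empty file is reported as such instead of surfacing as a parser error.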
----- Original Message -----
From: Hartmann, Waehrisch Feykes GmbH
To: [EMAIL PROTECTED]
Sent: Friday, October 24, 2003 4:45 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

Hello Ernesto,

I assume you are using the unpatched version 1.3 of the search module. As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of type plain, not of type binary. PDF files are stored as binary. I suggest using the version I posted yesterday. Your registry.xml would then have to look like this:

    ...
    <docFactories>
        ...
        <docFactory type="plain" enabled="true">
            ...
        </docFactory>
        <docFactory type="binary" enabled="true">
            <fileType name="pdftext">
                <extension>.pdf</extension>
                <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
            </fileType>
        </docFactory>
        ...
    </docFactories>

Important: the type attribute must match the file types of OpenCms (also defined in the registry.xml).

Bye,
Stephan

----- Original Message -----
From: Ernesto De Santis
To: Lucene Users List
Cc: [EMAIL PROTECTED]
Sent: Thursday, October 23, 2003 4:16 PM
Subject: [opencms-dev] Index pdf files with your content in lucene.

Hello,

I am new to OpenCms and Lucene technology. I want to index PDF files, including the content of those files. This is what I have done so far:

- I made a PDFDocument class modelled on the JspDocument class.
- I use the org.textmining.text.extraction.PDFExtractor class; this class works fine outside the VFS.
- I wrote an entry for PDF documents in my registry.xml, inside the plainDocFactory tag:

    <fileType name="pdftext">
        <extension>.pdf</extension>
        <!-- This will strip tags before processing -->
        <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
    </fileType>

I think the problem is how to get the content out of the CmsFile: which InputStream should I use? PDFExtractor works with the extractText(InputStream) method. My PDFDocument class contains this code:
    public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

        public PDFDocument() {
        }

        public Document Document(CmsObject cmsobject, CmsFile cmsfile) throws CmsException {
            return Document(cmsobject, cmsfile, null);
        }

        public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap) throws CmsException {
            Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);
            // put the content of the pdf file into the document
            String contenido = new String(cmsfile.getContents());
            StringBufferInputStream in = new StringBufferInputStream(contenido);
            // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes
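The PDFDocument code above turns the file's bytes into a String before building a stream from it. For binary data such as PDF, that detour can change the bytes. Below is a small standard-library sketch of the effect. BinaryRoundTripDemo and the sample bytes are made up for this illustration. The demo uses UTF-8 explicitly so that the outcome does not depend on the platform, whereas the original code relies on the platform default charset, so its behaviour may differ from machine to machine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the sample bytes are invented for illustration.
final class BinaryRoundTripDemo {

    private BinaryRoundTripDemo() {
    }

    // Decodes the bytes to text and encodes them again, as a String detour does.
    static byte[] viaString(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    // Reads the bytes back through a ByteArrayInputStream without any decoding.
    static byte[] viaStream(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "%PDF-" followed by four bytes that do not form valid UTF-8 sequences.
        byte[] sample = {
            0x25, 0x50, 0x44, 0x46, 0x2D,
            (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3
        };
        System.out.println("String detour (UTF-8) keeps bytes: "
                + Arrays.equals(viaString(sample, StandardCharsets.UTF_8), sample));
        System.out.println("ByteArrayInputStream keeps bytes:  "
                + Arrays.equals(viaStream(sample), sample));
    }
}
```

With a single-byte charset such as ISO-8859-1 the String detour happens to be lossless, which is one reason code of this kind can appear to work in one environment and fail in another. Wrapping the byte array directly avoids the question altogether.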
RE: Index pdf files with your content in lucene.
Some of us have corporate firewalls that are stripping out attachments. If possible, put these on a web site somewhere so we can download them. Thanks!

-----Original Message-----
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 11, 2003 11:07 AM
To: Lucene Users List
Subject: Index pdf files with your content in lucene.

Classes for indexing PDF and Word files in Lucene.

Ernesto.
Re: Index pdf files with your content in lucene.
Ernesto,

it looks like something got stripped. A ZIP file should make it to the list. If not, maybe you can post it somewhere.

Could you also tell us a bit about this code? Is it better than existing PDF/Word parsing solutions? Pure Java? Uses POI?

Thanks,
Otis

--- Ernesto De Santis [EMAIL PROTECTED] wrote:

Classes for indexing PDF and Word files in Lucene.

Ernesto.
Re: Index pdf files with your content in lucene.
I will try zipping the files again. After that, I will post the files on a web site.

> Could you also tell us a bit about this code? Is it better than existing PDF/Word parsing solutions? Pure Java? Uses POI?

This code uses existing parsing solutions. The intent is to build a Lucene Document for indexing PDF and Word files together with their content. It is pure Java. It uses the TextExtraction library (tm-extractors-0.2.jar). It uses POI and PDFBox.

Ernesto

Sorry for my bad English.
Index pdf files with your content in lucene.
Hello,

I am new to OpenCms and Lucene technology. I want to index PDF files, including the content of those files. This is what I have done so far:

- I made a PDFDocument class modelled on the JspDocument class.
- I use the org.textmining.text.extraction.PDFExtractor class; this class works fine outside the VFS.
- I wrote an entry for PDF documents in my registry.xml, inside the plainDocFactory tag:

    <fileType name="pdftext">
        <extension>.pdf</extension>
        <!-- This will strip tags before processing -->
        <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
    </fileType>

I think the problem is how to get the content out of the CmsFile: which InputStream should I use? PDFExtractor works with the extractText(InputStream) method. My PDFDocument class contains this code:

    public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

        public PDFDocument() {
        }

        public Document Document(CmsObject cmsobject, CmsFile cmsfile) throws CmsException {
            return Document(cmsobject, cmsfile, null);
        }

        public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap) throws CmsException {
            Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);
            // put the content of the pdf file into the document
            String contenido = new String(cmsfile.getContents());
            StringBufferInputStream in = new StringBufferInputStream(contenido);
            // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());
            /*
            try {
                FileInputStream in = new FileInputStream(cmsfile.getPath() + cmsfile.getName());
            */
            PDFExtractor extractor = new PDFExtractor();
            String body = extractor.extractText(in);
            document.add(Field.Text(body, body));
            /*
            } catch (FileNotFoundException e) {
                e.toString();
                throw new CmsException();
            }
            */
            return (document);
        }
    }

Thanks,
Ernesto

PS: Sorry for my poor English.

----- Original Message -----
From: Hartmann, Waehrisch Feykes GmbH [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 22, 2003 3:50 AM
Subject: Re: [opencms-dev] (no subject)

Hi Ben,

I think this won't work, since the plainDocFactory will only be used for files of type plain, not for files of type binary.
Recently we have made some additions to the module, by order of Lenord, Bauer & Co. GmbH, that could meet your needs. They introduce a more flexible way of defining docFactories, so that you can add new factories without having to recompile the whole module. Other modules (like the news module) can then bring their own docFactory, and all you have to do is edit the registry.xml. Here is an example:

    <docFactories>
        <docFactory enabled="true" type="plain">
            <fileType name="plaintext">
                <extension>.txt</extension>
                <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
            </fileType>
        </docFactory>
        <docFactory enabled="true" type="news">
            <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
        </docFactory>
    </docFactories>

To index binary files, all you need to add is this:

    <docFactory enabled="true" type="binary">
        <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
    </docFactory>

There should be no need for an extension mapping.

For those interested: for ContentDefinitions (like news) I introduced the following:

    <contentDefinitions>
        <contentDefinition type="news"> <!-- must match docFactory type -->
            <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
            <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
            <listMethod name="getNewsList">
                <param type="java.lang.Integer">1</param>
                <param type="java.lang.String">-1</param>
            </listMethod>
            <page uri="/news.html?__element=entry">
                <param method="getIntId" name="newsid"/>
            </page>
        </contentDefinition>
    </contentDefinitions>

In short:

- initClass is optional. For the news module, the news classes have to be loaded to initialize the db pool.
- listMethod is a method of the content definition class that returns a List of elements.
- page is the page that can display an entry; here it is a JSP that has a template element named entry. It also needs the id of the news item: getIntId is a method of the content definition class, and newsid is the URL parameter the page needs. A link like news.html?__element=entry&newsid=xy will be generated.
Best regards,
Stephan

----- Original Message -----
From: Ben Rometsch [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 22, 2003 6:15 AM
Subject: [opencms-dev] (no subject)

Hi Matt,

I am not having any joy! I've updated my registry.xml file, with the appropriate section reading:

    <luceneSearch>
        <mergeFactor>10</mergeFactor>
        <permCheck>true</permCheck>
        <indexDir>c:\search</indexDir>