Re: Existing Parsers

2004-09-13 Thread Honey George
Hi Chris, I do not have any stats, but I think the performance is reasonable. I use xpdf for PDF and wvWare for DOC. The size of my index is ~2GB (this is not limited to PDF and DOC files). To avoid memory problems, I have set an upper bound on the size of the documents that can be indexed. For
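The upper bound on document size mentioned above could be sketched roughly as follows. This is only an illustration: the class name SizeLimitFilter and the idea of a byte-count cap are assumptions, and the post does not say what limit was actually used or how it was enforced.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: skips files that are too large to parse safely. */
final class SizeLimitFilter {
    // Cap in bytes; the value is chosen by the caller (the post gives no figure).
    private final long maxBytes;

    SizeLimitFilter(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /** Returns true only for regular files whose size does not exceed the cap. */
    boolean accepts(Path file) throws IOException {
        return Files.isRegularFile(file) && Files.size(file) <= maxBytes;
    }
}
```

An indexing loop would call accepts(path) before handing a file to the PDF or DOC converter and simply skip (or log) anything that is rejected.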

Re: Existing Parsers

2004-09-09 Thread Honey George
Hi, I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html - I am using this one and have found it good. It even supports various languages.
2. Word
 + http://sourceforge.net/projects/wvware
3. Excel
 + http://www.jguru.com/faq/view.jsp?EID=1074230

RE: Existing Parsers

2004-09-09 Thread Cocula Remi
For Word, see the tm-extractor at www.text-mining.org (based on POI). Pretty simple to use.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, 9 September 2004 15:47
To: Lucene Users List
Subject: Existing Parsers
Anyone know of any reliable parsers out

Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote:
Hi, I know some of them. 1. PDF + http://www.pdfbox.org/ + http://www.foolabs.com/xpdf/download.html - I am using this and found good. It even supports

My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously

Re: Existing Parsers

2004-09-09 Thread Chris Fraschetti
Some of the tools listed use command-line execs to convert a document of some sort to text; I then grab the text and add it to a Lucene doc, and so on. Are there any stats on how well that scales? In large-scale applications, I'm assuming this will cause some serious issues. Does anyone have any input on this?
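For reference, the exec-and-capture pattern described above might look something like the sketch below. It is not taken from any poster's code: the class name ExternalTextExtractor, the timeout parameter, and the choice to spool output to a temporary file are all assumptions, and it needs Java 9 or later for Redirect.DISCARD. Writing the converter's stdout to a file rather than reading a pipe avoids blocking on a full pipe buffer, and the timeout keeps one bad document from stalling the whole indexing run.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external converter and returns what it wrote to stdout. */
final class ExternalTextExtractor {

    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Spool stdout to a temp file so a large document cannot fill a pipe buffer.
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // keep warnings out of the text
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                p.waitFor(); // wait for the kill to finish before cleaning up
                throw new IOException("Converter timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("Converter failed with exit code " + p.exitValue());
            }
            // Assumes the converter emits UTF-8; adjust the charset if it does not.
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

With xpdf installed, a call along the lines of run(Arrays.asList("pdftotext", "file.pdf", "-"), 60) would return the extracted text, since pdftotext sends its output to stdout when the output file is given as "-". Each call still pays the cost of starting a new process, so on a large collection it may be worth measuring that overhead against an in-process parser.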