Hi Chris,
I do not have any stats, but I think the performance
is reasonable. I use xpdf for PDF and wvWare for DOC.
The size of my index is ~2GB (this is not limited to
PDF and DOC files). To avoid memory problems, I have
set an upper bound on the size of the documents that
can be indexed. For
Hi,
I know some of them.
1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html
- I am using this one and have found it good. It even supports
various languages.
2. word
+ http://sourceforge.net/projects/wvware
3. excel
+ http://www.jguru.com/faq/view.jsp?EID=1074230
For Word, see the tm-extractor at www.text-mining.org (based on POI). It is pretty simple to
use.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 9, 2004 15:47
To: Lucene Users List
Subject: Existing Parsers
Anyone know of any reliable parsers out
Honey George wrote:
Hi,
I know some of them.
1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html
- I am using this one and have found it good. It even supports
My dated experience from 2 years ago was that the (evil, native-code)
foolabs PDF parser was the best, but obviously
Some of the tools listed use command-line execs to convert a doc of some
sort to text, and then I grab the text and add it to a Lucene doc, and
so on.
Any stats on the scalability of that? In large scale applications, I'm
assuming this will cause some serious issues... anyone have any input
on this?
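
As a rough illustration of the pattern described above (exec a
command-line converter, capture its text, then hand the text to the
indexer), here is a minimal Java sketch. It assumes xpdf's pdftotext is
on the PATH. The class name, the 10 MB cap, and the 60-second timeout
are hypothetical and not taken from this thread. The size check mirrors
the upper-bound idea mentioned earlier in the thread. Note that each
call starts a new OS process, which is the per-document cost the
scalability question is about.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ExternalTextExtractor {

    // Hypothetical cap; the thread does not say which limit was used.
    static final long MAX_BYTES = 10L * 1024 * 1024;

    // True if a file of sizeBytes is small enough to be indexed.
    static boolean withinSizeLimit(long sizeBytes, long maxBytes) {
        return sizeBytes >= 0 && sizeBytes <= maxBytes;
    }

    // Command line for xpdf's pdftotext; "-" sends the text to stdout.
    static List<String> pdfToTextCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs the converter, writing its stdout to a temp file so that a
    // hung process can be killed after the timeout instead of blocking.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("extractor timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("extractor exit code " + p.exitValue());
            }
            return Files.readString(out, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    // Returns the extracted text, or null if the file exceeds the cap.
    static String extractPdf(Path pdf)
            throws IOException, InterruptedException {
        if (!withinSizeLimit(Files.size(pdf), MAX_BYTES)) {
            return null; // caller skips oversized documents
        }
        return run(pdfToTextCommand(pdf), 60);
    }
}
```

The string returned by extractPdf would then be added to a Lucene
document field; a null return means the file was skipped for being over
the cap. For a large corpus, one option is to run several of these
extractions in parallel from a small thread pool, since most of the
time is spent in the external process.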