Don Vaillancourt wrote:
I used the following code example, taken from an article linked from the Jakarta site, to index PDF files:

doc.add(Field.Text("content", new FileReader(f)));

But I realized today that this method only indexes the PDF as is. For those wondering whether the PDFs were actually indexed, or whether they maybe only contained images: I verified this with Luke, and those PDFs are in the index, but the only keywords that were indexed were the PDF definition statements and encoded data.

So what is the proper way to index a PDF?

The proper way is to first pass the PDF file through a PDF parser (e.g. PDFBox), then extract the plain-text content (body, title, author, etc.), and only then add that plain text to the index.
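
To illustrate why the raw file yields only PDF syntax keywords, here is a small self-contained sketch that uses nothing beyond the JDK. It is a toy under stated assumptions, not a PDF parser: RawPdfTokens, fakePdf, extractText and tokens are hypothetical names made up for this example, the hand-built fragment is not a complete PDF, and a real content stream holds text-drawing operators rather than bare text. It only shows that text stored in a Flate-compressed stream is invisible to anything that tokenizes the bytes as they are, and reappears once the stream is decoded.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class RawPdfTokens {

    /** Builds a toy PDF-like fragment whose stream body is zlib-compressed (hypothetical helper). */
    public static byte[] fakePdf(String visibleText) {
        Deflater deflater = new Deflater();
        deflater.setInput(visibleText.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            body.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();

        byte[] head = ("%PDF-1.4\n1 0 obj\n<< /Length " + body.size()
                + " /Filter /FlateDecode >>\nstream\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] tail = "\nendstream\nendobj\n".getBytes(StandardCharsets.ISO_8859_1);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(body.toByteArray(), 0, body.size());
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Toy stand-in for a PDF parser: inflates the single stream of the fragment above. */
    public static String extractText(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so the round trip below is lossless.
        String raw = new String(pdf, StandardCharsets.ISO_8859_1);
        int start = raw.indexOf("stream\n") + "stream\n".length();
        int end = raw.lastIndexOf("\nendstream");
        byte[] body = raw.substring(start, end).getBytes(StandardCharsets.ISO_8859_1);

        Inflater inflater = new Inflater();
        inflater.setInput(body);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input; stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("stream body is not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Crude stand-in for an analyzer: lower-cased runs of ASCII letters, three or more long. */
    public static Set<String> tokens(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (t.length() > 2) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "Lucene indexes plain text. ";
        byte[] pdf = fakePdf(sentence + sentence + sentence);

        // What reading the file as-is hands to the indexer: syntax keywords plus noise.
        System.out.println("raw:    " + tokens(new String(pdf, StandardCharsets.ISO_8859_1)));
        // What the indexer sees once the stream has been decoded first.
        System.out.println("parsed: " + tokens(extractText(pdf)));
    }
}
```

In a real indexer the toy extractText step would be replaced by the parser's own text extraction (in PDFBox that is, as far as I know, PDFTextStripper.getText on a loaded PDDocument), and the resulting String is what should be passed to Field.Text("content", ...) instead of a FileReader on the PDF itself.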



-- Best regards, Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

