If you read the README.txt file, you will find this:

Apache Nutch README

Important note: Due to licensing issues we cannot provide two libraries that
are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser
library we use for parsing PDF files. If you encounter unexpected problems when
working with PDF files please

1. Download the two missing libraries from:
   http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

2. Put them in the directory src/plugin/parse-pdf/lib
3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml
4. Rebuild Nutch.
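
To double-check step 2, something like the following might help. It is only a rough sketch, not part of Nutch: the class name JaiJarCheck and the method findJarsContaining are made up for illustration, and the default directory is just the path from the README. It looks through the jars in the plugin's lib directory for the class named in the NoClassDefFoundError. Finding the class there does not prove that the plugin loads it, since the library lines in plugin.xml still have to be uncommented and Nutch rebuilt.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Nutch or PDFBox.
class JaiJarCheck {
    // The class from the NoClassDefFoundError, written as a jar entry path.
    static final String PLANAR_IMAGE = "javax/media/jai/PlanarImage.class";

    /** Returns the jars directly under libDir that contain the given entry. */
    static List<Path> findJarsContaining(Path libDir, String entryName) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getJarEntry(entryName) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the directory named in step 2 of the README.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "src/plugin/parse-pdf/lib");
        List<Path> hits = findJarsContaining(libDir, PLANAR_IMAGE);
        if (hits.isEmpty()) {
            System.out.println("No jar in " + libDir + " contains " + PLANAR_IMAGE);
        } else {
            for (Path jar : hits) {
                System.out.println("Found " + PLANAR_IMAGE + " in " + jar);
            }
        }
    }
}
```

Run it from the Nutch source directory, or pass the lib directory of the built plugin as the first argument. If the class is present and the error persists, the jars are probably not the problem and plugin.xml or the rebuild is the next thing to look at.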


Peter van Dijk wrote:
> After using Nutch for a while, I found that some PDF files can't be indexed:
>
> java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
>
> I've already fixed my PDF plugins so that the jai_core.jar and jai_codec.jar
> files are in place, and the respective lines in parse-pdf/plugin.xml are
> uncommented. However, some PDF files continue to fail, while others just work.
