If you read README.txt (the Apache Nutch README), you will find this note:
Important note: Due to licensing issues we cannot provide two libraries that are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser library we use for parsing PDF files. If you encounter unexpected problems when working with PDF files please

1. Download the two missing libraries from: http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/
2. Put them in the directory src/plugin/parse-pdf/lib
3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml
4. Rebuild Nutch.

Peter van Dijk wrote:
> After using Nutch for a while, I found that some PDF files can't be indexed:
>
> java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
>
> I've already fixed my PDF plugin so that the jai_core.jar and jai_codec.jar
> files are in place, and the respective lines in parse-pdf/plugin.xml are
> uncommented. However, some PDF files continue to fail?! Other PDF files just work!
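If the error persists after those steps, one quick sanity check is whether the class named in the NoClassDefFoundError can be loaded at all from the two jars. The following is only a minimal sketch: the class name JaiCheck and the helper isLoadable are made up for illustration and are not part of Nutch or PDFBox.

```java
// Hypothetical helper, not part of Nutch or PDFBox.
public class JaiCheck {

    // Class named in the NoClassDefFoundError quoted above.
    static final String JAI_CLASS = "javax.media.jai.PlanarImage";

    // Returns true if the named class can be found by the class loader
    // that loaded JaiCheck. The class is not initialized, only located.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, JaiCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(JAI_CLASS + " loadable: " + isLoadable(JAI_CLASS));
    }
}
```

Run it with the two jars on the classpath, for example `java -cp src/plugin/parse-pdf/lib/jai_core.jar:src/plugin/parse-pdf/lib/jai_codec.jar:. JaiCheck` (use `;` as the separator on Windows). Note that this only tells you whether the jars themselves contain the class. As far as I know, Nutch loads plugin libraries through its own plugin class loader based on plugin.xml, so a positive result here does not prove that the parse-pdf plugin sees the jars at runtime.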

