Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

2010-09-08 Thread Markus Jelsma
This description fooled me once too. Hasn't it been patched yet? Now it is [1], please commit.

[1]: https://issues.apache.org/jira/browse/NUTCH-900

On Wednesday 14 July 2010 07:10:47 Mattmann, Chris A (388J) wrote:
> No problem, Brad! If you'd like feel free to create an issue in Nutch JIRA

ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

2010-07-13 Thread brad
I'm getting the following error on a regular basis with PDFs on Nutch 1.1:

2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual=''

Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

2010-07-13 Thread Mattmann, Chris A (388J)
Hi Brad,

This might be a PDFBox issue. PDFBox is the underlying Java library that Tika wraps for PDF, and Nutch in turn wraps Tika through parse-tika. You may want to download Apache PDFBox and try parsing the PDF file with it outside of Nutch and Tika. If it works with the latest version (I think 1.2?)
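
Before involving a PDF library at all, it can help to check whether the bytes that were fetched look like a complete PDF. One possible cause of an expected='endstream' error is a file that was cut short during download. That is only an assumption here, the thread does not confirm it. The sketch below uses nothing but the Java standard library and applies a rough heuristic: it compares the number of "stream" and "endstream" keywords and looks for the %%EOF marker near the end of the file. The class name PdfTruncationCheck and its methods are hypothetical names chosen for this illustration, and a PDF whose binary data happens to contain the word "stream" can trigger a false positive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Nutch, Tika, or PDFBox.
public class PdfTruncationCheck {

    // Counts non-overlapping occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        int i = haystack.indexOf(needle);
        while (i >= 0) {
            n++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return n;
    }

    // Heuristic: the file looks truncated if some stream was opened but never
    // closed, or if no %%EOF marker appears in the last 1024 bytes.
    static boolean looksTruncated(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so offsets are preserved.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int ends = count(text, "endstream");
        // Every "endstream" also contains "stream", so subtract those matches.
        int starts = count(text, "stream") - ends;
        int tailStart = Math.max(0, text.length() - 1024);
        boolean hasEof = text.indexOf("%%EOF", tailStart) >= 0;
        return starts > ends || !hasEof;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfTruncationCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(data.length + " bytes read");
        System.out.println(looksTruncated(data)
                ? "possibly truncated"
                : "no obvious truncation");
    }
}
```

To try it, save a copy of the PDF locally and run java PdfTruncationCheck with the file path as the argument. If a copy downloaded by hand looks complete while the copy Nutch fetched does not, the fetch step is worth a look before suspecting the parser.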