This description fooled me once too. Hadn't it been patched yet? Now it is
[1], please commit.
[1]: https://issues.apache.org/jira/browse/NUTCH-900
On Wednesday 14 July 2010 07:10:47 Mattmann, Chris A (388J) wrote:
No problem, Brad! If you'd like, feel free to create an issue in the Nutch JIRA
I'm getting the following error on a regular basis with PDFs on Nutch 1.1
2010-07-13 10:57:32,719 ERROR tika.TikaParser - Error parsing
http://www.careeronestop.org/TridionMutlimedia/spring2010_tcm24-5512.pdf
java.io.IOException: expected='endstream' actual=''
Hi Brad,
This might be a PDFBox issue, which is the underlying Java library that Tika wraps
for PDF, and in turn Nutch wraps through parse-tika.
You may want to download Apache PDFBox and try parsing the PDF file with it
outside of Nutch and Tika. If it works with the latest version (I think 1.2?)