Re: PDF text extraction problems

Julien Nioche Tue, 11 Jan 2011 04:43:25 -0800

This has been mentioned several times on the list.

This is probably due to the fetch size limit. The default value in Nutch is:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

Try setting the value to -1 (for example by overriding the property in
conf/nutch-site.xml) to see if this solves the issue.
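
Before changing the config, a quick check along these lines could tell you
whether a given PDF is even affected by the limit. This is only a rough
sketch: the function names are made up for illustration and are not part of
Nutch or Tika. It assumes the 65536 default quoted above and relies on the
fact that a complete PDF ends with a %%EOF marker, so a copy cut off at the
limit would normally be missing it.

```python
from pathlib import Path

# Default of http.content.limit as quoted above, in bytes.
DEFAULT_LIMIT = 65536


def would_be_truncated(size, limit=DEFAULT_LIMIT):
    """True if content of `size` bytes is longer than a nonnegative limit."""
    return limit >= 0 and size > limit


def simulate_fetch(data, limit=DEFAULT_LIMIT):
    """Return the bytes a fetch with this limit would keep (hypothetical helper)."""
    return data if limit < 0 else data[:limit]


def has_eof_marker(data, tail=1024):
    """True if the PDF end-of-file marker appears in the last `tail` bytes."""
    return b"%%EOF" in data[-tail:]


def check_pdf(path, limit=DEFAULT_LIMIT):
    """Summarise whether the file at `path` would survive the limit intact."""
    data = Path(path).read_bytes()
    kept = simulate_fetch(data, limit)
    return (
        f"{path}: {len(data)} bytes, "
        f"truncated={would_be_truncated(len(data), limit)}, "
        f"eof_marker_after_fetch={has_eof_marker(kept)}"
    )
```

For example, print(check_pdf("blablabla.pdf")) on a local copy of the
document. If it reports truncated=True, the 'endstream' error is most likely
just the parser hitting the cut-off point.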

You can also test the parsing using:

bin/nutch org.apache.nutch.parse.ParserChecker blablabla.pdf

or by calling Tika directly on a file or URL, e.g.

java -jar /usr/local/bin/tika-0.7/tika-app/target/tika-app-0.7.jar blablabla.pdf

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 11 January 2011 12:18, Peter Litsegård <[email protected]> wrote:

> Hi!
>
> I'm running Nutch v1.2 and experience problems while trying to index
> PDF-documents. The error I receive is:
>
> Error parsing: <docname>.pdf: failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@cbf92
>
> I've inspected the security settings and printing/content copying/page
> extraction are all allowed. While inspecting the document properties I see:
>
>        - PDF Producer: Adobe PDF Library 9.9
>        - PDF Version: 1.5 (Acrobat 5.x)
>
> What might be the culprit here?
>
> Thanks in advance!
> /Peter
