SV: PDF text extraction problems

Peter Litsegård Tue, 11 Jan 2011 05:25:08 -0800

Many thanks!

I had specified "file.content.limit" instead of "http.content.limit". Setting
the right property fixed my problem and it works wonderfully!

Again, many thanks!
/Peter 

-----Original message-----
From: Julien Nioche [mailto:[email protected]]
Sent: 11 January 2011 13:43
To: [email protected]
Subject: Re: PDF text extraction problems

This has been mentioned several times on the list.

It is probably due to the fetch size limit. The default value in Nutch is:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

Try setting it to -1 to see if this solves the issue.
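
For reference, a minimal sketch of such an override. It assumes local settings live in conf/nutch-site.xml, which takes precedence over the defaults in nutch-default.xml; the shortened description text is illustrative only.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides (assumed location) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- a negative value disables truncation of fetched content -->
    <value>-1</value>
    <description>Do not truncate content downloaded over http.
    </description>
  </property>
</configuration>
```

Note that this is the http.* property; file.content.limit is a separate setting that applies to the file protocol.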

You can also test the parsing using:

bin/nutch org.apache.nutch.parse.ParserChecker blablabla.pdf

or by calling Tika directly on a URL e.g.

java -jar /usr/local/bin/tika-0.7/tika-app/target/tika-app-0.7.jar blablabla.pdf

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 11 January 2011 12:18, Peter Litsegård <[email protected]> wrote:

> Hi!
>
> I'm running Nutch v1.2 and am experiencing problems while trying to index
> PDF documents. The error I receive is:
>
> Error parsing: <docname>.pdf: failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@cbf92
>
> I've inspected the security settings and printing/content copying/page 
> extraction are all allowed. While inspecting the document properties I see:
>
>        - PDF Producer: Adobe PDF Library 9.9
>        - PDF Version: 1.5 (Acrobat 5.x)
>
> What might be the culprit here?
>
> Thanks in advance!
> /Peter
