Re: Unable to extract PDF content

Julien Nioche Mon, 21 Mar 2011 07:35:17 -0700

I've tried both URLs with nutch-1.3, using

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description/>
</property>

and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
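
To illustrate why the limit matters, here is a rough sketch in Python (not Nutch code; apply_content_limit and the toy PDF bytes are made up for the example). It follows the rule stated in the property description: a non-negative limit truncates the fetched content, a negative one does not. A PDF cut off this way loses its tail, which would be consistent with a parser complaining that it expected 'endstream' and found nothing.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Hypothetical helper mimicking the documented rule:
    limit >= 0 truncates longer content, a negative limit disables truncation."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


# A toy stand-in for a PDF: header, one large stream object, end markers.
pdf = (
    b"%PDF-1.4\n1 0 obj\nstream\n"
    + b"x" * 300000
    + b"\nendstream\nendobj\n%%EOF\n"
)

truncated = apply_content_limit(pdf, 200000)  # the limit from the quoted config
full = apply_content_limit(pdf, -1)           # the setting used in this reply

# The truncated copy no longer contains the closing 'endstream' marker.
print(len(truncated), b"endstream" in truncated)
print(len(full), b"endstream" in full)
```

Whether this is what happened for these particular URLs depends on their actual size relative to the configured limit.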

It could be that nutch-site.xml is loaded from conf/ and then overridden by
the copy found in the job file (which is the reason why we separated
runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the
job file or generating a fresh one with 'ant job' and see if this fixes the
issue.
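
One way to check for a stale copy before rebuilding is to compare the value in conf/ with the one packed into the job file. The sketch below is Python and rests on assumptions: that the job file is an ordinary zip/jar archive and that nutch-site.xml sits at its top level. The helper names read_property and property_in_job are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_property(xml_bytes, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(xml_bytes)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def property_in_job(job_path, name, member="nutch-site.xml"):
    """Look up a property in the config file packed inside a job archive.

    Assumes the archive is zip-compatible and that `member` is stored at
    the archive root; returns None if the member or property is missing.
    """
    with zipfile.ZipFile(job_path) as job:
        if member not in job.namelist():
            return None
        return read_property(job.read(member), name)
```

For example, calling read_property on the bytes of conf/nutch-site.xml and property_in_job on the job file, both with "http.content.limit", shows whether the two copies disagree. If they do, the job file is probably out of date.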

Julien

On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:

> I'm also having the same issue with nutch-1.2.
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker
> http://www.egamaster.com/datos/politica_fr.pdf
> ---------
> Url
> ---------------
> http://www.egamaster.com/datos/politica_fr.pdf---------
> ParseData
> ---------
> Version: 5
> Status: *failed*(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.PushBackInputStream@2918958e
> Title:
> Outlinks: 0
> Content Metadata:
> Parse Metadata:
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker
> http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>
>
> $ java -jar /usr/local/bin/tika-app-0.9.jar
> http://www.egamaster.com/datos/politica_fr.pdf
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
> ....
>
>
>
> <property>
>   <name>http.content.limit</name>
> <value>*200000* <!--  65536--></value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
>
>  <mime-type type="application/pdf">
>     <alias type="application/x-pdf"/>
>     <acronym>PDF</acronym>
>     <comment>Portable Document Format</comment>
>     <magic priority="50">
>       <match value="%PDF-" type="string" offset="0"/>
>     </magic>
>     <glob pattern="*.pdf"/>
>   </mime-type>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable protocol-httpclient, but be aware of possible intermittent
>   problems with the underlying commons-httpclient library. Nutch now also
>   includes integration with Tika to leverage Tika's parsing capabilities
>   for multiple content types. The existing Nutch parser implementations
>   will likely be phased out in the next release or so, as such, it is
>   a good idea to begin migrating away from anything not provided by
>   parse-tika.
>   </description>
> </property>
>
> --
> Regards,
> K. Gabriele


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
