Re: Unable to extract PDF content

Gabriele Kahlout Mon, 21 Mar 2011 06:44:20 -0700

I'm also having the same issue with nutch-1.2.

$ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
---------
Url
---------------
http://www.egamaster.com/datos/politica_fr.pdf
---------
ParseData
---------
Version: 5
Status: *failed*(2,0): expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@2918958e
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:


$ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
Exception in thread "main" java.lang.NullPointerException
    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)


$ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
....



<property>
  <name>http.content.limit</name>
<value>*200000* <!--  65536--></value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
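
To make the rule in that description concrete, here is a minimal Java sketch of the truncation behaviour as the description states it. It is not Nutch's actual code: the class and method names are hypothetical, and the 100000-byte array is only a stand-in for a downloaded file. A PDF cut off at the limit would lose its tail, which might be one explanation for an 'endstream' error like the one above, but that is an assumption rather than something confirmed here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Nutch: it only mirrors the rule stated
// in the http.content.limit description.
class ContentLimitSketch {

    // A non-negative limit truncates longer content;
    // a negative limit means no truncation at all.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit);
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] download = new byte[100000]; // stand-in for a 100000-byte file
        System.out.println(applyLimit(download, 65536).length);  // 65536 (the commented-out value)
        System.out.println(applyLimit(download, 200000).length); // 100000 (the raised value)
        System.out.println(applyLimit(download, -1).length);     // 100000 (no limit)
    }
}
```

With the commented-out value of 65536, anything past the first 64 KiB is dropped; with 200000 or a negative value, a file of this size would pass through whole.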


  <mime-type type="application/pdf">
    <alias type="application/x-pdf"/>
    <acronym>PDF</acronym>
    <comment>Portable Document Format</comment>
    <magic priority="50">
      <match value="%PDF-" type="string" offset="0"/>
    </magic>
    <glob pattern="*.pdf"/>
  </mime-type>
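
As a rough illustration of what this mime-type entry asks for, the sketch below checks the two conditions it declares: the string "%PDF-" at offset 0 of the content, and a file name matching *.pdf. This is not how Tika implements detection; the class and method names are hypothetical, and the glob check is a simple case-sensitive suffix test.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika's detector: it only restates the two rules
// declared in the mime-type entry.
class PdfMagicSketch {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // <match value="%PDF-" type="string" offset="0"/>
    static boolean matchesMagic(byte[] content) {
        if (content.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (content[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // <glob pattern="*.pdf"/>, simplified to a case-sensitive suffix test
    static boolean matchesGlob(String fileName) {
        return fileName.endsWith(".pdf");
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matchesMagic(header));
        System.out.println(matchesGlob("politica_fr.pdf"));
    }
}
```

Note that the magic check only looks at the first five bytes, so a file can be identified as application/pdf even when the rest of its content is incomplete.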

<property>
  <name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library. Nutch now also includes integration with Tika
  to leverage Tika's parsing capabilities for multiple content types. The existing Nutch
  parser implementations will likely be phased out in the next release or so, as such, it is
  a good idea to begin migrating away from anything not provided by parse-tika.
  </description>
</property>
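
Since plugin.includes is a regular expression over plugin directory names, a quick way to see which plugins the value above selects is to test a few names against it. The sketch below does that with java.util.regex, assuming the whole directory name has to match the expression (my reading of the description, not something checked against Nutch's plugin loader). The class name and the list of sample directory names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical check, not Nutch's plugin loader: tests sample plugin
// directory names against the plugin.includes value from the config above.
class PluginIncludesSketch {

    static final Pattern INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(text|html|js|tika)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)");

    // Assumption: a plugin is included only if its entire name matches.
    static boolean isIncluded(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"parse-tika", "parse-pdf", "protocol-http", "protocol-httpclient"};
        for (String name : samples) {
            System.out.println(name + " -> " + isIncluded(name));
        }
    }
}
```

Under that assumption, parse-tika and protocol-http are selected, while parse-pdf and protocol-httpclient are not, so PDF parsing with this value would go through parse-tika.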

-- 
Regards,
K. Gabriele

