Re: Unable to extract PDF content

Gabriele Kahlout Mon, 21 Mar 2011 08:09:12 -0700

I've tried with a small file and it worked:

$ bin/nutch org.apache.nutch.parse.ParserChecker
http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
[1] 15177
michaela:nutch-1.2 simpatico$ ---------
Url
---------------
http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Moved Temporarily
Outlinks: 1
  outlink: toUrl: https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 anchor: here
Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06
GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT
Location=https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 Via=1.0
cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close
Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain
Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
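
A caveat on the run above: the URL contains an unquoted "&", so the shell put the command in the background (hence the "[1] 15177" line) and ParserChecker only received the URL up to "...MWQ3", without "&hl=en". Quoting the argument avoids that. A minimal sketch in Python of building a safely quoted command line; it only uses the standard shlex module and the command already shown above:

```python
import shlex

# The URL exactly as typed in the command above, including the "&hl=en" part.
url = ("http://docs.google.com/leaf?"
       "id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en")

command = ["bin/nutch", "org.apache.nutch.parse.ParserChecker", url]

# shlex.quote wraps any argument containing shell metacharacters such as
# "&" or "?" in single quotes, so the shell passes it through untouched.
quoted = " ".join(shlex.quote(part) for part in command)
print(quoted)
```

The printed line can be pasted into the shell as-is; the URL then reaches ParserChecker as a single argument.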



On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <
[email protected]> wrote:

> I've tried both URLs with nutch-1.3 with
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description/>
> </property>
>
> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>
> It could be that the nutch-site.xml is loaded from conf/ then overridden by
> the one found in the job file (which is the reason why we separated the
> runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the
> job file or generating a fresh one with 'ant job' and see if this fixes the
> issue.
>
> Julien
>
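
To check whether a stale job file is shadowing conf/nutch-site.xml, the value packed inside it can be read directly. A rough sketch in Python, assuming the .job file is an ordinary zip archive with nutch-site.xml at its root (an assumption about the job layout, worth verifying by listing the archive first); the function names are made up for illustration:

```python
import zipfile
import xml.etree.ElementTree as ET


def read_property(xml_bytes, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(xml_bytes)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def property_in_job(job_file, name, member="nutch-site.xml"):
    """Look up a property in a config file packed inside a job archive.

    Assumption: the config file sits at the archive root under `member`.
    """
    with zipfile.ZipFile(job_file) as job:
        if member not in job.namelist():
            return None
        return read_property(job.read(member), name)
```

Calling property_in_job with the path to the job file and "http.content.limit" and comparing the result with the value in conf/nutch-site.xml would show whether the two disagree.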
>
> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>
>> I'm also having the same issue with nutch-1.2.
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>> http://www.egamaster.com/datos/politica_fr.pdf
>> ---------
>> Url
>> ---------------
>> http://www.egamaster.com/datos/politica_fr.pdf---------
>> ParseData
>> ---------
>> Version: 5
>> Status: *failed*(2,0): expected='endstream' actual=''
>> org.apache.pdfbox.io.PushBackInputStream@2918958e
>> Title:
>> Outlinks: 0
>> Content Metadata:
>> Parse Metadata:
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>> http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>> Exception in thread "main" java.lang.NullPointerException
>>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>
>>
>> $ java -jar /usr/local/bin/tika-app-0.9.jar
>> http://www.egamaster.com/datos/politica_fr.pdf
>> <?xml version="1.0" encoding="UTF-8"?>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="xmpTPg:NPages" content="1"/>
>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>> ....
>>
>>
>>
>> <property>
>>   <name>http.content.limit</name>
>> <value>*200000* <!--  65536--></value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
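
The truncation rule in that description can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not Nutch's actual code, and the byte string below is a made-up stand-in rather than a real PDF:

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes when limit >= 0; otherwise keep it all."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


# Hypothetical 300 KB stream wrapped in PDF-like markers.
fake_pdf = (b"%PDF-1.4\n1 0 obj\nstream\n"
            + b"x" * 300000
            + b"\nendstream\nendobj\n%%EOF\n")

truncated = apply_content_limit(fake_pdf, 200000)
untouched = apply_content_limit(fake_pdf, -1)

print(len(truncated), b"endstream" in truncated)
print(len(untouched), b"endstream" in untouched)
```

A document cut off before its closing "endstream" marker would be consistent with the expected='endstream' error above, although the thread does not state the size of the failing file.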
>>
>>  <mime-type type="application/pdf">
>>     <alias type="application/x-pdf"/>
>>     <acronym>PDF</acronym>
>>     <comment>Portable Document Format</comment>
>>     <magic priority="50">
>>       <match value="%PDF-" type="string" offset="0"/>
>>     </magic>
>>     <glob pattern="*.pdf"/>
>>   </mime-type>
>>
>> <property>
>>   <name>plugin.includes</name>
>>
>> <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>   <description>Regular expression naming plugin directory names to
>>   include.  Any plugin not matching this expression is excluded.
>>   In any case you need at least include the nutch-extensionpoints plugin. By
>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please enable
>>   protocol-httpclient, but be aware of possible intermittent problems with the
>>   underlying commons-httpclient library. Nutch now also includes integration with Tika
>>   to leverage Tika's parsing capabilities for multiple content types. The existing Nutch
>>   parser implementations will likely be phased out in the next release or so, as such, it is
>>   a good idea to begin migrating away from anything not provided by parse-tika.
>>   </description>
>> </property>
>>
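
A quick way to see which plugin directory names that expression admits is to test a few names against it. The sketch below is in Python and assumes the whole directory name has to match the expression, which is how the description reads; the helper name is made up for illustration:

```python
import re

# The plugin.includes value quoted above, split across lines for readability.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js|tika)|"
    "index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|"
    "summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    # Assumption: the entire plugin directory name must match the expression.
    return re.fullmatch(includes, plugin_id) is not None


for plugin in ("parse-tika", "parse-pdf", "protocol-http", "protocol-httpclient"):
    print(plugin, is_included(plugin))
```

Under that assumption parse-tika is enabled while parse-pdf and protocol-httpclient are not, so PDFs would be handled by Tika and HTTPS URLs would need protocol-httpclient added to the value.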
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email
>> does not contain a valid code then the email is not received. A valid code
>> starts with a hyphen and ends with "X".
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> L(-[a-z]+[0-9]X)).
>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

