Re: Unable to extract PDF content

Gabriele Kahlout Mon, 21 Mar 2011 08:29:07 -0700

Otherwise, following your instructions is not working for me. When will
nutch-1.3 be released?


On Mon, Mar 21, 2011 at 4:07 PM, Gabriele Kahlout
<[email protected]> wrote:

> I've tried with a small file and it worked:
>
> $ bin/nutch org.apache.nutch.parse.ParserChecker
> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
> [1] 15177
> michaela:nutch-1.2 simpatico$ ---------
> Url
> ---------------
>
> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Moved Temporarily
> Outlinks: 1
>   outlink: toUrl:
> https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3anchor:
>  here
> Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06
> GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT
> Location=
> https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3Via=1.0
>  cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close
> Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain
> Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
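Note that the URL in the command above was not quoted. The "[1] 15177" job line and the URL echoed without "&hl=en" suggest that the shell cut the command at "&" and ran it in the background, and the "Moved Temporarily" title suggests that the page parsed was a redirect rather than the PDF itself. A small sketch of the difference, using a made-up example.com URL and a hypothetical show_arg helper rather than Nutch:

```shell
#!/bin/sh
# show_arg is a hypothetical helper: it prints the first argument it receives.
show_arg() { printf 'arg: %s\n' "$1"; }

# Unquoted: the shell ends the command at '&' and runs it in the background;
# 'hl=en' is then executed as a separate variable assignment.
show_arg http://example.com/leaf?id=abc&hl=en
wait    # let the background job finish before going on

# Quoted: the whole URL reaches the command as a single argument.
show_arg 'http://example.com/leaf?id=abc&hl=en'
```

Quoting the ParserChecker URL in the same way should pass the full address through to Nutch.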
>
> On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <
> [email protected]> wrote:
>
>> I've tried both URLs with nutch-1.3 with
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description/>
>> </property>
>>
>> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>>
>> It could be that the nutch-site.xml is loaded from conf/ then overridden by
>> the one found in the job file (which is the reason why we separated the
>> runtime/deploy from runtime/local in nutch 1.3 and 2.0). Try deleting the
>> job file or generating a fresh one with 'ant job' and see if this fixes the
>> issue.
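One way to check this, as a sketch only: compare the value in conf/ with the one packed into the job file. It assumes that the job file is a zip archive with nutch-site.xml at its root, that unzip is installed, and that <name> and <value> each sit on one line. prop_value and compare_limit are hypothetical helper names, not part of Nutch.

```shell
#!/bin/sh
# prop_value NAME: read a Hadoop-style config on stdin, print NAME's <value>.
# Assumes <name> and <value> each sit on a single line.
prop_value() {
  awk -v name="$1" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }'
}

# compare_limit JOB_FILE SITE_XML: show http.content.limit from each source.
# An empty "job:" line means the archive had no nutch-site.xml or no such property.
compare_limit() {
  echo "conf: $(prop_value http.content.limit < "$2")"
  echo "job:  $(unzip -p "$1" nutch-site.xml 2>/dev/null | prop_value http.content.limit)"
}
```

For example, run `compare_limit nutch-1.2.job conf/nutch-site.xml` from the Nutch directory. The job file name is a guess based on the version. If the two values differ, the job file copy may be the one taking effect.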
>>
>> Julien
>>
>>
>> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>>
>>> I'm also having the same issue with nutch-1.2.
>>>
>>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>>> http://www.egamaster.com/datos/politica_fr.pdf
>>> ---------
>>> Url
>>> ---------------
>>> http://www.egamaster.com/datos/politica_fr.pdf---------
>>> ParseData
>>> ---------
>>> Version: 5
>>> Status: *failed*(2,0): expected='endstream' actual=''
>>> org.apache.pdfbox.io.PushBackInputStream@2918958e
>>> Title:
>>> Outlinks: 0
>>> Content Metadata:
>>> Parse Metadata:
>>>
>>> $ bin/nutch org.apache.nutch.parse.ParserChecker
>>> http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>>> Exception in thread "main" java.lang.NullPointerException
>>>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>>
>>>
>>> $ java -jar /usr/local/bin/tika-app-0.9.jar
>>> http://www.egamaster.com/datos/politica_fr.pdf
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>> <head>
>>> <meta name="xmpTPg:NPages" content="1"/>
>>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>>> ....
>>>
>>>
>>>
>>> <property>
>>>   <name>http.content.limit</name>
>>> <value>*200000* <!--  65536--></value>
>>>   <description>The length limit for downloaded content using the http
>>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>>   than it will be truncated; otherwise, no truncation at all. Do not
>>>   confuse this setting with the file.content.limit setting.
>>>   </description>
>>> </property>
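Read literally, the description means that a PDF larger than the limit arrives cut short, which may be what leaves the parser looking for a missing 'endstream'. A sketch of that rule for a locally saved copy follows. would_truncate is a hypothetical helper and the file name is only an example.

```shell
#!/bin/sh
# would_truncate SIZE LIMIT (hypothetical helper): succeeds when content of
# SIZE bytes would be cut under http.content.limit=LIMIT.
# A negative LIMIT means no truncation, as the description says.
would_truncate() {
  [ "$2" -ge 0 ] && [ "$1" -gt "$2" ]
}

# Example with a locally saved copy, if one exists (file name is an assumption).
pdf=politica_fr.pdf
if [ -f "$pdf" ]; then
  size=$(( $(wc -c < "$pdf") ))
  if would_truncate "$size" 200000; then
    echo "$pdf ($size bytes) would be truncated at 200000"
  else
    echo "$pdf ($size bytes) fits within 200000"
  fi
fi
```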
>>>
>>>
>>>  <mime-type type="application/pdf">
>>>     <alias type="application/x-pdf"/>
>>>     <acronym>PDF</acronym>
>>>     <comment>Portable Document Format</comment>
>>>     <magic priority="50">
>>>       <match value="%PDF-" type="string" offset="0"/>
>>>     </magic>
>>>     <glob pattern="*.pdf"/>
>>>   </mime-type>
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>
>>> <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>>   <description>Regular expression naming plugin directory names to
>>>   include.  Any plugin not matching this expression is excluded.
>>>   In any case you need at least include the nutch-extensionpoints plugin.
>>>   By default Nutch includes crawling just HTML and plain text via HTTP,
>>>   and basic indexing and search plugins. In order to use HTTPS please
>>>   enable protocol-httpclient, but be aware of possible intermittent
>>>   problems with the underlying commons-httpclient library. Nutch now
>>>   also includes integration with Tika to leverage Tika's parsing
>>>   capabilities for multiple content types. The existing Nutch parser
>>>   implementations will likely be phased out in the next release or so,
>>>   as such, it is a good idea to begin migrating away from anything not
>>>   provided by parse-tika.
>>>   </description>
>>> </property>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A valid
>>> code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>>
>>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
>


-- 
Regards,
K. Gabriele

