I'm still not able to parse those PDFs, although they are fetched:

QueueFeeder finished: total 3 records + hit by time limit: 0
fetching http://www.egamaster.com/datos/politica_fr.pdf
fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
-finishing thread FetcherThread, activeThreads=6
fetching http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3d3c33b7*
-finishing thread FetcherThread, activeThreads=2
*Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2bf8f8c8*
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
*Error parsing: http://news.softpedia.com/newsPDF/Functional-Artificial-Leaf-Created-191670.pdf: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@4b6c06dd*
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-31 10:50:37, elapsed: 00:00:14

Statistics for CrawlDb: crawl/crawldb/0
*TOTAL urls: 3*
retry 0: 3
min score: 1.0
avg score: 1.0
max score: 1.0
*status 2 (db_fetched): 3*
CrawlDb statistics: done
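That "failed(2,0): expected='endstream' actual=''" status from PDFBox generally means the parser ran out of bytes before the end of a PDF stream object, which points at the fetched content being truncated rather than the files themselves being broken. A quick sanity check, assuming curl is available and the servers answer HEAD requests with a Content-Length header, is to compare the server-reported size of the failing PDFs against the configured http.content.limit (200000 in the nutch-site.xml quoted further down); anything larger than that limit gets cut off before 'endstream' ever arrives:

$ curl -sI http://www.egamaster.com/datos/politica_fr.pdf | grep -i content-length
$ curl -sI http://singinst.org/upload/artificial-intelligence-risk.pdf | grep -i content-length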
On Mon, Mar 21, 2011 at 4:28 PM, Gabriele Kahlout <[email protected]> wrote:

> Otherwise, trying your instructions is not working out. When will nutch-1.3
> be released?
>
> On Mon, Mar 21, 2011 at 4:07 PM, Gabriele Kahlout <[email protected]> wrote:
>
>> I've tried with a small file and it worked:
>>
>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3&hl=en
>> [1] 15177
>> michaela:nutch-1.2 simpatico$
>> ---------
>> Url
>> ---------------
>> http://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3
>> ---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: Moved Temporarily
>> Outlinks: 1
>>   outlink: toUrl: https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 anchor: here
>> Content Metadata: X-Frame-Options=SAMEORIGIN Date=Mon, 21 Mar 2011 15:05:06 GMT X-XSS-Protection=1; mode=block Expires=Mon, 21 Mar 2011 15:05:06 GMT Location=https://docs.google.com/leaf?id=0B7OkJm_8TIzzMDA0NzgxMzMtYTk0Ni00MTcwLTk0MTgtZjE0OGM4NDE5MWQ3 Via=1.0 cerbero.localdomain:800 (squid/2.6.STABLE21) Connection=close Content-Type=text/html; charset=UTF-8 X-Cache=MISS from cerbero.localdomain Server=GSE X-Content-Type-Options=nosniff Cache-Control=private, max-age=0
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>
>> On Mon, Mar 21, 2011 at 3:34 PM, Julien Nioche <[email protected]> wrote:
>>
>>> I've tried both URLs with nutch-1.3 with
>>>
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>-1</value>
>>>   <description/>
>>> </property>
>>>
>>> and ./nutch org.apache.nutch.parse.ParserChecker ran without any problems.
>>>
>>> It could be that the nutch-site.xml is loaded from conf/ and then overridden by the one found in the job file (which is the reason why we separated runtime/deploy from runtime/local in Nutch 1.3 and 2.0). Try deleting the job file or generating a fresh one with 'ant job' and see if this fixes the issue.
>>>
>>> Julien
>>>
>>> On 21 March 2011 13:43, Gabriele Kahlout <[email protected]> wrote:
>>>
>>>> I'm also having the same issue with nutch-1.2.
>>>>
>>>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf
>>>> ---------
>>>> Url
>>>> ---------------
>>>> http://www.egamaster.com/datos/politica_fr.pdf
>>>> ---------
>>>> ParseData
>>>> ---------
>>>> Version: 5
>>>> Status: *failed*(2,0): expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2918958e
>>>> Title:
>>>> Outlinks: 0
>>>> Content Metadata:
>>>> Parse Metadata:
>>>>
>>>> $ bin/nutch org.apache.nutch.parse.ParserChecker http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
>>>> Exception in thread "main" java.lang.NullPointerException
>>>>     at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>>>>
>>>> $ java -jar /usr/local/bin/tika-app-0.9.jar http://www.egamaster.com/datos/politica_fr.pdf
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>>> <head>
>>>> <meta name="xmpTPg:NPages" content="1"/>
>>>> <meta name="Creation-Date" content="2009-01-15T11:50:32Z"/>
>>>> ....
>>>>
>>>> <property>
>>>>   <name>http.content.limit</name>
>>>>   <value>*200000* <!-- 65536 --></value>
>>>>   <description>The length limit for downloaded content using the http
>>>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>>>   than it will be truncated; otherwise, no truncation at all. Do not
>>>>   confuse this setting with the file.content.limit setting.
>>>>   </description>
>>>> </property>
>>>>
>>>> <mime-type type="application/pdf">
>>>>   <alias type="application/x-pdf"/>
>>>>   <acronym>PDF</acronym>
>>>>   <comment>Portable Document Format</comment>
>>>>   <magic priority="50">
>>>>     <match value="%PDF-" type="string" offset="0"/>
>>>>   </magic>
>>>>   <glob pattern="*.pdf"/>
>>>> </mime-type>
>>>>
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-tf|urlnormalizer-(pass|regex|basic)</value>
>>>>   <description>Regular expression naming plugin directory names to
>>>>   include. Any plugin not matching this expression is excluded.
>>>>   In any case you need to at least include the nutch-extensionpoints plugin.
>>>>   By default Nutch includes crawling just HTML and plain text via HTTP,
>>>>   and basic indexing and search plugins. In order to use HTTPS please enable
>>>>   protocol-httpclient, but be aware of possible intermittent problems with
>>>>   the underlying commons-httpclient library. Nutch now also includes
>>>>   integration with Tika to leverage Tika's parsing capabilities for multiple
>>>>   content types. The existing Nutch parser implementations will likely be
>>>>   phased out in the next release or so; as such, it is a good idea to begin
>>>>   migrating away from anything not provided by parse-tika.
>>>>   </description>
>>>> </property>
>>>>
>>>> --
>>>> Regards,
>>>> K. Gabriele
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>
>> --
>> Regards,
>> K. Gabriele
>
> --
> Regards,
> K. Gabriele
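Since Julien's point above is that a stale nutch-site.xml packed inside the job file can silently override the one in conf/, a minimal way to rule that out, assuming a nutch-1.2 source checkout built with ant (the build/nutch-1.2.job path is the default layout and may differ on your setup), is to regenerate the job file and inspect the copy of nutch-site.xml it actually contains:

$ ant job
$ unzip -p build/nutch-1.2.job nutch-site.xml | grep -A 2 http.content.limit

With the limit set to -1 (or at least above the size of those PDFs) in whichever nutch-site.xml ends up winning, re-running the same check from the thread should in principle report Status: success(1,0) instead of the endstream failure:

$ bin/nutch org.apache.nutch.parse.ParserChecker http://www.egamaster.com/datos/politica_fr.pdf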
--
Regards,
K. Gabriele

