RE: Nutch 2.x : ParseUtil failing for some pdf files

j.sullivan Thu, 18 Oct 2012 18:09:56 -0700

Kiran,

I took a look at your nutch-site.xml and I did not see anything for 
http.accept. I believe nutch-default.xml does not include application/pdf by 
default in http.accept so you may need to add it in your nutch-site.xml.  
Please take a look at the example below from my nutch-site.xml



<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>
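
To confirm the property is really present in the file Nutch reads, it can be read back with a few lines of Python. This is only a rough sketch: the helper names `get_property` and `accepts_pdf` and the inline sample are made up for illustration, and it assumes the usual Hadoop-style `<configuration>` wrapper around the properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; in practice, read the text of conf/nutch-site.xml instead.
SAMPLE = """<configuration>
<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
</property>
</configuration>"""


def get_property(xml_text, name):
    """Return the <value> text of the named property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def accepts_pdf(xml_text):
    """True if http.accept explicitly lists application/pdf."""
    value = get_property(xml_text, "http.accept")
    if value is None:
        return False
    # Drop any ;q= parameters before comparing media types.
    types = [part.split(";")[0].strip().lower() for part in value.split(",")]
    return "application/pdf" in types


if __name__ == "__main__":
    print(accepts_pdf(SAMPLE))
```

The same `get_property` helper can be pointed at http.content.limit or file.content.limit to see which values are set in nutch-site.xml.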

Good Luck

James

-----Original Message-----
From: kiran chitturi [mailto:[email protected]] 
Sent: Friday, October 19, 2012 6:41 AM
To: [email protected]
Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files

Hi James,

I have increased the limit in nutch-site.xml (
https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have 
created the webpage table based on the fields here ( 
http://nlp.solutions.asia/?p=180).

The database still shows the parseStatus as
'-org.apache.nutch.parse.ParseException: Unable to successfully parse 
content', and the text field is 'null' for those rows. This is the 
screenshot of the MySQL database that I have:
<https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png>

Can you please tell me how I can overcome this problem? This is the 
screenshot of my webpage table:
<https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png>

Many Thanks for your help.

Regards,
Kiran.
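
The trimmed-content theory raised in this thread can be checked by comparing a file's size with the configured content limit. The sketch below is an illustration, not Nutch code: `would_be_truncated` is a made-up helper, and it assumes the semantics described in nutch-default.xml, where a negative limit switches truncation off.

```python
def would_be_truncated(size_bytes, content_limit):
    """Return True if content of size_bytes exceeds a non-negative limit.

    Assumption: a negative limit means "no truncation at all".
    """
    if content_limit < 0:
        return False
    return size_bytes > content_limit


if __name__ == "__main__":
    limit = 6000000                   # value mentioned in this thread
    approx_20_mb = 20 * 1024 * 1024   # rough size of the largest PDF
    print(would_be_truncated(approx_20_mb, limit))
```

If a PDF is larger than the limit, the parser receives an incomplete file, which would be consistent with the parse failure.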

On Wed, Oct 17, 2012 at 6:20 AM, <[email protected]> wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend 
> without problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are 
> set to 6000000.
> 2) I have a custom create webpage table sql script that creates fields 
> that can hold more.  The default table fields are not sufficiently 
> large in most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it 
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is 
> almost 20 MB, much larger than the limit in nutch-default.xml and 
> even larger than the one configured in my nutch-site.xml. Interestingly, 
> that PDF consists entirely of pictures (what looks like text is actually 
> pictures of text), so there may be no real text to parse.
>
> James
>
> ________________________________________
> From: Julien Nioche [[email protected]]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: [email protected]
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with the patch at
> > https://issues.apache.org/jira/browse/NUTCH-1433, connected to a
> > MySQL database.
> >
> > After the {inject, generate, fetch} commands, when I issue the 
> > command (sh bin/nutch parse 1350396627-126726428) the ParserJob 
> > succeeds, but when I look inside the database only one PDF file 
> > out of 10 is parsed.
> >
> > hadoop.log shows this warning: '2012-10-16
> > 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse 
> > content 
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type 
> > application/pdf'.
> >
> > The logs for a successfully parsed file and a failed one are below. 
> > They show that the PDF file '......./agosto.pdf' is parsed and the file 
> > '..../authors.pdf' is not.
> >
> > The parse failed in the same way for all the other PDF files. 
> > However, 'sh bin/nutch parsechecker {url}' works on the 
> > failed PDF files and does not show any errors.
> >
> >
> > > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing plugins:
> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the 
> > > plugin.includes system property, and all claim to support the 
> > > content type application/pdf, but they are not mapped to it in the 
> > > parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type      application/pdf
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date        2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date    2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified     2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:creator        Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:created   2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > creation-date     2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > date      2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmp:creatortool   ScanWizard 5
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > creator   Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > author    Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmptpg:npages     4
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:author       Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > created   Wed Oct 20 17:12:47 EDT 2010
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-save-date    2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:title  ALAN v29n3 - Facilitating Student Connections to Judith 
> > > Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost 
> > > a Woman
> > > 2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing 
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > > 2012-10-16 16:04:30,680 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type      application/pdf
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date        2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date    2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified     2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:created   2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > creation-date     2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > date      2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmp:creatortool   ScanWizard 5
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > modified  2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmptpg:npages     1
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > created   Wed Oct 20 17:00:15 EDT 2010
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-save-date    2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:title  ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > > 2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to 
> > > successfully parse content 
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of 
> > > type application/pdf
> > > 2012-10-16 16:04:30,692 INFO  parse.ParserJob - Parsing 
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
> > >
> >
> > Is there any way I can get more logging to tell whether the error 
> > is specific to the file or comes from the internal parser?
> >
> > Thank you,
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



--
Kiran Chitturi
