Re: Nutch 2.x : ParseUtil failing for some pdf files

Lewis John Mcgibbney Sat, 20 Oct 2012 07:21:00 -0700

Hi Kiran,

Julien just committed an upgrade of the Tika dependency in 2.x. Can
you please make another attempt to get the parse stage working
successfully?


Thanks

Lewis

On Thu, Oct 18, 2012 at 10:41 PM, kiran chitturi
<[email protected]> wrote:
> Hi James,
>
> I have increased the limit in nutch-site.xml (
> https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have
> created the webpage table based on the fields here (
> http://nlp.solutions.asia/?p=180).
>
> The database still shows the parseStatus as
> 'org.apache.nutch.parse.ParseException: Unable to successfully parse
> content'. The text field is 'null' for those rows. This is a
> screenshot
> <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png>
> of the MySQL database that I have.
>
> Can you please tell me how I can overcome this problem? This is a
> screenshot
> <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png>
> of my webpage table.
>
> Many thanks for your help.
>
> Regards,
> Kiran.
>
> On Wed, Oct 17, 2012 at 6:20 AM, <[email protected]> wrote:
>
>> Hi Kiran,
>>
>> I agree with Julien; it is probably trimmed content.
>>
>> I regularly parse PDFs with Nutch 2.x with MySQL as the backend without
>> problem (even without the patch).
>>
>> The differences between my setup and the standard setup that may be
>> applicable:
>>
>> 1) In nutch-site.xml, file.content.limit and http.content.limit are set
>> to 6000000.
>> 2) I have a custom SQL script that creates the webpage table with fields
>> that can hold more. The default table fields are not sufficiently large in
>> most real-world situations. http://nlp.solutions.asia/?p=180
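
A minimal sketch, in Python, of confirming that the two limits from item 1 are
actually present in a nutch-site.xml. The sample XML is an assumption that
mirrors the usual Hadoop-style property layout, and read_content_limits is a
hypothetical helper, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# The two properties named in item 1 above.
LIMIT_KEYS = ("http.content.limit", "file.content.limit")


def read_content_limits(xml_text):
    """Return {property name: int value} for the content-limit properties found."""
    root = ET.fromstring(xml_text)
    limits = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in LIMIT_KEYS and value:
            limits[name] = int(value)
    return limits


# Assumed sample content; a real nutch-site.xml will hold other properties too.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>6000000</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>6000000</value>
  </property>
</configuration>"""

print(read_content_limits(SAMPLE))
```

To check a real file, read its text and pass it to read_content_limits; a
missing key in the result means the value from nutch-default.xml is in effect.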
>>
>> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it
>> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is
>> almost 20 MB, much larger than the limit in nutch-default.xml and even
>> larger than the one configured in my nutch-site.xml. Interestingly, that PDF
>> also consists entirely of pictures (what looks like text is actually pictures
>> of text), so there may be no real text to parse.
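
The trimmed-content reasoning above comes down to a size comparison, sketched
here in Python. is_trimmed is a hypothetical helper, and the 20,000,000-byte
size is an assumption standing in for "almost 20 megs". Treating a negative
limit as "no truncation" follows the wording of the property description in
nutch-default.xml.

```python
def is_trimmed(size_bytes, content_limit):
    """True if a document of size_bytes would be cut off at content_limit.

    A negative limit is treated as "no truncation".
    """
    return content_limit >= 0 and size_bytes > content_limit


ASSUMED_PDF_SIZE = 20 * 1000 * 1000  # assumption: roughly the size of v29n3.pdf
SITE_LIMIT = 6000000                 # value quoted for nutch-site.xml above

print(is_trimmed(ASSUMED_PDF_SIZE, SITE_LIMIT))  # True: content would be cut
print(is_trimmed(ASSUMED_PDF_SIZE, -1))          # False: no limit applied
```

A PDF that is cut off partway is typically missing its trailing structure, so
a parse failure on trimmed content is plausible even when the full file is fine.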
>>
>> James
>>
>> ________________________________________
>> From: Julien Nioche [[email protected]]
>> Sent: Wednesday, October 17, 2012 4:17 PM
>> To: [email protected]
>> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>>
>> trimmed content?
>>
>> On 16 October 2012 22:47, kiran chitturi <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > I am running Nutch 2.x with the patch at
>> > https://issues.apache.org/jira/browse/NUTCH-1433, connected to a
>> > MySQL database.
>> >
>> > After the {inject, generate, fetch} commands, when I issue the command (sh
>> > bin/nutch parse 1350396627-126726428) the ParserJob reports success, but
>> > when I look inside the database only one PDF file out of 10 is parsed.
>> >
>> > When I look into hadoop.log it shows a statement like this: '2012-10-16
>> > 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content
>> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
>> > application/pdf'.
>> >
>> > The logs for a successfully parsed file and a failed one are below. They
>> > show that the PDF file '......./agosto.pdf' is parsed and the file
>> > '..../authors.pdf' is not parsed.
>> >
>> > The same thing happened for all the other PDF files: the parse failed. When
>> > I run 'sh bin/nutch parsechecker {url}' on the failed PDF files, it works
>> > and does not show any errors.
>> >
>> >
>> > > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
>> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
>> > > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing plugins:
>> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > > plugin.includes system property, and all claim to support the content type
>> > > application/pdf, but they are not mapped to it  in the parse-plugins.xml file
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > content-type      application/pdf
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dcterms:modified  2010-11-02T20:51:27Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > meta:creation-date        2010-10-20T21:12:47Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > meta:save-date    2010-11-02T20:51:27Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > last-modified     2010-11-02T20:51:27Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dc:creator        Denise E. Agosto
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dcterms:created   2010-10-20T21:12:47Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > creation-date     2010-10-20T21:12:47Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > date      2010-10-20T21:12:47Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > xmp:creatortool   ScanWizard 5
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > modified  2010-11-02T20:51:27Z
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > creator   Denise E. Agosto
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > author    Denise E. Agosto
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > xmptpg:npages     4
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > meta:author       Denise E. Agosto
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > created   Wed Oct 20 17:12:47 EDT 2010
>> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
>> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
>> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
>> > > last-save-date    2010-11-02T20:51:27Z
>> > > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dc:title  ALAN v29n3 - Facilitating Student Connections to Judith Ortiz
>> > > Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
>> > > 2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing
>> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
>> > > 2012-10-16 16:04:30,680 WARN  parse.MetaTagsParser - Found meta tag :
>> > > content-type      application/pdf
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > meta:creation-date        2010-10-20T21:00:15Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dcterms:modified  2010-11-02T20:51:57Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > meta:save-date    2010-11-02T20:51:57Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > last-modified     2010-11-02T20:51:57Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dcterms:created   2010-10-20T21:00:15Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > creation-date     2010-10-20T21:00:15Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > date      2010-10-20T21:00:15Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > xmp:creatortool   ScanWizard 5
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > modified  2010-11-02T20:51:57Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > xmptpg:npages     1
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > created   Wed Oct 20 17:00:15 EDT 2010
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > producer  Adobe Acrobat 9.4 Paper Capture Plug-in
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > last-save-date    2010-11-02T20:51:57Z
>> > > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
>> > > dc:title  ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
>> > > 2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully
>> > > parse content
>> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
>> > > application/pdf
>> > > 2012-10-16 16:04:30,692 INFO  parse.ParserJob - Parsing
>> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
>> > >
>> >
>> > Is there any way I can get more logs to tell whether the error is
>> > file-specific or comes from the internal parser?
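
One way to narrow this down with the logs already available is to collect
every URL that ParseUtil reports as failed, then compare those files (size,
type) against the ones that parsed. The Python sketch below assumes each log
entry sits on a single line in hadoop.log, as in the WARN message quoted above;
failed_parses is a hypothetical helper, and it does not make Nutch log any
additional detail.

```python
import re

# Matches the ParseUtil warning quoted in this thread.
FAIL_RE = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content "
    r"(?P<url>\S+) of type (?P<mime>\S+)"
)


def failed_parses(lines):
    """Yield (url, mime_type) for every ParseUtil failure line."""
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            yield match.group("url"), match.group("mime")


# Two entries taken from the log excerpt in this thread, joined onto one line each.
SAMPLE_LOG = [
    "2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing "
    "http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf",
    "2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully "
    "parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf "
    "of type application/pdf",
]

for url, mime in failed_parses(SAMPLE_LOG):
    print(url, mime)
```

For a real run, pass an open file object for hadoop.log instead of SAMPLE_LOG;
the location of that file depends on the installation.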
>> >
>> > Thank you,
>> > --
>> > Kiran Chitturi
>> >
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
>
> --
> Kiran Chitturi



-- 
Lewis
