Hi Kiran,

I agree with Julien: it is probably trimmed content.
I regularly parse PDFs with Nutch 2.x with MySQL as the backend without problems (even without the patch). The differences between my setup and the standard setup that may be relevant:

1) In nutch-site.xml, file.content.limit and http.content.limit are both set to 6000000.
2) I use a custom SQL script to create the webpage table, with fields that can hold more data. The default table fields are not large enough in most real-world situations. http://nlp.solutions.asia/?p=180

I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it successfully parsed all of the PDFs except one, v29n3.pdf. That PDF is almost 20 MB, much larger than the limit in nutch-default.xml and even larger than the one configured in my nutch-site.xml. Interestingly, that PDF also consists entirely of pictures (what looks like text is actually pictures of text), so there may be no real text to parse.

James

________________________________________
From: Julien Nioche [[email protected]]
Sent: Wednesday, October 17, 2012 4:17 PM
To: [email protected]
Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files

trimmed content?

On 16 October 2012 22:47, kiran chitturi <[email protected]> wrote:

> Hi,
>
> I am running Nutch 2.x with patch here at
> https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a mysql
> database.
>
> After the {inject, generate, fetch} commands when i issue the command (sh
> bin/nutch parse 1350396627-126726428) the parserJob was success but when i
> look inside the database only one pdf file is parsed out of 10.
>
> When i look in to hadoop.log it shows the statement '2012-10-16
> 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> application/pdf' like this.
>
> The logs of successfully parsed and failed ones are below. The logs below
> show that pdf file '......./agosto.pdf' is parsed and the file
> '..../authors.pdf' is not parsed.
>
> The same thing happened for all other pdf files, the parse failed. When i
> do the 'sh bin/nutch parsechecker {url}' it worked with the failed pdf
> files and it does not show any errors.
>
> > 2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dc:creator Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creator Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : author Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 4
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:author Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:12:47 EDT 2010
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - Facilitating Student Connections to Judith Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > 2012-10-16 16:04:30,631 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > 2012-10-16 16:04:30,680 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 1
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:00:15 EDT 2010
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type application/pdf
> > 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
>
> Is there any way i can get more logs about knowing whether the error is
> file specific or error from internal parser ?
>
> Thank you,
> --
> Kiran Chitturi
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
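
A note on the "trimmed content" diagnosis discussed in this thread: the rough sketch below models how a content limit decides whether a fetched file is cut off before it reaches the parser. It is not Nutch code. It assumes the rule described for http.content.limit and file.content.limit (a non-negative limit truncates anything longer, a negative limit disables truncation) and assumes 65536 as the stock value in nutch-default.xml, which should be checked against the local copy. The function name and the file sizes are hypothetical; only the 6000000 limit and the "almost 20 megs" figure for v29n3.pdf come from the thread.

```python
# Sketch only: models the truncation rule, it does not call Nutch.

DEFAULT_LIMIT = 65536    # assumed stock value in nutch-default.xml; verify locally
CUSTOM_LIMIT = 6000000   # value James reports using in nutch-site.xml


def is_truncated(size_bytes: int, limit: int) -> bool:
    """Return True if content of this size would be cut off at the limit.

    A negative limit is treated as "no truncation".
    """
    return limit >= 0 and size_bytes > limit


# Hypothetical sizes in bytes. Only the rough size of v29n3.pdf is taken
# from the thread; the other entry is made up for illustration.
sizes = {
    "small-example.pdf": 300_000,
    "v29n3.pdf": 20_000_000,
}

for name, size in sizes.items():
    print(
        name,
        "default limit:", is_truncated(size, DEFAULT_LIMIT),
        "custom limit:", is_truncated(size, CUSTOM_LIMIT),
    )
```

Under these assumptions a file of a few hundred kilobytes would be trimmed with the default limit but not with the raised one, while a file of roughly 20 MB would be trimmed in both cases. A trimmed PDF is typically incomplete, which would be consistent with the parse failing in the parser job.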

