Kiran,

I took a look at your nutch-site.xml and did not see a setting for http.accept. I believe nutch-default.xml does not include application/pdf in http.accept by default, so you may need to add it in your nutch-site.xml. Please take a look at the example below from my nutch-site.xml:
<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.</description>
</property>

Good Luck
James

-----Original Message-----
From: kiran chitturi [mailto:[email protected]]
Sent: Friday, October 19, 2012 6:41 AM
To: [email protected]
Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files

Hi James,

I have increased the limit in nutch-site.xml (https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have created the webpage table based on the fields here (http://nlp.solutions.asia/?p=180).

The database still shows the parseStatus as '-org.apache.nutch.parse.ParseException: Unable to successfully parse content'. The text field is 'null' for them.

This is the screenshot <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.27.13%20PM.png> of the MySQL database that I have.

Can you please tell me how I can overcome this problem?

This is the screenshot <https://raw.github.com/salvager/nutch/master/Screen%20shot%202012-10-18%20at%205.36.43%20PM.png> of my webpage table.

Many thanks for your help.

Regards,
Kiran.

On Wed, Oct 17, 2012 at 6:20 AM, <[email protected]> wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend
> without problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are
>    set to 6000000.
> 2) I have a custom create webpage table sql script that creates fields
>    that can hold more. The default table fields are not sufficiently
>    large in most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is
> almost 20 megs, much larger than the limit in nutch-default.xml and
> even larger than that configured in my nutch-site.xml. Interestingly
> that PDF is also completely pictures (what looks like text is actually
> pictures of text) so there may be no real text to parse.
>
> James
>
> ________________________________________
> From: Julien Nioche [[email protected]]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: [email protected]
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi <[email protected]> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with patch here at
> > https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a
> > mysql database.
> >
> > After the {inject, generate, fetch} commands when i issue the
> > command (sh bin/nutch parse 1350396627-126726428) the parserJob was
> > success but when i look inside the database only one pdf file is
> > parsed out of 10.
> >
> > When i look in to hadoop.log it shows the statement '2012-10-16
> > 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse
> > content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> > application/pdf' like this.
> >
> > The logs of successfully parsed and failed ones are below. The logs
> > below show that pdf file '......./agosto.pdf' is parsed and the file
> > '..../authors.pdf' is not parsed.
> >
> > The same thing happened for all other pdf files, the parse failed.
> > When i do the 'sh bin/nutch parsechecker {url}' it worked with the
> > failed pdf files and it does not show any errors.
> >
> >
> > > 2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dc:creator Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creator Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : author Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 4
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:author Denise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:12:47 EDT 2010
> > > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - Facilitating Student Connections to Judith Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > > 2012-10-16 16:04:30,631 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > > 2012-10-16 16:04:30,680 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:00:15Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 1
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:00:15 EDT 2010
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:57Z
> > > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > > 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type application/pdf
> > > 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
> > >
> >
> > Is there any way i can get more logs about knowing whether the error
> > is file specific or error from internal parser ?
> >
> > Thank you,
> > --
> > Kiran Chitturi
> >
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

--
Kiran Chitturi

