Trimmed content?

On 16 October 2012 22:47, kiran chitturi <[email protected]> wrote:
> Hi,
>
> I am running Nutch 2.x with the patch at
> https://issues.apache.org/jira/browse/NUTCH-1433, connected to a MySQL
> database.
>
> After the {inject, generate, fetch} commands, when I issue the command
> (sh bin/nutch parse 1350396627-126726428) the ParserJob succeeds, but
> when I look inside the database only one PDF file out of 10 is parsed.
>
> When I look into hadoop.log it shows the statement '2012-10-16
> 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> application/pdf'.
>
> The logs of the successfully parsed and the failed files are below. They
> show that the PDF file '......./agosto.pdf' is parsed and the file
> '..../authors.pdf' is not.
>
> The same thing happened for all the other PDF files: the parse failed.
> When I run 'sh bin/nutch parsechecker {url}' on the failed PDF files it
> works and does not show any errors.
>
> > 2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dc:creator Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : creator Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : author Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 4
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : meta:author Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:12:47 EDT 2010
> > 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - Facilitating Student Connections to Judith Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > 2012-10-16 16:04:30,631 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > 2012-10-16 16:04:30,680 WARN parse.MetaTagsParser - Found meta tag : content-type application/pdf
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:creation-date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : meta:save-date 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dcterms:created 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : creation-date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : date 2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmp:creatortool ScanWizard 5
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : modified 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : xmptpg:npages 1
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : created Wed Oct 20 17:00:15 EDT 2010
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : producer Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : last-save-date 2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type application/pdf
> > 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
>
> Is there any way I can get more logs to tell whether the error is file
> specific or comes from the internal parser?
>
> Thank you,
> --
> Kiran Chitturi

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
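
As a rough aid for the "file specific or parser" question, the existing hadoop.log can be tallied per URL using only the two message shapes quoted in this thread ("parse.ParserJob - Parsing <url>" and "parse.ParseUtil - Unable to successfully parse content <url> of type <type>"). The sketch below is an illustration, not part of Nutch: the function name `summarize_parse_log` is hypothetical, and it assumes each log entry sits on a single line in the real file (the mail client wrapped them above).

```python
import re

# Patterns mirror the two log messages quoted in this thread.
PARSING = re.compile(r"INFO\s+parse\.ParserJob - Parsing (\S+)")
FAILED = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content (\S+) of type (\S+)"
)


def summarize_parse_log(lines):
    """Return (attempted, failed).

    attempted: URLs in the order ParserJob reported them.
    failed: dict mapping each URL with a ParseUtil warning to its content type.
    """
    attempted = []
    failed = {}
    for line in lines:
        match = PARSING.search(line)
        if match:
            attempted.append(match.group(1))
            continue
        match = FAILED.search(line)
        if match:
            failed[match.group(1)] = match.group(2)
    return attempted, failed


# Usage (the log path is an assumption; adjust to your installation):
#   with open("logs/hadoop.log", encoding="utf-8", errors="replace") as fh:
#       attempted, failed = summarize_parse_log(fh)
#   for url in attempted:
#       print("FAILED" if url in failed else "no warning", url, sep="\t")
```

A URL listed without a warning only means that no ParseUtil failure was logged for it; it does not prove that the parsed text reached the database.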

