Hi,
I am running Nutch 2.x with the patch from
https://issues.apache.org/jira/browse/NUTCH-1433, connected to a MySQL
database.
After the {inject, generate, fetch} commands, when I issue the command (sh
bin/nutch parse 1350396627-126726428), the ParserJob reports success, but
when I look inside the database only one PDF file out of 10 has been parsed.
When I look into hadoop.log, it shows the statement '2012-10-16
16:04:30,682 WARN parse.ParseUtil - Unable to successfully parse content
http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
application/pdf'.
The logs for a successfully parsed file and a failed one are below. They
show that the PDF file '......./agosto.pdf' is parsed and the file
'..../authors.pdf' is not.
The same thing happened for all the other PDF files: the parse failed. When
I run 'sh bin/nutch parsechecker {url}' on the failed PDF files, it works
and does not show any errors.
2012-10-16 16:04:28,150 INFO parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> 2012-10-16 16:04:28,151 INFO parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/pdf, but they are not mapped to it in the parse-plugins.xml
> file
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> content-type application/pdf
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> dcterms:modified 2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> meta:creation-date 2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> meta:save-date 2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> last-modified 2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> dc:creator Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> dcterms:created 2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> creation-date 2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag : date
> 2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> xmp:creatortool ScanWizard 5
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> modified 2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> creator Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> author Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> xmptpg:npages 4
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> meta:author Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> created Wed Oct 20 17:12:47 EDT 2010
> 2012-10-16 16:04:30,549 WARN parse.MetaTagsParser - Found meta tag :
> producer Adobe Acrobat 9.4 Paper Capture Plug-in
> 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag :
> last-save-date 2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,550 WARN parse.MetaTagsParser - Found meta tag :
> dc:title ALAN v29n3 - Facilitating Student Connections to Judith Ortiz
> Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> 2012-10-16 16:04:30,631 INFO parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> 2012-10-16 16:04:30,680 WARN parse.MetaTagsParser - Found meta tag :
> content-type application/pdf
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> meta:creation-date 2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> dcterms:modified 2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> meta:save-date 2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> last-modified 2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> dcterms:created 2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> creation-date 2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag : date
> 2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> xmp:creatortool ScanWizard 5
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> modified 2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> xmptpg:npages 1
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> created Wed Oct 20 17:00:15 EDT 2010
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> producer Adobe Acrobat 9.4 Paper Capture Plug-in
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> last-save-date 2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN parse.MetaTagsParser - Found meta tag :
> dc:title ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> 2012-10-16 16:04:30,682 WARN parse.ParseUtil - Unable to successfully
> parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> application/pdf
> 2012-10-16 16:04:30,692 INFO parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
>
Is there any way I can get more detailed logs to tell whether the error is
specific to the file or comes from the parser itself?
Thank you,
--
Kiran Chitturi