Hi,

I am running Nutch 2.x with the patch from
https://issues.apache.org/jira/browse/NUTCH-1433 applied, connected to a
MySQL database.

After the {inject, generate, fetch} commands, when I issue the command (sh
bin/nutch parse 1350396627-126726428), the ParserJob reports success, but
when I look inside the database only one PDF file out of 10 has been parsed.

When I look into hadoop.log, it shows a warning like this: '2012-10-16
16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content
http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
application/pdf'.
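
To collect every URL that hits this warning, something like the following
Python sketch could be run over the log. It assumes each warning sits on a
single line in hadoop.log (the line wrapping in this mail comes from the mail
client), and the name failed_parses is only illustrative, not part of Nutch.

```python
import re

# Matches the ParseUtil warning as quoted above, assuming it is on one line.
FAILED_PARSE = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content "
    r"(?P<url>\S+) of type (?P<mime>\S+)"
)


def failed_parses(lines):
    """Return (url, mime_type) pairs for every failed-parse warning found."""
    failures = []
    for line in lines:
        match = FAILED_PARSE.search(line)
        if match:
            failures.append((match.group("url"), match.group("mime")))
    return failures
```

Passing an open file handle for hadoop.log as `lines` should be enough, since
the function only iterates over it line by line.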

The logs for one successfully parsed file and one failed file are below.
They show that the PDF file '......./agosto.pdf' is parsed and the file
'..../authors.pdf' is not.

The same thing happened for all the other PDF files: the parse failed. When
I run 'sh bin/nutch parsechecker {url}' on the failed PDF files, it works
and does not show any errors.


2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/pdf, but they are not mapped to it  in the parse-plugins.xml
> file
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> content-type      application/pdf
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> dcterms:modified  2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> meta:creation-date        2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> meta:save-date    2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> last-modified     2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> dc:creator        Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> dcterms:created   2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> creation-date     2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : date
>      2010-10-20T21:12:47Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> xmp:creatortool   ScanWizard 5
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> modified  2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> creator   Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> author    Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> xmptpg:npages     4
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> meta:author       Denise E. Agosto
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> created   Wed Oct 20 17:12:47 EDT 2010
> 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> last-save-date    2010-11-02T20:51:27Z
> 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag :
> dc:title  ALAN v29n3 - Facilitating Student Connections to Judith Ortiz
> Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> 2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> 2012-10-16 16:04:30,680 WARN  parse.MetaTagsParser - Found meta tag :
> content-type      application/pdf
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> meta:creation-date        2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> dcterms:modified  2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> meta:save-date    2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> last-modified     2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> dcterms:created   2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> creation-date     2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : date
>      2010-10-20T21:00:15Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> xmp:creatortool   ScanWizard 5
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> modified  2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> xmptpg:npages     1
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> created   Wed Oct 20 17:00:15 EDT 2010
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> last-save-date    2010-11-02T20:51:57Z
> 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag :
> dc:title  ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> 2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully
> parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> application/pdf
> 2012-10-16 16:04:30,692 INFO  parse.ParserJob - Parsing
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
>

Is there any way I can get more detailed logs to find out whether the error
is specific to these files or comes from the parser internals?

Thank you,
-- 
Kiran Chitturi
