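Could the content have been trimmed? If the fetcher truncated the PDFs (see
http.content.limit), the parser may fail on the incomplete files.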

On 16 October 2012 22:47, kiran chitturi <[email protected]> wrote:

> Hi,
>
> I am running Nutch 2.x with the patch from
> https://issues.apache.org/jira/browse/NUTCH-1433, connected to a MySQL
> database.
>
> After the {inject, generate, fetch} commands, when I issue the command (sh
> bin/nutch parse 1350396627-126726428) the ParserJob succeeds, but when I
> look inside the database only one PDF file out of 10 has been parsed.
>
> When I look into hadoop.log, it shows statements like '2012-10-16
> 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> application/pdf'.
>
> The logs of successfully parsed and failed ones are below. The logs below
> show that pdf file '......./agosto.pdf' is parsed and the file
> '..../authors.pdf' is not parsed.
>
> The same thing happened for all the other PDF files: the parse failed. When I
> run 'sh bin/nutch parsechecker {url}' on the failed PDF files, it works
> and does not show any errors.
>
>
> > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : content-type      application/pdf
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : dcterms:modified  2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : meta:creation-date        2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : meta:save-date    2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : last-modified     2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : dc:creator        Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : dcterms:created   2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : creation-date     2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : date      2010-10-20T21:12:47Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : xmp:creatortool   ScanWizard 5
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : modified  2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : creator   Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : author    Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : xmptpg:npages     4
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : meta:author       Denise E. Agosto
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : created   Wed Oct 20 17:12:47 EDT 2010
> > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag : producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag : last-save-date    2010-11-02T20:51:27Z
> > 2012-10-16 16:04:30,550 WARN  parse.MetaTagsParser - Found meta tag : dc:title  ALAN v29n3 - Facilitating Student Connections to Judith Ortiz Cofer's The Line of the Sun and Esmeralda Santiago's Almost a Woman
> > 2012-10-16 16:04:30,631 INFO  parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > 2012-10-16 16:04:30,680 WARN  parse.MetaTagsParser - Found meta tag : content-type      application/pdf
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : meta:creation-date        2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : dcterms:modified  2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : meta:save-date    2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : last-modified     2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : dcterms:created   2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : creation-date     2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : date      2010-10-20T21:00:15Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : xmp:creatortool   ScanWizard 5
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : modified  2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : xmptpg:npages     1
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : created   Wed Oct 20 17:00:15 EDT 2010
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : producer  Adobe Acrobat 9.4 Paper Capture Plug-in
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : last-save-date    2010-11-02T20:51:57Z
> > 2012-10-16 16:04:30,681 WARN  parse.MetaTagsParser - Found meta tag : dc:title  ALAN v29n3 - INSTRUCTIONS FOR AUTHORS
> > 2012-10-16 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type application/pdf
> > 2012-10-16 16:04:30,692 INFO  parse.ParserJob - Parsing http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/brown.pdf
> >
>
> Is there any way I can get more logging to find out whether the error is
> file specific or comes from the parser itself?
>
> Thank you,
> --
> Kiran Chitturi
>
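To see whether the failures are file specific, it may help to list which URLs
ParserJob attempted and which ones ParseUtil gave up on. Below is a minimal
sketch, assuming hadoop.log uses the message wording shown in the excerpt
above (other Nutch versions may word it differently). The function names
summarize and report are hypothetical helpers, not part of Nutch.

```python
import re

# Patterns inferred from the log excerpt in this thread (an assumption):
# "parse.ParserJob - Parsing <url>" marks an attempt,
# "parse.ParseUtil - Unable to successfully parse content <url> of type <type>"
# marks a failure.
PARSING = re.compile(r"parse\.ParserJob - Parsing\s+(\S+)")
FAILED = re.compile(
    r"parse\.ParseUtil - Unable to successfully parse content\s+(\S+)"
    r"\s+of type\s+(\S+)"
)


def summarize(log_text):
    """Return (attempted, failed): attempted is a list of URLs in log order,
    failed maps each failed URL to its reported content type."""
    attempted = PARSING.findall(log_text)
    failed = {url: ctype for url, ctype in FAILED.findall(log_text)}
    return attempted, failed


def report(path):
    """Print one line per attempted URL, marking the ones that failed."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        attempted, failed = summarize(handle.read())
    for url in attempted:
        status = "FAILED (" + failed[url] + ")" if url in failed else "ok"
        print(status, url)
```

Calling report() with the path to hadoop.log prints the status per URL.
Comparing the failed list against the sizes of the files is one way to test
the truncation idea, although it does not by itself show why a parse failed.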



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
