nutch parse fails

kiran chitturi Wed, 31 Oct 2012 07:09:08 -0700

Hi Julien,

Sorry about that. I did not notice that the parse codes actually come from
Nutch and not Tika.

The setup is local on a Mac desktop. I am running from the command line and
remote debugging through Eclipse (
http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
).

I have set both http.content.limit and file.content.limit to -1. The logs
just say 'WARN  parse.ParseUtil - Unable to successfully parse content
http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
application/pdf'.
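
One way to confirm that the limits are really present in the configuration
file the job reads is to print the values back out. This is only a minimal
sketch: the function name config_value is made up for illustration, and the
path in the usage note is an assumption about a local runtime layout, not
something taken from this setup.

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop-style
# configuration file such as nutch-site.xml.
# config_value is a hypothetical helper, not part of Nutch.
config_value() {
  # $1: property name, $2: path to the configuration file
  awk -v name="$1" '
    # Remember that the wanted <name> element has been seen.
    index($0, "<name>" name "</name>") { found = 1 }
    # The first <value> element after it holds the setting.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$2"
}
```

For example, `config_value http.content.limit runtime/local/conf/nutch-site.xml`
should print -1 if the override is in place (the path is assumed). An empty
result would suggest the property is not set in that file.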

All the HTML pages are getting parsed. When I crawl this page (
http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
some of the PDF files get parsed: about half of the PDF files get parsed and
the other half do not.
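
To see exactly which PDFs fall into the failing half, the warning lines can
be pulled out of the log. This is a rough sketch that assumes each warning
sits on a single line in the format quoted above; the function name
failed_parse_urls and the log path in the usage note are assumptions.

```shell
#!/bin/sh
# List the distinct URLs reported in "Unable to successfully parse content"
# warnings. failed_parse_urls is a hypothetical helper, not part of Nutch.
failed_parse_urls() {
  # $1: path to the log file
  grep 'Unable to successfully parse content' "$1" \
    | sed 's/.*parse content \(.*\) of type.*/\1/' \
    | sort -u
}
```

Running `failed_parse_urls logs/hadoop.log` (log path assumed) gives one URL
per line, and each one can then be rechecked on its own with
`sh bin/nutch parsechecker -dumpText <url>`.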

I am not sure what is causing the problem since, as you said, parsechecker
actually works. I want the parser to extract the full text of the PDF along
with the metadata and title.

The metatags are still getting crawled for the PDFs that fail to parse.

Please let me know if any additional information is needed.

Thanks for the help.

Regards.

On Wed, Oct 31, 2012 at 9:59 AM, Julien Nioche <
[email protected]> wrote:

> hi Kiran
>
>
> > Does anyone know why I am having this conflict? I feel that's because of
> > the Tika parser parse codes (major code and minor code), but I have not
> > been able to figure out why this happened.
>
>
> As explained earlier, you are confusing cause and consequence here. The
> parsing does not fail because of the codes; the codes indicate that it
> fails.
>
> There is no point in bothering people on the Tika list, as the codes are
> not related to Tika but are 100% Nutch.
>
> Please give more info about your setup: local? pseudo-distributed?
> running from the command line? Have you checked that the content limit is
> really taken into account? What messages are you getting in the logs?
> etc.
>
> Thanks
>
> Julien
>
>
> On 31 October 2012 13:53, kiran chitturi <[email protected]>
> wrote:
>
> > Hi,
> >
> > I have mailed the list previously about the Tika parse codes (major code
> > and minor code). As Julien pointed out here
> > http://www.mail-archive.com/user%40nutch.apache.org/msg07950.html 'sh
> > bin/nutch parsechecker -dumpText
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf' works, but
> > when I run 'sh bin/nutch parse <crawlId>' on a batch that includes the
> > above PDF file, I see this message in the logs:
> >
> > 'WARN  parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> > application/pdf'
> >
> > Does anyone know why I am having this conflict? I feel that's because of
> > the Tika parser parse codes (major code and minor code), but I have not
> > been able to figure out why this happened.
> >
> > Did anyone encounter this problem before? I am also going to post to the
> > Tika mailing list to ask what the codes mean.
> >
> >
> > Regards,
> >
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

-- 
Kiran Chitturi

Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
