Hi Julien,

I have just noticed something when running the parse step.

First, when I ran the parse command 'sh bin/nutch parse
1351188762-1772522488', parsing failed for all of the PDF files.

When I ran the command again, one PDF file got parsed. The next time,
another PDF file got parsed.

After running the parse command as many times as there are PDF files,
all of the PDF files were parsed.

In my case, I ran it 17 times before all the PDF files were parsed;
until then, some were always left unparsed.
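
To reproduce it, the re-runs were effectively the loop below (just a
sketch, not the exact commands I typed; 1351188762-1772522488 is the
segment id from my crawl and 17 is the number of PDF files):

  # re-run the parse job once per PDF file; each pass picks up one more
  for i in $(seq 1 17); do
    sh bin/nutch parse 1351188762-1772522488
  done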

This sounds strange. Do you think it is some configuration problem?
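
For reference, this is what I have in
$NUTCH_HOME/runtime/local/conf/nutch-site.xml, trimmed to the two
content-limit properties (-1 should mean no truncation):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>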

I have tried this twice, and the same thing happened both times.

I am not sure why this is happening.
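
As a sanity check, parsechecker still succeeds when I point it at one
of the failing URLs directly, e.g.:

  sh bin/nutch parsechecker http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf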

Thanks for your help.

Regards,
Kiran.


On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche
<lists.digitalpeb...@gmail.com> wrote:

> Hi
>
>
> > Sorry about that. I did not notice the parse codes are actually Nutch's
> > and not Tika's.
> >
> no problems!
>
>
> > The setup is local on a Mac desktop, and I am using it through the
> > command line, with remote debugging through Eclipse (
> > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> > ).
> >
>
> OK
>
> >
> > I have set both http.content.limit and file.content.limit to -1. The logs
> > just say 'WARN  parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> > application/pdf'.
> >
>
> You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (Not
> in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean runtime'.)
>
>
> >
> > All the HTML files are getting parsed, and when I crawl this page
> > (http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML files
> > and some of the PDF files get parsed. Roughly half of the PDF files
> > get parsed and the other half don't.
> >
>
> Do the ones that are not parsed have something in common? Length?
>
>
> > I am not sure what is causing the problem since, as you said,
> > parsechecker actually works. I want the parser to extract the full
> > text of the PDF as well as the metadata and title.
> >
>
> OK
>
>
> >
> > The metatags are also getting extracted for the PDFs that fail to parse.
> >
>
> Indeed, they would be discarded because of the failure even if they were
> successfully extracted. The current mechanism does not cater for
> semi-failures.
>
> J.
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi
