Please make sure to recompile as well; see the configuration sketches after this message.

On Wed, Oct 31, 2012 at 5:55 PM, <[email protected]> wrote:

> Hi,
>
> If you change this line
>
>     log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
>
> in runtime/local/conf/log4j.properties to
>
>     log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
>
> you should see more information about the parse process in the logs.
>
> Alex.
>
>
> -----Original Message-----
> From: kiran chitturi <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Oct 31, 2012 10:01 am
> Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
>
> Hi Julien,
>
> I have just noticed something when running the parse.
>
> First, when I ran the parse command 'sh bin/nutch parse 1351188762-1772522488',
> the parsing of all the PDF files failed.
>
> When I ran the command again, one PDF file got parsed. The next time, another
> PDF file got parsed.
>
> Once I had run the parse command as many times as there are PDF files, all of
> the PDF files were parsed. In my case, I ran it 17 times and then all the PDF
> files were parsed; before that, not everything was parsed.
>
> This sounds strange. Do you think it is some configuration problem?
>
> I have tried this twice and the same thing happened both times.
>
> I am not sure why this is happening.
>
> Thanks for your help.
>
> Regards,
> Kiran.
>
>
> On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi
>>
>> > Sorry about that. I did not notice the parse codes are actually Nutch's
>> > and not Tika's.
>>
>> No problem!
>>
>> > The setup is local on a Mac desktop; I am using Nutch from the command
>> > line, with remote debugging through Eclipse (
>> > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
>> > ).
>>
>> OK
>>
>> > I have set both http.content.limit and file.content.limit to -1. The logs
>> > just say 'WARN parse.ParseUtil - Unable to successfully parse content
>> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
>> > application/pdf'.
>>
>> You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (Not
>> in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean runtime'.)
>>
>> > All the HTML pages are getting parsed, and when I crawl this page (
>> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
>> > some of the PDF files get parsed: roughly half of the PDF files get
>> > parsed and the other half do not.
>>
>> Do the ones that are not parsed have something in common? Length?
>>
>> > I am not sure what is causing the problem, since, as you said,
>> > parsechecker actually works. I want the parser to extract the full text
>> > of the PDFs as well as the metadata and title.
>>
>> OK
>>
>> > The metatags are also getting extracted for the PDFs whose parse fails.
>>
>> They would indeed be discarded because of the failure, even if they were
>> successfully extracted. The current mechanism does not cater for
>> semi-failures.
>>
>> J.
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
> --
> Kiran Chitturi
>
-- Lewis
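A quick way to confirm Alex's DEBUG change has taken effect is to re-run the parse and watch the parser output in the local log. A minimal sketch, assuming the commands are run from runtime/local and the stock log location (logs/hadoop.log is the Nutch default, not something stated in this thread):

    # runtime/local/conf/log4j.properties (the line Alex points at, switched to DEBUG)
    log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout

    # re-run the parse with the batch id from the thread and follow the log
    bin/nutch parse 1351188762-1772522488
    tail -f logs/hadoop.log | grep -i parse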

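For reference, the two limits Kiran mentions would look like this in runtime/local/conf/nutch-site.xml. This is only a sketch of the properties named in the thread: a value of -1 disables truncation of fetched content, and edits made under $NUTCH_HOME/conf only reach the runtime after 'ant clean runtime' (hence the "recompile" reminder above).

    <!-- runtime/local/conf/nutch-site.xml -->
    <configuration>
      <property>
        <name>http.content.limit</name>
        <value>-1</value>   <!-- -1 = do not truncate content fetched over HTTP -->
      </property>
      <property>
        <name>file.content.limit</name>
        <value>-1</value>   <!-- -1 = do not truncate content fetched via file:// -->
      </property>
    </configuration>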

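To compare against the failing ParserJob run, the parsechecker call from the subject line can be pointed directly at one of the PDFs, using the URL quoted in the thread:

    bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf

If that prints the full text while 'bin/nutch parse' keeps failing on the same URL, the difference is more likely in the job's configuration (e.g. content truncation) than in the PDF parser itself.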