Hi

> Sorry about that. I did not notice the parsecodes are actually Nutch and
> not Tika.
>

No problem!


> The setup is local on a Mac desktop and I am using it through the command
> line with remote debugging through Eclipse (
> http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> ).
>

OK

>
> I have set both http.content.limit and file.content.limit to -1. The logs
> just say 'WARN  parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> application/pdf'.
>

You set them in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (not
in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean runtime')
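
For reference, the two entries in that nutch-site.xml should look something
like this (a minimal sketch; the description text is just for readability):

  <configuration>
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>-1 disables the size limit on content fetched over HTTP.</description>
    </property>
    <property>
      <name>file.content.limit</name>
      <value>-1</value>
      <description>-1 disables the size limit on content fetched via the file protocol.</description>
    </property>
  </configuration>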


>
> All the HTML pages are getting parsed, and when I crawl this page (
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
> some of the PDF files get parsed. Roughly half of the PDF files get parsed
> and the other half don't.
>

Do the ones that are not parsed have something in common? Length?


> I am not sure what is causing the problem since, as you said, parsechecker
> actually works. I want the parser to crawl the full text of the PDFs as
> well as the metadata and title.
>

OK
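
If you have not already, it may be worth running parsechecker with -dumpText
against the exact URL from the log message and comparing its output with what
the crawl produces, something along these lines:

  bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf

If that returns the full text but the same URL fails during the crawl, the
difference is most likely in the configuration actually picked up at crawl
time.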


>
> The metatags are also getting crawled for the PDFs that fail parsing.
>

Indeed, they would be discarded because of the failure even if they were
successfully extracted. The current mechanism does not cater for
semi-failures.

J.

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
