
> Sorry about that. I did not notice that the parse codes actually come from
> Nutch and not from Tika.
> no problems!

> The setup is local on a Mac desktop. I am running it from the command line
> and remote debugging through Eclipse (
> http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> ).


> I have set both http.content.limit and file.content.limit to -1. The logs
> just say 'WARN  parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> application/pdf'.

You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (Not
in $NUTCH_HOME/conf/nutch-site.xml, unless you then call 'ant clean runtime'.)
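
For reference, a minimal sketch of what the two entries would look like in
that file, assuming the usual Hadoop-style layout of nutch-site.xml and that
nothing else in your file overrides them (the comments are mine):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- A negative value means fetched HTTP content is not truncated -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Same setting for content read through the file: protocol -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

If these are only present in $NUTCH_HOME/conf, the copy under runtime/local
will not see them until the runtime is rebuilt.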

> All the HTML pages are getting parsed, and when I crawl this page (
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
> some of the PDF files get parsed. About half of the PDF files get parsed
> and the other half do not.

Do the ones that are not parsed have something in common? Their length, for instance?

> I am not sure what is causing the problem since, as you said, parsechecker
> actually works. I want the parser to extract the full text of the PDF as
> well as the metadata and title.


> The metatags are also getting crawled for failed pdf parsing.

They would be discarded because of the failure, even if they
were successfully extracted. The current mechanism does not cater
for semi-failures.


Open Source Solutions for Text Engineering

