Hi
> Sorry about that. I did not notice the parse codes are actually nutch and
> not tika.

No problems!

> The setup is local on a Mac desktop and I am using it through the command
> line, with remote debugging through Eclipse
> (http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse).

OK.

> I have set both http.content.limit and file.content.limit to -1. The logs
> just say 'WARN parse.ParseUtil - Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> application/pdf'.

You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (Not in
$NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean runtime'.)

> All the HTMLs are getting parsed, and when I crawl this page
> (http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTMLs and some
> of the PDF files get parsed. About half of the PDF files get parsed and the
> other half do not.

Do the ones that are not parsed have something in common? Length?

> I am not sure what is causing the problem, since, as you said, parsechecker
> actually works. I want the parser to crawl the full text of the PDFs as well
> as the metadata and title.

OK.

> The metatags are also getting crawled for the PDFs whose parsing failed.

They would indeed be discarded because of the failure, even if they were
successfully extracted. The current mechanism does not cater for
semi-failures :-)

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
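P.S. For reference, a minimal sketch of those two properties in
$NUTCH_HOME/runtime/local/conf/nutch-site.xml (these are the standard Nutch
property names; -1 disables truncation of fetched content, and truncated
content is a common reason large PDFs fail to parse):

    <?xml version="1.0"?>
    <configuration>
      <!-- do not truncate content fetched over HTTP -->
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
      <!-- same for content fetched via the file:// protocol -->
      <property>
        <name>file.content.limit</name>
        <value>-1</value>
      </property>
    </configuration>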
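You can then check one of the failing URLs directly from runtime/local with
something like:

    bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf

If that succeeds on the command line but the same URL fails during the crawl,
truncation during fetching is the usual suspect, which is why the content
limit settings matter.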