Hi Julien,

I have just noticed something when running the parse.
First, when I ran the parse command 'sh bin/nutch parse 1351188762-1772522488', parsing failed for all of the PDF files. When I ran the command again, one PDF file got parsed. The next time, another PDF file got parsed. After running the parse command as many times as there are PDF files, all of them were parsed: in my case, I ran it 17 times and all the PDF files got parsed; before that, not everything was parsed. This sounds strange. Do you think it is some configuration problem? I have tried this twice, and the same thing happened both times. I am not sure why this is happening.

Thanks for your help.

Regards,
Kiran.

On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi
>
> > Sorry about that. I did not notice the parse codes are actually Nutch's
> > and not Tika's.
>
> no problems!
>
> > The setup is local on a Mac desktop. I am using the command line and
> > remote debugging through Eclipse
> > (http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse).
>
> OK
>
> > I have set both http.content.limit and file.content.limit to -1. The logs
> > just say 'WARN parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
> > application/pdf'.
>
> you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right? (not
> in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean runtime')
>
> > All the HTML pages are getting parsed, and when I crawl this page
> > (http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
> > some of the PDF files get parsed. About half of the PDF files get parsed
> > and the other half do not.
>
> do the ones that are not parsed have something in common? length?
>
> > I am not sure what is causing the problem, since, as you said,
> > parsechecker actually works. I want the parser to crawl the full text of
> > the PDF and the metadata (title).
>
> OK
>
> > The metatags are also getting crawled for the failed PDF parses.
>
> They would be discarded because of the failure even if they were
> successfully extracted indeed. The current mechanism does not cater
> for semi-failures
>
> J.
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Kiran Chitturi
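[Editor's note: for readers following along, the overrides Julien refers to would be placed in $NUTCH_HOME/runtime/local/conf/nutch-site.xml. The property names and the -1 values are taken from the thread; the description text below is illustrative, not copied from nutch-default.xml.]

```xml
<?xml version="1.0"?>
<configuration>
  <!-- -1 removes the size limit on content fetched over HTTP,
       so large PDFs are not truncated before parsing -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- -1 removes the size limit on content read via the file protocol -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

As Julien notes, in local mode Nutch reads the copy under runtime/local/conf/, so edits to $NUTCH_HOME/conf/nutch-site.xml only take effect after rebuilding the runtime with 'ant clean runtime'.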