Please make sure to recompile as well; see the configuration sketches after this message.

On Wed, Oct 31, 2012 at 5:55 PM, <[email protected]> wrote:

> Hi,
>
> If you change this line
>
>     log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
>
> in runtime/local/conf/log4j.properties to
>
>     log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
>
> you should see more information about the parse process in the logs.
>
> Alex.
>
>
> -----Original Message-----
> From: kiran chitturi <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Oct 31, 2012 10:01 am
> Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
>
> Hi Julien,
>
> I have just noticed something when running the parse.
>
> First, when I ran the parse command 'sh bin/nutch parse 1351188762-1772522488',
> the parsing of all the PDF files failed.
>
> When I ran the command again, one PDF file got parsed. The next time, another
> PDF file got parsed.
>
> Once I had run the parse command as many times as there are PDF files, all of
> the PDF files were parsed. In my case, I ran it 17 times and then all the PDF
> files were parsed; before that, not everything was parsed.
>
> This sounds strange. Do you think it is some configuration problem?
>
> I have tried this twice and the same thing happened both times.
>
> I am not sure why this is happening.
>
> Thanks for your help.
>
> Regards,
> Kiran.
>
>
> On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi
>>
>> > Sorry about that. I did not notice the parse codes are actually Nutch's
>> > and not Tika's.
>>
>> No problem!
>>
>> > The setup is local on a Mac desktop; I am using Nutch from the command
>> > line, with remote debugging through Eclipse (
>> > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
>> > ).
>>
>> OK
>>
>> > I have set both http.content.limit and file.content.limit to -1. The logs
>> > just say 'WARN parse.ParseUtil - Unable to successfully parse content
>> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type
>> > application/pdf'.
>>
>> You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right? (Not
>> in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean runtime'.)
>>
>> > All the HTML pages are getting parsed, and when I crawl this page (
>> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML pages and
>> > some of the PDF files get parsed: roughly half of the PDF files get
>> > parsed and the other half do not.
>>
>> Do the ones that are not parsed have something in common? Length?
>>
>> > I am not sure what is causing the problem, since, as you said,
>> > parsechecker actually works. I want the parser to extract the full text
>> > of the PDFs as well as the metadata and title.
>>
>> OK
>>
>> > The metatags are also getting extracted for the PDFs whose parse fails.
>>
>> They would indeed be discarded because of the failure, even if they were
>> successfully extracted. The current mechanism does not cater for
>> semi-failures.
>>
>> J.
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
> --
> Kiran Chitturi
>
-- Lewis
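A quick way to confirm Alex's DEBUG change has taken effect is to re-run the parse and watch the parser output in the local log. A minimal sketch, assuming the commands are run from runtime/local and the stock log location (logs/hadoop.log is the Nutch default, not something stated in this thread):

    # runtime/local/conf/log4j.properties (the line Alex points at, switched to DEBUG)
    log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout

    # re-run the parse with the batch id from the thread and follow the log
    bin/nutch parse 1351188762-1772522488
    tail -f logs/hadoop.log | grep -i parse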

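For reference, the two limits Kiran mentions would look like this in runtime/local/conf/nutch-site.xml. This is only a sketch of the properties named in the thread: a value of -1 disables truncation of fetched content, and edits made under $NUTCH_HOME/conf only reach the runtime after 'ant clean runtime' (hence the "recompile" reminder above).

    <!-- runtime/local/conf/nutch-site.xml -->
    <configuration>
      <property>
        <name>http.content.limit</name>
        <value>-1</value>   <!-- -1 = do not truncate content fetched over HTTP -->
      </property>
      <property>
        <name>file.content.limit</name>
        <value>-1</value>   <!-- -1 = do not truncate content fetched via file:// -->
      </property>
    </configuration>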

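To compare against the failing ParserJob run, the parsechecker call from the subject line can be pointed directly at one of the PDFs, using the URL quoted in the thread:

    bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf

If that prints the full text while 'bin/nutch parse' keeps failing on the same URL, the difference is more likely in the job's configuration (e.g. content truncation) than in the PDF parser itself.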