Hi Lewis, thank you very much. I will try your solution.
2013/5/23 Lewis John Mcgibbney <[email protected]>

> Hi Adriana,
> If I were you I would switch your logging to DEBUG for the ParserJob:
>
>   - log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
>   + log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
>
> recompile the code, then look closely at the parse chunk of the log to see
> what parser is being used, and whether any particular issues are flagged
> up at runtime.
>
> On Thu, May 23, 2013 at 8:14 AM, Adriana Farina <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with
> > HBase 0.90.4 as the database.
> >
> > I wrote a Java class from which I run the crawling cycle; the code that
> > implements the crawling cycle is the following:
> >
> > for (int i = 0; i < depth; i++) {
> >     batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),
> >             System.currentTimeMillis(), false, false);
> >     fetcher.fetch(batchid, 1, false, -1);
> >     parser.parse(batchid, false, true);
> >     updater.run(new String[0]);
> > }
> >
> > The problem is that I'm not able to parse PDF files: inside HBase I get
> > no PDF content. The strange thing is that I get one row with the
> > following content:
> >
> > column=p:parsestat, timestamp=1369316742871,
> > value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException:
> > Unable to successfully parse content\x00
> >
> > It seems to me that I have configured all the Nutch property files
> > correctly. Can anybody help me?
> >
> > Thank you very much.
> >
> > --
> > Adriana Farina
>
> --
> *Lewis*

--
Adriana Farina
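[Editor's aside, not part of the thread: besides the DEBUG logging Lewis suggests, two configuration points are worth double-checking when Nutch 2.x reports a generic "Unable to successfully parse content" for PDFs. PDF parsing is delegated to the parse-tika plugin, so `parse-tika` must appear in the `plugin.includes` property; and a small `http.content.limit` can truncate the fetched PDF bytes, which also makes the parse fail. A minimal `nutch-site.xml` sketch follows; the exact `plugin.includes` value is an illustrative assumption and should be merged with whatever the local install already enables.]

```xml
<!-- nutch-site.xml: illustrative sketch, values are assumptions -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-(html|tika) must cover application/pdf via Tika -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation; a truncated PDF cannot be parsed -->
    <value>-1</value>
  </property>
</configuration>
```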

