> One question though: Is there a way to get some more verbose
> information out of the crawl process than just the logging information?
> I mean something like the urls crawled, the ones waiting to be crawled,
> current status etc.?
> Programmatically I can only infer at what stage the process is
> (injecting, fetching etc.), but no details. The Injector, Generator and
> Fetcher classes seem not to contain any useful methods for that purpose.
Many Nutch classes record their progress via Hadoop job counters (look for uses of org.apache.hadoop.mapred.Reporter). But I actually don't know how to access these counters from inside a Java application that runs the jobs. Another possibility is to run "nutch readdb -stats" (CrawlDbReader#processStatJob) after each cycle, which reports the number of fetched, unfetched, failed, etc. documents.

On 08/01/2012 03:00 PM, jasimop wrote:
>> Resources such as the URL filter and normalizer rule files
>> are usually defined as pure files without path and are located
>> on the classpath. So it should work if
>> C:/server/nutch/conf/
>> is in the classpath and the resources are simply named "regex-urlfilter.txt"
>> resp. "regex-normalize.xml".
>
> Thanks for the information. It works now by putting the files into the
> classpath and just using the filenames.
> Everything works now and I can start a crawl cycle from my Java application.
> Any hints?
>
> Regards,
>
> Max
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3998591.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
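For what it's worth, here is a rough sketch of how the counters could be read with the old mapred API that Nutch 1.x uses: JobClient.runJob blocks until the job finishes and returns a RunningJob, which exposes the counters that the job's tasks incremented through Reporter. The class name, the empty JobConf and the "configure like Injector/Generator/Fetcher" step are placeholders, not actual Nutch code, and the available counter group/counter names depend on the Nutch version:

```java
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Hedged sketch: dump all job counters after running a Hadoop job
// synchronously. In a real integration you would configure the JobConf
// the same way the Nutch tool (Injector, Generator, Fetcher, ...) does.
public class CounterDump {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        // ... set input/output paths, mapper/reducer classes etc. here ...
        RunningJob running = JobClient.runJob(job); // blocks until completion
        Counters counters = running.getCounters();
        for (Counters.Group group : counters) {     // iterate counter groups
            for (Counters.Counter c : group) {      // iterate counters in group
                System.out.println(group.getName() + "." + c.getName()
                        + " = " + c.getCounter());
            }
        }
    }
}
```

Between cycles, the same numbers (fetched, unfetched, gone, ...) are also available without any code via "bin/nutch readdb <crawldb-dir> -stats" on the command line.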

