> One question though: Is there a way to get some more verbose
> information out of the crawl process than just the logging information?
> I mean something like the urls crawled, the ones waiting to be crawled,
> current status etc.?
> Programmatically I can only infer at what stage the process is
> (injecting, fetching etc.), but no details. The Injector, Generator and
> Fetcher classes seem not to contain any useful methods for that purpose.
Many Nutch classes record their progress via Hadoop job counters (look for uses of org.apache.hadoop.mapred.Reporter). But I actually don't know how to access these counters from inside a Java application that runs the jobs. Another possibility is to run "nutch readdb -stats" (CrawlDbReader#processStatJob) after each cycle, which reports the number of fetched, unfetched, failed, etc. documents.

On 08/01/2012 03:00 PM, jasimop wrote:
>> Resources such as the URL filter and normalizer rule files
>> are usually defined as pure files without path and are located
>> on the classpath. So it should work if
>> C:/server/nutch/conf/
>> is in the classpath and the resources are simply named "regex-urlfilter.txt"
>> resp. "regex-normalize.xml".
>
> Thanks for the information. It works now by putting the files into the
> classpath and just using the filenames.
> Everything works now and I can start a crawl cycle from my Java application.
> Any hints?
>
> Regards,
>
> Max
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3998591.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
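For what it's worth, here is a rough sketch of how the counters could be read with the old mapred API that Nutch 1.x uses: JobClient.runJob blocks until the job finishes and returns a RunningJob, which exposes the counters that the job's tasks incremented through Reporter. The class name, the empty JobConf and the "configure like Injector/Generator/Fetcher" step are placeholders, not actual Nutch code, and the available counter group/counter names depend on the Nutch version:

```java
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Hedged sketch: dump all job counters after running a Hadoop job
// synchronously. In a real integration you would configure the JobConf
// the same way the Nutch tool (Injector, Generator, Fetcher, ...) does.
public class CounterDump {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        // ... set input/output paths, mapper/reducer classes etc. here ...
        RunningJob running = JobClient.runJob(job); // blocks until completion
        Counters counters = running.getCounters();
        for (Counters.Group group : counters) {     // iterate counter groups
            for (Counters.Counter c : group) {      // iterate counters in group
                System.out.println(group.getName() + "." + c.getName()
                        + " = " + c.getCounter());
            }
        }
    }
}
```

Between cycles, the same numbers (fetched, unfetched, gone, ...) are also available without any code via "bin/nutch readdb <crawldb-dir> -stats" on the command line.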

