Re: Integrating Nutch

jasimop Wed, 01 Aug 2012 06:01:21 -0700

> Resources such as the URL filter and normalizer rule files 
> are usually defined as pure files without path and are located 
> on the classpath. So it should work if 
>  C:/server/nutch/conf/ 
> is in the classpath and the resources are simply named "regex-urlfilter.txt" 
> resp. "regex-normalize.xml".


Thanks for the information. It works now that I have put the files on the
classpath and refer to them by file name only.
I can now start a crawl cycle from my Java application.
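
For anyone hitting the same problem, here is a minimal sketch of how to check
from plain Java whether the two rule files are visible on the classpath before
starting a crawl. It uses only the JDK; the class name ClasspathResourceCheck
and the helper locate() are made up for illustration and are not part of Nutch.
It assumes the lookup goes through the thread context class loader, which is
how such bare resource names are usually resolved, so treat it as a diagnostic
aid rather than a description of what Nutch does internally.

```java
import java.net.URL;

public class ClasspathResourceCheck {

    /**
     * Hypothetical helper: returns the URL of a classpath resource,
     * or null if the resource is not visible to the class loader.
     */
    static URL locate(String name) {
        // Assumption: resources are resolved via the context class loader.
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClasspathResourceCheck.class.getClassLoader();
        }
        return cl.getResource(name);
    }

    public static void main(String[] args) {
        // Bare file names, no directory part, as described in the quoted reply.
        String[] names = {"regex-urlfilter.txt", "regex-normalize.xml"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> "
                    + (url == null ? "NOT on classpath" : url.toString()));
        }
    }
}
```

If a file is reported as missing, the directory that contains it (for example
the conf directory mentioned above) is probably not on the classpath of the
running application.
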
One question though: is there a way to get more detailed information out of
the crawl process than just the log output? I mean something like the URLs
already crawled, the ones still waiting to be crawled, the current status, etc.
Programmatically I can only infer which stage the process is in (injecting,
fetching, etc.), but no details. The Injector, Generator and Fetcher classes do
not seem to contain any useful methods for that purpose.
Any hints?

Regards,

Max

