Hi again

Two questions:

a)
I would like the users of my application to harvest documents from the web by 
crawling an (intentionally limited) number of URLs.
This is not the major activity in my application; it is just meant to give 
users an easy way to collect a text corpus from the web.

What I have observed so far on NUTCH - and what I think I would need - are the 
following points AT EACH RUN:

1) Control / input a search depth (1 to 3 max) - yes, apparently via the Num 
Rounds parameter
2) Control / input the maximum number of documents to crawl - not clear to me 
how (see my sketch after this list)
3) Control / input a blacklist of URLs
4) Control / input the crawler identity ("who is crawling") via the 
"http.agent.name" property
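
Here is roughly how I picture points 1, 3 and 4 fitting together, plus my 
guess for point 2. This is a sketch only, assuming a recent Nutch 1.x in local 
mode; the agent name, the directory names and the limit of 500 are made-up 
examples, and I know the exact bin/crawl arguments differ between versions:

    # conf/nutch-site.xml - "who is crawling" (point 4)
    <property>
      <name>http.agent.name</name>
      <value>MyCorpusCollector</value>
    </property>

    # conf/regex-urlfilter.txt - blacklist (point 3);
    # a leading "-" excludes matching URLs, "+" includes them
    -^https?://(www\.)?donotcrawl\.example/
    +.

    # depth via the number of rounds (point 1), e.g. 2 rounds:
    bin/crawl -i -s urls/ crawl/ 2

    # per-round document cap (point 2?) - if I read the bin/crawl
    # script right, it passes a -topN limit to the generate step:
    bin/nutch generate crawl/crawldb crawl/segments -topN 500

Is adjusting that -topN value (the sizeFetchlist variable in the bin/crawl 
script, if I understand it correctly) the intended way to cap the number of 
crawled documents, or is there a cleaner configuration property for this?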


b)
Furthermore, one small but important question: provided my system constructs 
each call with separate url-seed dirs and crawl dirs, how reentrant is NUTCH 
in this case? Imagine 3 different NUTCH processes (with different data 
directories, of course) started at more or less the same time. Should I expect 
strange side effects where the runs influence each other, or will NUTCH handle 
each of these more or less simultaneous runs separately and independently? The 
sketch below shows what I mean.
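
To make this concrete, here is the kind of parallel invocation I have in mind. 
The directory layout and the hadoop.tmp.dir override are only my guesses at 
what isolation might require; I have not verified that the -D option of 
bin/crawl and the local job runner's default tmp dir really behave this way:

    # three more-or-less simultaneous runs, each with its own seed dir,
    # crawl dir and - as a precaution - its own Hadoop tmp dir, since I
    # suspect the default /tmp/hadoop-${user.name} is shared between JVMs
    bin/crawl -D hadoop.tmp.dir=/data/run1/tmp -s /data/run1/urls /data/run1/crawl 2 &
    bin/crawl -D hadoop.tmp.dir=/data/run2/tmp -s /data/run2/urls /data/run2/crawl 2 &
    bin/crawl -D hadoop.tmp.dir=/data/run3/tmp -s /data/run3/urls /data/run3/crawl 2 &
    wait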

Thanks a lot for your hints / answers
Fabio
