Hi again, two questions.

a) I would like users of my application to harvest documents from "outside" on the web, crawling an (actually limited) number of URLs. This is not the major activity of my application; it is just meant to let users collect a text corpus from the web easily. From what I have observed of Nutch so far, what I think I need at each run is the following:

1) Control / input a crawl depth (1 to 3 max) - (yes, with the Num Rounds parameter)
2) Control / input the maximal number of documents to crawl - not clear how
3) Control / input a blacklist
4) Control / input the crawler identity ("who is crawling") via the property "http.agent.name"

b) Furthermore, one small but important question: provided my system constructs each call (sketched in the P.S. below) by giving separate URL seed dirs and crawl dirs every time, how reentrant is Nutch in this case? Imagine 3 different Nutch processes (with different data directories, of course) are started at more or less the same time. Should I expect strange side effects where the runs influence each other, or will Nutch process each of these more or less simultaneous runs separately and independently?

Thanks a lot for your hints / answers,
Fabio
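P.S. For reference, here is roughly how my application would construct one run. This is only a sketch: I assume the bin/crawl wrapper script of a Nutch 1.x binary distribution, and all paths and the agent value are placeholders; the exact arguments of bin/crawl differ between releases, so the usage line of your own copy is authoritative.

```bash
#!/usr/bin/env bash
# Sketch of one crawl run as my application would construct it.
# Assumes the bin/crawl wrapper of a Nutch 1.x binary distribution;
# its arguments/options vary between releases, so check your own
# bin/crawl usage line.

NUTCH_HOME=/opt/apache-nutch-1.x        # placeholder install path
SEED_DIR=/data/jobs/job-42/urls         # per-run seed dir (text files, one URL per line)
CRAWL_DIR=/data/jobs/job-42/crawl       # per-run crawldb / linkdb / segments
NUM_ROUNDS=3                            # point (1): depth 1 to 3

# Point (4): the crawler identity comes from http.agent.name in
# $NUTCH_HOME/conf/nutch-site.xml, e.g.
#   <property>
#     <name>http.agent.name</name>
#     <value>MyCorpusCollector</value>
#   </property>

# Point (3): blacklist patterns go into $NUTCH_HOME/conf/regex-urlfilter.txt,
# one rule per line, "-" excludes, "+" includes, e.g.
#   -^https?://www\.example\.com/private/
#   +.

# Point (2): an overall document limit is the part that is not clear to me;
# the generator's topN (the sizeFetchlist variable inside bin/crawl) seems
# to cap the number of URLs per round only, not the total.

cd "$NUTCH_HOME" || exit 1
bin/crawl "$SEED_DIR" "$CRAWL_DIR" "$NUM_ROUNDS"
```

Each of the three more or less simultaneous processes from point (b) would execute exactly this, only with its own SEED_DIR and CRAWL_DIR.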

