Hi there,
 I've recently been crawling some sites with Nutch, but several problems
are bothering me. I've searched Google and forums such as nutch-user but
found little help, so I'm listing the questions below and hoping you can
do me a favor. Thanks~
 1. Can Nutch be interrupted while it is crawling? If it can, what exactly
does it do when it resumes? If it can't, must I re-crawl the whole sites
(oh, that would be a really huge re-work), or is there a better solution?
 2. How does Nutch handle unusual HTTP status codes such as 307 or 203?
 3. How does the crawl option depth work? For example, if I have already
crawled with depth=3, what will Nutch do when I re-crawl with depth=3?
Will it regenerate the fetch list of URLs from only the most recent
segment, or from all of the segments plus the original seed file? (See
the sketch after this list for how I currently picture the loop.)
 4. What are the consequences of manually removing some of the
subdirectories under the segments directory? (The note after the sketch
below says which directories I mean.)
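
To make questions 1 and 3 concrete, here is a minimal sketch of the crawl
cycle as I currently picture it. The class and method names below are my
own placeholders, not real Nutch APIs; each printed line just stands for
the corresponding bin/nutch step (generate, fetch, updatedb):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: models "depth" as the number of
// generate/fetch/updatedb rounds. Nothing here calls Nutch itself.
public class CrawlCycle {

    // One round: generate a fetch list into a new segment, fetch it,
    // then fold the results back into the crawl database so the next
    // round can see newly discovered links.
    static void runOneRound(List<String> crawlDb, int round) {
        String segment = "segments/round-" + round; // placeholder name
        System.out.println("generate -> " + segment);
        System.out.println("fetch    <- " + segment);
        System.out.println("updatedb <- " + segment);
        crawlDb.add(segment); // newly discovered URLs merged into the db
    }

    public static void main(String[] args) {
        List<String> crawlDb = new ArrayList<>(); // stands in for crawldb
        int depth = 3; // the "depth" option
        for (int round = 1; round <= depth; round++) {
            runOneRound(crawlDb, round);
        }
    }
}

If this picture is right, interrupting a crawl (question 1) means stopping
somewhere inside one round, and depth (question 3) is just the number of
rounds; what I can't tell is what state the generate step reads on a
re-run.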
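
For question 4, by "subdirectories under the segments directory" I mean
the per-round, timestamp-named segment directories. As far as I can tell,
each segment in turn holds content, crawl_generate, crawl_fetch,
crawl_parse, parse_data, and parse_text; what I'd like to know is whether
deleting a whole segment, or parts of one, breaks later updatedb or index
steps.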
I've searched for answers to these questions but haven't found clear ones,
so I hope you can tell me your opinions, or we can discuss them here.
I'm also reading the source code, but that is a really huge job~

-----
I'm what I am.