Hi,

-----Original message-----
> From: IT_ailen <[email protected]>
> Sent: Mon 16-Jul-2012 07:40
> To: [email protected]
> Subject: Problems on using nutch
>
> Hi there,
> Recently I have been crawling some sites with Nutch, but several
> problems are bothering me. I have searched with Google and on forums
> like nutch-user, but got little help. So I will list the problems
> below and hope you can do me a favour. Thanks~
>
> 1. Can Nutch be interrupted while it is crawling? If it can, what
> exactly happens when it resumes? If it cannot, must I re-crawl the
> whole site (which would be a huge amount of rework), or is there a
> better solution?

No, you cannot resume a Nutch 1.x crawl. If it is interrupted for some
reason you must refetch all pages. This is not a problem if you work
with small segment sizes; see the sketch just below.
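For example, if you drive the crawl with the individual tools instead
of the all-in-one crawl command, each generate/fetch/parse/updatedb
cycle produces one segment, and a crash only costs you the segment
that was in flight. Roughly like this (the paths and the -topN value
are just examples):

    bin/nutch inject crawl/crawldb urls        # seed the crawldb (once)

    # one cycle == one segment; repeat once per depth level
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`      # the segment just generated
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s

-topN caps how many URLs go into each segment, so even on a big site a
single cycle stays small and is cheap to repeat after a failure.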
> 2. How does Nutch handle bad HTTP statuses such as 307 or 203?

I'm not sure. You may want to check the ProtocolStatus and lib-http
code. There is also a quick empirical check at the bottom of this mail.

> 3. How does the crawl option "depth" work? For example, if I have
> crawled with depth=3, what will Nutch do when I re-crawl with depth=3?
> Will it regenerate the list of URLs to fetch from the most recent
> segment, from all segments, or from the file of original seeds?

It will follow outlinks to the 3rd depth from the original seeds. Each
pass of the generate/fetch/updatedb loop in the sketch above is one
depth level.

> 4. What kind of influence will it have if I manually remove some
> subdirectories under the segments directory?

You can delete them if you don't need them any more, i.e. once a
segment has been parsed, merged into the crawldb (updatedb) and linkdb
(invertlinks), and indexed, nothing in the normal crawl cycle reads it
again.

> I have searched these questions but got no clear answers, so I hope
> you can tell me your opinions, or we can discuss them here. I'm
> reading the source code, but that is a really huge job~~
>
> -----
> I'm what I am.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problems-on-using-nutch-tp3995207.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
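PS, regarding question 2: I can't say offhand which ProtocolStatus a
307 or 203 is mapped to, but you can check empirically on a segment
you have already fetched, since the crawl_fetch part of a segment
records the protocol status per URL. Something like this (segment
name, output dir and URL are placeholders):

    # dump the segment; look at the crawl_fetch records in the output
    bin/nutch readseg -dump crawl/segments/<segment> dump_out

    # or look up how a single URL ended up in the crawldb
    bin/nutch readdb crawl/crawldb -url http://www.example.com/page.html

If the 307s show up as a redirect status you are fine; if they come
back as exceptions you know where to start digging in lib-http.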

