Hi,
 
 
-----Original message-----
> From: IT_ailen <[email protected]>
> Sent: Mon 16-Jul-2012 07:40
> To: [email protected]
> Subject: Problems on using nutch
> 
> Hi there,
>  Recently I've been crawling some sites with Nutch, but several problems
> are bothering me. I have searched with Google and on forums like
> nutch-user, but have gotten little help, so I'm listing them below and
> hope you guys can do me a favor. Thanks~
>  1. Can Nutch be interrupted while it is crawling? If it can be
> interrupted, what is the exact handling logic after it resumes; if not,
> must I re-crawl the whole sites (oh, that would be a really huge re-work),
> or is there a better solution?

No, you cannot resume a Nutch 1.x crawl. If it is interrupted for some reason
you must refetch all pages. This is not a problem if you work with small
segment sizes.
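
For reference, the step-by-step cycle with small segments looks roughly like
this in 1.x (directory names and the -topN value are just examples); if the
fetch is killed, you only lose the one unfinished segment:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s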

>  2. How does Nutch handle bad HTTP statuses like 307 or 203?

I'm not sure offhand. You may want to check the ProtocolStatus class and the
lib-http plugin code.
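
One way to see how a given response was actually recorded is to dump the
segment and look at the fetch status for that URL (paths are examples, and
the exact field names in the dump vary between versions):

  bin/nutch readseg -dump crawl/segments/20120716070000 dumpdir -nocontent -noparse
  less dumpdir/dump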

>  3. How does the crawl option "depth" work? For example, if I have crawled
> with a depth of 3, what will Nutch do when I re-crawl with "depth=3"? Will
> it regenerate the list of URLs to fetch from the most recent segment, from
> all of the segments, or from the original seed file?

It will follow outlinks to the 3rd depth from the original seeds.
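
In Nutch 1.x the crawl command's depth option just repeats the
generate/fetch/parse/updatedb cycle shown above, so "depth=3" is roughly
equivalent to (directory names are examples):

  bin/nutch inject crawl/crawldb urls    # first crawl only
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    s=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
  done

Note that generate always selects from the whole CrawlDb, i.e. the original
seeds plus every outlink merged in by updatedb so far, not from the most
recent segment alone.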

>  4. What effects will there be if I manually remove some subdirectories
> under the segments directory?
> I've searched these questions but haven't got clear answers, so I hope you
> guys can tell me your opinions, or we can discuss them here.
> I'm reading the source code, but that is really a huge amount of work~~

You can delete them if you don't need them anymore.
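
Roughly speaking, a segment is only needed until you have run updatedb (and
linkdb/index, if you use them) on it; after that, deleting it just means you
lose the fetched content stored in it. If you want to keep the content but
tidy up, you can also merge old segments into one instead of deleting them
(the output directory name here is an example):

  bin/nutch mergesegs crawl/segments_merged -dir crawl/segments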

> 
> -----
> I'm what I am.
