Thank you very much!
I'm reading the source code to work out the answers, but it is
really a big project.


On 2012-07-16 16:45, Markus Jelsma-2 [via Lucene] wrote:
> Hi,
>
>
> -----Original message-----
>
> > From: IT_ailen <[hidden email]>
> > Sent: Mon 16-Jul-2012 07:40
> > To: [hidden email]
> > Subject: Problems on using nutch
> >
> > Hi there,
> >  I have recently been crawling some sites with Nutch, but several problems
> > are bothering me. I have searched Google and forums such as nutch-user,
> > but found little help, so I am listing them below in the hope that you
> > can help. Thanks~
> >  1. Can Nutch be interrupted while it is crawling? If it can, what exactly
> > happens when it resumes? If not, must I re-crawl all of the sites (that
> > would be a huge amount of rework), or is there a better solution?
>
> No, you cannot resume a Nutch 1.x crawl. If it is interrupted for some
> reason you must refetch all pages. This is not a problem if you work
> with small segment sizes.
>
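To make the point about small segments concrete, here is a minimal Python sketch. It is a toy model, not Nutch code: it assumes that segments which finished fetching survive an interruption and that only the segment in progress has to be fetched again. The names plan_segments and refetch_cost and the example.com URLs are made up for illustration.

```python
def plan_segments(urls, segment_size):
    """Split the URL list into consecutive batches of at most segment_size."""
    return [urls[i:i + segment_size] for i in range(0, len(urls), segment_size)]


def refetch_cost(segments, interrupted_index):
    """Pages to fetch again if the fetch of one segment is interrupted.

    Assumption: segments that completed earlier are kept as they are.
    """
    return len(segments[interrupted_index])


urls = [f"http://example.com/page{i}" for i in range(10)]

small = plan_segments(urls, 3)   # four segments of sizes 3, 3, 3, 1
big = plan_segments(urls, 10)    # one segment holding everything

print("small segments, interrupted in the third:", refetch_cost(small, 2))
print("one big segment, interrupted:", refetch_cost(big, 0))
```

Under these assumptions an interruption costs at most one segment's worth of pages, so smaller segments mean less repeated work.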
> >  2. How does Nutch handle bad HTTP statuses such as 307 or 203?
>
> I'm not sure. You may want to check the ProtocolStatus and lib-http code.
>
> >  3. How does the crawl option depth work? For example, if I have already
> > crawled with a depth of 3, what will Nutch do when I re-crawl with
> > "depth=3"? Will it regenerate the list of URLs to fetch from the most
> > recent segment, or from all segments and the original seed file?
>
> It will follow outlinks to the 3rd depth from the original seeds.
>
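A small Python sketch of what a depth limit means on a toy link graph. This is not how Nutch is implemented; it assumes that each depth level corresponds to one round of fetching and that the seeds count as the first round. The function crawl_rounds and the link graph are hypothetical.

```python
def crawl_rounds(seeds, outlinks, depth):
    """Return the list of URLs fetched in each round, up to `depth` rounds."""
    seen = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(depth):
        if not frontier:
            break  # nothing left to fetch
        rounds.append(frontier)
        next_frontier = []
        for url in frontier:
            for link in outlinks.get(url, []):
                if link not in seen:  # queue each URL only once
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return rounds


# Toy link graph: page -> pages it links to.
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "e": ["f"],
}

rounds = crawl_rounds(["seed"], links, depth=3)
for number, fetched in enumerate(rounds, start=1):
    print(number, fetched)
```

In this model, "e" and "f" are never fetched with depth=3 because they are more than three rounds away from the seed.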
> >  4. What is the effect of manually removing some subdirectories under
> > the segments directory?
> > I've searched for answers to these questions but haven't found clear
> > ones, so I hope you can tell me your opinions, or we can discuss them here.
> > I'm reading the source code, but that is a really huge job~~
>
> You can delete them if you don't need them anymore.
>
> >
> > -----
> > I'm what I am.
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Problems-on-using-nutch-tp3995207.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >

-----
I'm what I am.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problems-on-using-nutch-tp3995207p3995220.html
Sent from the Nutch - User mailing list archive at Nabble.com.
