Thank you very much! I'm reading the source code to work out the answers, but it is a really big project.

On 2012-07-16 16:45, Markus Jelsma-2 [via Lucene] wrote:
> Hi,
>
> -----Original message-----
> > From: IT_ailen <[hidden email]>
> > Sent: Mon 16-Jul-2012 07:40
> > To: [hidden email]
> > Subject: Problems on using nutch
> >
> > Hi there,
> > Recently I've been crawling some sites with Nutch, but several problems
> > are bothering me. I have searched Google and some forums like
> > nutch-user, but have still gotten little help. So I have listed them
> > below and hope you guys can do me a favor. Thanks~
> > 1. Can Nutch be interrupted while it is crawling? If it can be
> > interrupted, what is the exact handling logic after it resumes; if not,
> > must I re-crawl the whole sites (oh, that will be a really huge
> > re-work), or is there a better solution?
>
> No, you cannot resume a Nutch 1.x crawl. If it is interrupted for some
> reason you must refetch all pages. This is not a problem if you work
> with small segment sizes.
>
> > 2. How does Nutch handle HTTP statuses like 307 or 203?
>
> I'm not sure. You may want to check the ProtocolStatus and lib-http code.
>
> > 3. How does the crawl option depth work? For example, if I have crawled
> > with a depth of 3, what will Nutch do when I re-crawl with "depth=3"?
> > Will it regenerate the fetch list of URLs from the most recent segment,
> > or from all of them and the file of original seeds?
>
> It will follow outlinks to the 3rd depth from the original seeds.
>
> > 4. What happens when I manually remove some subdirectories under the
> > segments directory?
> > I've searched for these questions but haven't found clear answers, so I
> > hope you guys can tell me your opinions, or we can discuss them here.
> > I'm reading the source code, but that is a really huge job~~
>
> You can delete them if you don't need them anymore.

-----
I'm what I am.
--
View this message in context: http://lucene.472066.n3.nabble.com/Problems-on-using-nutch-tp3995207p3995220.html
Sent from the Nutch - User mailing list archive at Nabble.com.
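
The depth answer above can be pictured with a small toy model. This is only a sketch and not Nutch code: it assumes that each unit of depth is one generate/fetch/update round over a crawl database, and the names `crawl`, `LINKS`, `crawldb` and `top_n` are made up for illustration. Each round produces one batch, which stands in for a segment, so a smaller `top_n` means less work lost if a round is interrupted.

```python
def crawl(seeds, links, depth, top_n=None):
    """Toy model of depth-limited crawling (hypothetical, not Nutch's API).

    seeds: list of start URLs
    links: dict mapping a URL to the list of its outlinks
    depth: number of generate/fetch/update rounds to run
    top_n: optional cap on how many URLs are fetched per round
    """
    # The crawl database records every URL seen so far and its status.
    crawldb = {url: "unfetched" for url in seeds}
    segments = []  # one list of fetched URLs per round

    for _ in range(depth):
        # "Generate": pick the URLs that are known but not yet fetched.
        batch = sorted(u for u, s in crawldb.items() if s == "unfetched")
        if top_n is not None:
            batch = batch[:top_n]
        if not batch:
            break  # nothing left to fetch

        # "Fetch" and "update": mark each URL fetched, record new outlinks.
        for url in batch:
            crawldb[url] = "fetched"
            for out in links.get(url, []):
                crawldb.setdefault(out, "unfetched")
        segments.append(batch)

    return crawldb, segments


# A made-up link graph: seed -> a, b; a -> c; c -> d; d -> e.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": ["e"]}

crawldb, segments = crawl(["seed"], LINKS, depth=3)
for number, batch in enumerate(segments, start=1):
    print("round", number, batch)
```

With a depth of 3 the model fetches the seed, then its outlinks, then their outlinks. Page `d` is discovered but left unfetched, and `e` is never seen at all.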

