By the way, I use HBase instead of a crawldb folder?

2013/3/21 Markus Jelsma-2 [via Lucene] <[email protected]>
> The CrawlDB contains information on all URLs and their status, e.g. which
> HTTP code they returned, the interval, some metadata and their fetch time.
> Use the readdb command to inspect a specific URL.
>
> -----Original message-----
> > From: kamaci <[hidden email]>
> > Sent: Wed 20-Mar-2013 23:52
> > To: [hidden email]
> > Subject: Re: Does Nutch Checks Whether A Page crawled before or not
> >
> > Where does Nutch store that information?
> >
> > 2013/3/21 Markus Jelsma-2 [via Lucene] <[hidden email]>
> >
> > > Nutch selects records that are eligible for fetch. A record is eligible
> > > either because of a transient failure or because its fetch interval has
> > > expired. This means that fetches that failed due to network issues are
> > > refetched within 24 hours. Successfully fetched pages are only refetched
> > > once the current time exceeds the previous fetchTime + interval.
> > >
> > > -----Original message-----
> > > > From: kamaci <[hidden email]>
> > > > Sent: Wed 20-Mar-2013 23:46
> > > > To: [hidden email]
> > > > Subject: Does Nutch Checks Whether A Page crawled before or not
> > > >
> > > > Let's assume that I am crawling wikipedia.org with depth 1 and topN 1.
> > > > After it finishes crawling, I rerun that command, and after it finishes
> > > > I rerun it again and again. What happens? Does Nutch skip previously
> > > > fetched pages or try to crawl the same pages again?
> > > >
> > > > --
> > > > View this message in context:
> > > > http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564.html
> > > > Sent from the Nutch - User mailing list archive at Nabble.com.
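
To make the eligibility rule quoted above concrete, here is a minimal sketch in Python. It is not Nutch code: the `Record` class, the `is_eligible` function and the `RETRY_DELAY` constant are hypothetical names for illustration, and the one-day retry delay is only an assumption standing in for "refetched within 24 hours".

```python
from dataclasses import dataclass

DAY_SECONDS = 24 * 60 * 60

# Assumption: a transient failure is retried after at most one day.
RETRY_DELAY = DAY_SECONDS


@dataclass
class Record:
    """Hypothetical stand-in for one URL entry in the crawl database."""
    url: str
    fetch_time: int   # epoch seconds of the last fetch attempt
    interval: int     # refetch interval in seconds
    failed: bool      # True if the last attempt hit a transient failure


def is_eligible(record: Record, now: int) -> bool:
    """Return True if the record should be selected for fetching at `now`."""
    if record.failed:
        # Failed fetches are retried soon, regardless of the normal interval.
        return now >= record.fetch_time + RETRY_DELAY
    # Successful fetches wait until fetchTime + interval has passed.
    return now > record.fetch_time + record.interval
```

Under this sketch, rerunning the crawl right after a successful fetch selects nothing for that URL, because the current time has not yet passed fetchTime + interval.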
--
View this message in context: http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564p4049579.html
Sent from the Nutch - User mailing list archive at Nabble.com.

