By the way, what if I use HBase instead of a crawldb folder?

2013/3/21 Markus Jelsma-2 [via Lucene] <[email protected]>

> The CrawlDB contains information on all URLs and their status, e.g. which
> HTTP code they returned, the fetch interval, some metadata and their fetch
> time. Use the readdb command to inspect a specific URL.
>
>
>
> -----Original message-----
>
> > From: kamaci <[hidden email]>
>
> > Sent: Wed 20-Mar-2013 23:52
> > To: [hidden email]
> > Subject: Re: Does Nutch Checks Whether A Page crawled before or not
> >
> > Where does Nutch store that information?
> >
> > 2013/3/21 Markus Jelsma-2 [via Lucene] <[hidden email]>
> >
> > > Nutch selects records that are eligible for fetch, either because of a
> > > transient failure or because the fetch interval has expired. This means
> > > that fetches that failed due to network issues are refetched within 24
> > > hours. Successfully fetched pages are only refetched once the current
> > > time exceeds the previous fetchTime + interval.
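
To make the rule above concrete, here is a minimal Python sketch of the eligibility check as it is described in this thread. It is not Nutch's actual code: the UrlRecord class, the status strings and the is_eligible function are all hypothetical names invented for illustration, and the "within 24 hours" retry timing for transient failures is not modeled (a failed record is simply treated as eligible).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UrlRecord:
    """Simplified stand-in for one per-URL record (hypothetical, not a Nutch class)."""
    url: str
    status: str            # assumed values: "fetched" or "transient_failure"
    fetch_time: datetime   # time of the previous fetch attempt
    interval: timedelta    # refetch interval for this URL


def is_eligible(record: UrlRecord, now: datetime) -> bool:
    """Return True if the record should be selected for fetching again."""
    # A transient failure (e.g. a network issue) makes the record eligible.
    if record.status == "transient_failure":
        return True
    # A successfully fetched page is only due once the current time
    # exceeds the previous fetch time plus the interval.
    return now > record.fetch_time + record.interval


if __name__ == "__main__":
    now = datetime(2013, 3, 21)
    page = UrlRecord("http://wikipedia.org/", "fetched",
                     datetime(2013, 3, 20), timedelta(days=30))
    print(is_eligible(page, now))                        # fetched one day ago
    print(is_eligible(page, now + timedelta(days=31)))   # interval has passed
```

Under these assumptions, rerunning the crawl right after a successful fetch selects nothing for that page until its interval has passed.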
> > >
> > >
> > >
> > > -----Original message-----
> > >
> > > > From: kamaci <[hidden email]>
> > >
> > > > Sent: Wed 20-Mar-2013 23:46
> > > > To: [hidden email]
> > > > Subject: Does Nutch Checks Whether A Page crawled before or not
> > > >
> > > > Let's assume that I am crawling wikipedia.org with depth 1 and topN 1.
> > > > After it finishes crawling, what happens if I rerun that command again
> > > > and again? Does Nutch skip previously fetched pages, or does it try to
> > > > crawl the same pages again?
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564.html
> > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564p4049579.html
Sent from the Nutch - User mailing list archive at Nabble.com.
