Hi David,

Did you clear the web server cache? Maybe the refetch is also crawling the old
page.

Maybe you can dump the url content to check whether the modification was
fetched, using the bin/nutch readseg command.
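For example (the segment name below is just a placeholder; use the latest
directory under crawl/segments from your run):

```shell
# Dump the most recent segment to plain text so you can inspect what was
# actually fetched (segment name is an example)
bin/nutch readseg -dump crawl/segments/20130305120000 /tmp/segdump

# Search the dump for the modified page to see whether the new content
# was fetched ("Aurl" stands for the URL you changed)
grep -A 20 'Aurl' /tmp/segdump/dump
```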

Thanks


On Tue, Mar 5, 2013 at 1:28 PM, David Philip <[email protected]> wrote:

> Hi Markus,
>
>   So I was trying the db.injector.update setting that you mentioned;
> please see my observations below.
> Settings: I set db.injector.update to true and
> db.fetch.interval.default to 1 hour.
>
> Observation:
>
> On the first crawl [1], 14 urls were successfully crawled and indexed to
> Solr.
> Case 1:
> Among those 14 urls I modified the content and title of one url (say Aurl)
> and re-executed the crawl after one hour.
> I see that this url (Aurl) was re-fetched (it shows in the log), but at the
> Solr level the content and title fields for that url did not get updated.
> Why? Do I need any extra configuration for the Solr index to get updated?
>
> Case 2:
> I added a new url to the crawled site.
> The url got indexed - this was a success. So I am interested to know why the
> above case failed. What configuration needs to be made?
>
>
> Thanks - David
>
>
> PS:
> Apologies that I am still asking questions on the same topic. I am not able
> to find a good way to do incremental crawling, so I am trying different
> approaches. Once I am clear I will blog about it and share it. Thanks a lot
> for the replies on the list.
>
>
>
>
>
>
>
> On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma
> <[email protected]>wrote:
>
> > You can simply reinject the records.  You can overwrite and/or update the
> > current record. See the db.injector.update and overwrite settings.
> >
> > -----Original message-----
> > > From:David Philip <[email protected]>
> > > Sent: Wed 27-Feb-2013 11:23
> > > To: [email protected]
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi Markus, I meant overriding the injected interval. How do I override
> > > the injected fetch interval?
> > > While crawling, the fetch interval was set to 30 days (default). Now I
> > > want to re-fetch the same site (that is, force a re-fetch) and not wait
> > > for the fetch interval (30 days). How can we do that?
> > >
> > >
> > > Feng Lu : Thank you for the reference link.
> > >
> > > Thanks - David
> > >
> > >
> > >
> > > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
> > > <[email protected]>wrote:
> > >
> > > > The default or the injected interval? The default interval can be set
> > > > in the config (see nutch-default for example). Per-URL intervals can
> > > > be set using the injector: <URL>\tnutch.fixedFetchInterval=86400
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:David Philip <[email protected]>
> > > > > Sent: Wed 27-Feb-2013 06:21
> > > > > To: [email protected]
> > > > > Subject: Re: Nutch Incremental Crawl
> > > > >
> > > > > Hi all,
> > > > >
> > > > >   Thank you very much for the replies. Very useful information to
> > > > > understand how incremental crawling can be achieved.
> > > > >
> > > > > Dear Markus:
> > > > > Can you please tell me how I can override this fetch interval, in
> > > > > case I need to fetch the page before the time interval has passed?
> > > > >
> > > > >
> > > > >
> > > > > Thanks very much
> > > > > - David
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> > > > > <[email protected]>wrote:
> > > > >
> > > > > > If you want records to be fetched at a fixed interval, it's easier
> > > > > > to inject them with a fixed fetch interval.
> > > > > >
> > > > > > nutch.fixedFetchInterval=86400
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:kemical <[email protected]>
> > > > > > > Sent: Thu 14-Feb-2013 10:15
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Nutch Incremental Crawl
> > > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > You can also consider setting a shorter fetch interval with
> > > > > > > nutch inject.
> > > > > > > This way you'll set a higher score (so the url is always taken
> > > > > > > with priority when you generate a segment) and a fetch interval
> > > > > > > of 1 day.
> > > > > > >
> > > > > > > If you have a case similar to mine, you'll often want some
> > > > > > > homepages fetched each day but not their inlinks. What you can
> > > > > > > do is inject all your seed urls again (assuming those urls are
> > > > > > > only homepages).
> > > > > > >
> > > > > > > #change nutch option so existing urls can be injected again in
> > > > > > > conf/nutch-default.xml or conf/nutch-site.xml
> > > > > > > db.injector.update=true
> > > > > > >
> > > > > > > #Add metadata to update score/fetch interval
> > > > > > > #the following line will append the new score / new interval to
> > > > > > > #each line of your seed url files
> > > > > > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=80000\n/' \
> > > > > > >   [your_seed_url_dir]/*
> > > > > > >
> > > > > > > #run command
> > > > > > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > > > > > >
> > > > > > > Now, the following crawls will take your urls with top priority
> > > > > > > and crawl them once a day. I've used my situation to illustrate
> > > > > > > the concept, but I guess you can tweak the params to fit your
> > > > > > > needs.
> > > > > > >
> > > > > > > This way is useful when you want a regular fetch on some urls;
> > > > > > > if it occurs rarely, I guess freegen is the right choice.
> > > > > > >
> > > > > > > Best,
> > > > > > > Mike
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > View this message in context:
> > > > > > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
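For what it's worth, the inject-based force-refetch discussed in the thread
can be sketched as below; the seed path and example URL are placeholders, and
db.injector.update=true is assumed to be set in conf/nutch-site.xml:

```shell
# seeds/seeds.txt: one URL per line with tab-separated injector metadata:
# a high score so the URL is generated first, and a fixed daily interval
printf 'http://example.com/\tnutch.score=100\tnutch.fixedFetchInterval=86400\n' \
  > seeds/seeds.txt

# Re-inject; with db.injector.update=true the existing CrawlDb record
# is updated instead of being left untouched
bin/nutch inject crawl/crawldb seeds
```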



-- 
Don't Grow Old, Grow Up... :-)
