Hi David,

Did you clear the web server cache? Maybe the refetch is also crawling the old
page.

Maybe you can dump the url content to check whether the modification was
fetched, using the bin/nutch readseg command.
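For example (the segment name below is just a placeholder; use the latest
directory under crawl/segments from your run):

```shell
# Dump the most recent segment to plain text so you can inspect what was
# actually fetched (segment name is an example)
bin/nutch readseg -dump crawl/segments/20130305120000 /tmp/segdump

# Search the dump for the modified page to see whether the new content
# was fetched ("Aurl" stands for the URL you changed)
grep -A 20 'Aurl' /tmp/segdump/dump
```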

Thanks


On Tue, Mar 5, 2013 at 1:28 PM, David Philip <[email protected]> wrote:

> Hi Markus,
>
>   So I was trying the db.injector.update setting that you mentioned;
> please see my observations below.
> Settings: I set db.injector.update to true and
> db.fetch.interval.default to 1 hour.
>
> Observation:
>
> On the first crawl [1], 14 urls were successfully crawled and indexed to
> Solr.
> Case 1:
> Among those 14 urls I modified the content and title of one url (say Aurl)
> and re-executed the crawl after one hour.
> I see that this url (Aurl) was re-fetched (it shows in the log), but at the
> Solr level the content and title fields for that url did not get updated.
> Why? Do I need any extra configuration for the Solr index to get updated?
>
> Case 2:
> I added a new url to the crawled site.
> The url got indexed - this was a success. So I am interested to know why the
> above case failed. What configuration needs to be made?
>
>
> Thanks - David
>
>
> PS:
> Apologies that I am still asking questions on the same topic. I am not able
> to find a good way to do incremental crawling, so I am trying different
> approaches. Once I am clear I will blog about it and share it. Thanks a lot
> for the replies on the list.
>
>
>
>
>
>
>
> On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma
> <[email protected]>wrote:
>
> > You can simply reinject the records.  You can overwrite and/or update the
> > current record. See the db.injector.update and overwrite settings.
> >
> > -----Original message-----
> > > From:David Philip <[email protected]>
> > > Sent: Wed 27-Feb-2013 11:23
> > > To: [email protected]
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi Markus, I meant overriding the injected interval. How do I override
> > > the injected fetch interval?
> > > While crawling, the fetch interval was set to 30 days (default). Now I
> > > want to re-fetch the same site (that is, force a re-fetch) and not wait
> > > for the fetch interval (30 days). How can we do that?
> > >
> > >
> > > Feng Lu : Thank you for the reference link.
> > >
> > > Thanks - David
> > >
> > >
> > >
> > > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
> > > <[email protected]>wrote:
> > >
> > > > The default or the injected interval? The default interval can be set
> > > > in the config (see nutch-default for example). Per-URL intervals can
> > > > be set using the injector: <URL>\tnutch.fixedFetchInterval=86400
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:David Philip <[email protected]>
> > > > > Sent: Wed 27-Feb-2013 06:21
> > > > > To: [email protected]
> > > > > Subject: Re: Nutch Incremental Crawl
> > > > >
> > > > > Hi all,
> > > > >
> > > > >   Thank you very much for the replies. Very useful information to
> > > > > understand how incremental crawling can be achieved.
> > > > >
> > > > > Dear Markus:
> > > > > Can you please tell me how I can override this fetch interval, in
> > > > > case I need to fetch the page before the time interval has passed?
> > > > >
> > > > >
> > > > >
> > > > > Thanks very much
> > > > > - David
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> > > > > <[email protected]>wrote:
> > > > >
> > > > > > If you want records to be fetched at a fixed interval, it's easier
> > > > > > to inject them with a fixed fetch interval.
> > > > > >
> > > > > > nutch.fixedFetchInterval=86400
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:kemical <[email protected]>
> > > > > > > Sent: Thu 14-Feb-2013 10:15
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Nutch Incremental Crawl
> > > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > You can also consider setting a shorter fetch interval with
> > > > > > > nutch inject.
> > > > > > > This way you'll set a higher score (so the url is always taken
> > > > > > > with priority when you generate a segment) and a fetch interval
> > > > > > > of 1 day.
> > > > > > >
> > > > > > > If you have a case similar to mine, you'll often want some
> > > > > > > homepages fetched each day but not their inlinks. What you can
> > > > > > > do is inject all your seed urls again (assuming those urls are
> > > > > > > only homepages).
> > > > > > >
> > > > > > > #change nutch option so existing urls can be injected again in
> > > > > > > conf/nutch-default.xml or conf/nutch-site.xml
> > > > > > > db.injector.update=true
> > > > > > >
> > > > > > > #Add metadata to update score/fetch interval
> > > > > > > #the following line will append the new score / new interval to
> > > > > > > #each line of your seed url files
> > > > > > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=80000\n/' \
> > > > > > >   [your_seed_url_dir]/*
> > > > > > >
> > > > > > > #run command
> > > > > > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > > > > > >
> > > > > > > Now, the following crawls will take your urls with top priority
> > > > > > > and crawl them once a day. I've used my situation to illustrate
> > > > > > > the concept, but I guess you can tweak the params to fit your
> > > > > > > needs.
> > > > > > >
> > > > > > > This way is useful when you want a regular fetch on some urls;
> > > > > > > if it occurs rarely, I guess freegen is the right choice.
> > > > > > >
> > > > > > > Best,
> > > > > > > Mike
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > View this message in context:
> > > > > > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
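For what it's worth, the inject-based force-refetch discussed in the thread
can be sketched as below; the seed path and example URL are placeholders, and
db.injector.update=true is assumed to be set in conf/nutch-site.xml:

```shell
# seeds/seeds.txt: one URL per line with tab-separated injector metadata:
# a high score so the URL is generated first, and a fixed daily interval
printf 'http://example.com/\tnutch.score=100\tnutch.fixedFetchInterval=86400\n' \
  > seeds/seeds.txt

# Re-inject; with db.injector.update=true the existing CrawlDb record
# is updated instead of being left untouched
bin/nutch inject crawl/crawldb seeds
```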



-- 
Don't Grow Old, Grow Up... :-)
