Hi

<< I used less command and checked, it shows the past content, not modified one. Any other cache clearing from crawl db? or any property to set in nutch-site so that it does re-fetch modified content? >>

As far as I know, the crawl db does not use a cache. As Markus said, you can simply reinject the records. Nutch does not know by itself which web pages should be re-fetched; that is controlled only by the fetch interval, configured in the nutch-site configuration file (or per URL at inject time, as in Markus's earlier mail quoted below).
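For example, the interval can be shortened globally in nutch-site.xml. This is only a rough sketch: the 3600-second value is an illustration, pick whatever suits your sites (the value is in seconds; the shipped default is 2592000, i.e. 30 days). Also double-check the stray </description> closing tag in the snippet you quoted below (there is no matching <description> element) - nutch-site.xml must be well-formed XML.

  <property>
    <name>db.fetch.interval.default</name>
    <value>3600</value>
    <description>Re-fetch pages after one hour (example value).</description>
  </property>

  <property>
    <name>db.injector.update</name>
    <value>true</value>
    <description>Let inject update records that already exist in the crawldb.</description>
  </property>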
Perhaps the only reason I can think of is that the modified URL's fetch status is db_notmodified, so Nutch will not download that URL again. You can check the status of the modified URL with this command: bin/nutch readdb crawldb/ -url http://www.example.com/ . If its status is 6, the page is marked as not modified. (See the P.S. at the very end of this mail, below the quoted thread, for one way to force such a URL to be re-fetched.)

On Tue, Mar 5, 2013 at 7:48 PM, David Philip <[email protected]> wrote:

> Hi,
> I used the less command and checked, it shows the past content, not the modified one. Any other cache clearing from the crawl db? Or any property to set in nutch-site so that it does re-fetch modified content?
>
> - Cleared tomcat cache
> - settings:
>
> <property>
> <name>db.fetch.interval.default</name>
> <value>600</value>
> </description>
> </property>
>
> <property>
> <name>db.injector.update</name>
> <value>true</value>
> </description>
> </property>
>
> Crawl command : bin/nutch crawl urls -solr http://localhost:8080/solrnutch -dir crawltest -depth 10
> This command I executed after 1 hour (modifying some sites' content and titles) but the title or content is still not fetched. The dump (readseg dump) shows old content only :(
>
> To separately update solr, I executed this command : bin/nutch solrindex http://localhost:8080/solrnutch/ crawltest/crawldb -linkdb crawltest/linkdb crawltest/segments/* -deleteGone
> but no success, nothing updated to solr.
>
> *trace :*
> SolrIndexer: starting at 2013-03-05 17:07:15
> SolrIndexer: deleting gone documents
> Indexing 16 documents
> Deleting 1 documents
> SolrIndexer: finished at 2013-03-05 17:09:38, elapsed: 00:02:22
>
> But after this, when I check in solr (http://localhost:8080/solrnutch/) it still shows 16 docs, why can that be? I use nutch 1.5.1 and solr 3.6.
>
> Thanks - David
>
> P.S
> I basically wanted to achieve on-demand re-crawl so that all modified websites get updated in solr, and so when a user searches, he gets accurate results.
>
>
> On Tue, Mar 5, 2013 at 12:54 PM, feng lu <[email protected]> wrote:
>
> > Hi David
> >
> > yes, it's a tomcat web service cache.
> >
> > The dump file can be opened with the "less" command if you use a linux OS. Or you can use
> > "bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/"
> > to dump the information of the specified url.
> >
> >
> > On Tue, Mar 5, 2013 at 3:02 PM, feng lu <[email protected]> wrote:
> >
> > >
> > > On Tue, Mar 5, 2013 at 2:49 PM, David Philip <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> web server cache - you mean /tomcat/work/, where the solr is running? Did you mean that cache?
> > >>
> > >> I tried to use the below command {bin/nutch readseg -dump crawltest/segments/20130304185844/ crawltest/test} and it gives a dump file, format is GMC link (application/x-gmc-link) - I am not able to open it. How to open this file?
> > >>
> > >> However when I ran: bin/nutch readseg -list crawltest/segments/20130304185844/
> > >> NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
> > >> 20130304185844 1 2013-03-04T18:58:53 2013-03-04T18:58:53 1 1
> > >>
> > >> - David
> > >>
> > >>
> > >> On Tue, Mar 5, 2013 at 11:25 AM, feng lu <[email protected]> wrote:
> > >>
> > >> > Hi David
> > >> >
> > >> > Do you clear the web server cache? Maybe the refetch is also crawling the old page.
> > >> >
> > >> > Maybe you can dump the url content to check the modification, using the bin/nutch readseg command.
> > >> >
> > >> > Thanks
> > >> >
> > >> >
> > >> > On Tue, Mar 5, 2013 at 1:28 PM, David Philip <[email protected]> wrote:
> > >> >
> > >> > > Hi Markus,
> > >> > >
> > >> > > So I was trying with the *db.injector.update* point that you mentioned, please see my observations below.
> > >> > > Settings: I set *db.injector.update* to *true* and *db.fetch.interval.default* to *1 hour*.
> > >> > >
> > >> > > *Observation:*
> > >> > >
> > >> > > On first time crawl[1], 14 urls were successfully crawled and indexed to solr.
> > >> > > case 1:
> > >> > > In those 14 urls I modified the content and title of one url (say Aurl) and re-executed the crawl after one hour.
> > >> > > I see that this (Aurl) url is re-fetched (it shows in the log) but at Solr level, for that url (Aurl), the content field and title field didn't get updated.
> > >> > > Why? Should I do any configuration for this to make the solr index get updated?
> > >> > >
> > >> > > case 2:
> > >> > > Added a new url to the crawling site.
> > >> > > The url got indexed - this is a success. So I am interested to know why the above case failed? What configuration needs to be made?
> > >> > >
> > >> > > Thanks - David
> > >> > >
> > >> > > *PS:*
> > >> > > Apologies that I am still asking questions on the same topic. I am not able to find a good way to do incremental crawls, so I am trying different approaches. Once I am clear I will blog this and share it. Thanks a lot for the replies on the mailing list.
> > >> > >
> > >> > >
> > >> > > On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma <[email protected]> wrote:
> > >> > >
> > >> > > > You can simply reinject the records. You can overwrite and/or update the current record. See the db.injector.update and overwrite settings.
> > >> > > >
> > >> > > > -----Original message-----
> > >> > > > > From: David Philip <[email protected]>
> > >> > > > > Sent: Wed 27-Feb-2013 11:23
> > >> > > > > To: [email protected]
> > >> > > > > Subject: Re: Nutch Incremental Crawl
> > >> > > > >
> > >> > > > > HI Markus, I meant overriding the injected interval.. How to override the injected fetch interval?
> > >> > > > > While crawling, the fetch interval was set to 30 days (default). Now I want to re-fetch the same site (that is, to force re-fetch) and not wait for the fetch interval (30 days).. how can we do that?
> > >> > > > >
> > >> > > > > Feng Lu: Thank you for the reference link.
> > >> > > > >
> > >> > > > > Thanks - David
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma <[email protected]> wrote:
> > >> > > > >
> > >> > > > > > The default or the injected interval? The default interval can be set in the config (see nutch-default for example).
> > >> > > > > > Per URL's can be set using the injector: <URL>\tnutch.fixedFetchInterval=86400
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > -----Original message-----
> > >> > > > > > > From: David Philip <[email protected]>
> > >> > > > > > > Sent: Wed 27-Feb-2013 06:21
> > >> > > > > > > To: [email protected]
> > >> > > > > > > Subject: Re: Nutch Incremental Crawl
> > >> > > > > > >
> > >> > > > > > > Hi all,
> > >> > > > > > >
> > >> > > > > > > Thank you very much for the replies. Very useful information to understand how incremental crawling can be achieved.
> > >> > > > > > >
> > >> > > > > > > Dear Markus:
> > >> > > > > > > Can you please tell me how do I override this fetch interval, in case I require to fetch the page before the time interval is passed?
> > >> > > > > > >
> > >> > > > > > > Thanks very much
> > >> > > > > > > - David
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma <[email protected]> wrote:
> > >> > > > > > >
> > >> > > > > > > > If you want records to be fetched at a fixed interval its easier to inject them with a fixed fetch interval.
> > >> > > > > > > >
> > >> > > > > > > > nutch.fixedFetchInterval=86400
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > -----Original message-----
> > >> > > > > > > > > From: kemical <[email protected]>
> > >> > > > > > > > > Sent: Thu 14-Feb-2013 10:15
> > >> > > > > > > > > To: [email protected]
> > >> > > > > > > > > Subject: Re: Nutch Incremental Crawl
> > >> > > > > > > > >
> > >> > > > > > > > > Hi David,
> > >> > > > > > > > >
> > >> > > > > > > > > You can also consider setting shorter fetch interval time with nutch inject.
> > >> > > > > > > > > This way you'll set higher score (so the url is always taken in priority when you generate a segment) and a fetch.interval of 1 day.
> > >> > > > > > > > >
> > >> > > > > > > > > If you have a case similar to me, you'll often want some homepage fetch each day but not their inlinks. What you can do is inject all your seed urls again (assuming those url are only homepages).
> > >> > > > > > > > >
> > >> > > > > > > > > #change nutch option so existing urls can be injected again in conf/nutch-default.xml or conf/nutch-site.xml
> > >> > > > > > > > > db.injector.update=true
> > >> > > > > > > > >
> > >> > > > > > > > > #Add metadata to update score/fetch interval
> > >> > > > > > > > > #the following line will concat to each line of your seed urls files with the new score / new interval
> > >> > > > > > > > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=80000' [your_seed_url_dir]/*
> > >> > > > > > > > >
> > >> > > > > > > > > #run command
> > >> > > > > > > > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > >> > > > > > > > >
> > >> > > > > > > > > Now, the following crawl will take your urls in top priority and crawl them once a day. I've used my situation to illustrate the concept but i guess you can tweek params to fit your needs.
> > >> > > > > > > > >
> > >> > > > > > > > > This way is useful when you want a regular fetch on some urls, if it's occured rarely i guess freegen is the right choice.
> > >> > > > > > > > >
> > >> > > > > > > > > Best,
> > >> > > > > > > > > Mike
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > View this message in context:
> > >> > > > > > > > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > >> > > > > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> > --
> > >> > Don't Grow Old, Grow Up... :-)
> > >>
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> >
> > --
> > Don't Grow Old, Grow Up... :-)
>

--
Don't Grow Old, Grow Up... :-)
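P.S. If readdb shows the url stuck at db_notmodified and you want to push it through the next cycle, here is a rough sketch based on Markus's and Mike's suggestions quoted above. The paths and values are only examples for your crawltest layout, so adjust them as needed:

  # seed file line (metadata is tab-separated): bump the score and shorten the interval for this url
  # (this is what Mike's perl one-liner produces, written out by hand)
  http://www.example.com/	nutch.score=100	nutch.fetchInterval=3600

  # reinject so the existing crawldb record is updated (needs db.injector.update=true in nutch-site.xml)
  bin/nutch inject crawltest/crawldb urls

  # then run the normal generate / fetch / parse / updatedb / solrindex cycle again

Alternatively, if your build has the freegen tool that Mike mentions, it can generate a fetch segment straight from a list of urls without consulting the crawldb at all, e.g. bin/nutch freegen urls crawltest/segments - but I have not tried that on 1.5.1 myself, so treat it as a pointer rather than a recipe.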

