Hi David,

the first steps are right, but maybe it's easier to run the Java classes via bin/nutch:
bin/nutch freegen urls2/ freegen_segments/
# generated: freegen_segments/123
bin/nutch fetch freegen_segments/123
bin/nutch parse freegen_segments/123      (if fetcher.parse == false)
bin/nutch updatedb ... freegen_segments/123 ...
bin/nutch linkdb ... freegen_segments/123 ...

Have a look at the command-line help for more details on the arguments; there is also a wiki page.
But now you don't run the crawl command, just:

bin/nutch solrindex ... freegen_segments/123

The central point is to process only the segment you just created via FreeGenerator.
Either you place it in a new directory, or store its name in a variable.

Cheers,
Sebastian

On 02/04/2013 10:38 AM, David Philip wrote:
> Hi Sebastian,
>
> Thank you for the reply. Executed the following steps, please correct me
> if I am wrong. I do not see the changes updated.
> Run:
>
> - org.apache.nutch.tools.FreeGenerator *arguments*: urls2
>   crawl/segments [urls2/seed.txt - url of the page that was modified]
> - org.apache.nutch.fetcher.Fetcher *arguments*: crawl/segments/*
> - org.apache.nutch.parse.ParseSegment *arguments*: crawl/segments/*
> - org.apache.nutch.crawl.CrawlDb *arguments*:
>   crawlDBOld/crawldb crawl/segments/*
>
> After this I ran the crawl command:
> - org.apache.nutch.crawl.Crawl *arguments*: urls
>   -dir crawlDB_Old -depth 10 -solr http://localhost:8080/solrnutch
>
> But I don't see the changes modified for that url in the solr indexes.
> Can you please tell me which step was skipped or not executed properly? I
> wanted to see the changes reflected in solr indexes for that document.
>
> Thanks - David
>
> On Sat, Feb 2, 2013 at 5:27 AM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
>> Hi David,
>>
>>> So even If there is any modification made on a fetched
>>> page before this interval and the crawl job is run, it will still not be
>>> re-fetched/updated unless this interval is crossed.
>> Yes. That's correct.
>>
>>> is there any way to do immediate update?
>> Yes, provided that you know which documents have been changed, of course.
>> Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x). Start a segment
>> for a list of URLs, fetch and parse it, and update CrawlDb.
>>
>> Sebastian
>>
>> On 02/01/2013 10:32 AM, David Philip wrote:
>>> Hi Team,
>>>
>>> I have a question on nutch re-crawl. Please let me know if there are any
>>> links explaining this.
>>> Q is: does Nutch re-crawl the sites/pages by checking the last date
>>> modified?
>>>
>>> I understand from this
>>> <http://nutch.wordpress.com/category/recrawling/> website that at the
>>> time of a fresh crawl, the *db.default.fetch.interval* parameter will be
>>> set for each URL. This property decides whether or not the page should
>>> be re-fetched on subsequent crawls. So even if a modification is made to
>>> a fetched page before this interval and the crawl job is run, it will
>>> still not be re-fetched/updated unless this interval is crossed. If the
>>> above is the case, is there any way to do an immediate update?
>>>
>>> *Example case:*
>>> Say we crawled a blog-site that had x blogs, and the fetch interval was
>>> set to 7 days. But the blog user added a new blog and also modified an
>>> existing blog within 2 days. In such a case, when we want to immediately
>>> update our crawl data and include the newly added changes [not waiting
>>> for 7 days], what needs to be done? [nutch merge concept?]
>>>
>>> Awaiting reply,
>>> Thanks - David
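P.S.: The sequence above can be wrapped in a small shell script that stores the freshly generated segment name in a variable, as suggested. This is only a sketch: the paths (urls2/, crawl/crawldb) and the Solr URL are taken from this thread and may differ in your setup, and it assumes freegen writes exactly one new timestamped segment into an otherwise dedicated directory.

```shell
#!/bin/sh
# Sketch: re-fetch and re-index a known list of changed URLs with Nutch 1.x.
# Paths are illustrative; adjust crawldb, seed dir, and Solr URL to your setup.
SEGDIR=freegen_segments

bin/nutch freegen urls2/ "$SEGDIR"

# freegen created one new timestamped segment; pick the newest entry so the
# remaining steps touch only this segment, not older ones:
SEGMENT=$(ls -1d "$SEGDIR"/* | sort | tail -n 1)

bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"                 # only if fetcher.parse == false
bin/nutch updatedb crawl/crawldb "$SEGMENT"
bin/nutch solrindex http://localhost:8080/solrnutch crawl/crawldb "$SEGMENT"
```

Because segment names are timestamps, sorting the directory listing and taking the last entry reliably selects the segment just created, as long as nothing else writes into that directory.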