Hi - the easiest method is to use the freegen tool. But if you really want 
homepages, not just domain roots, you can use the hostdb together with freegen.

# Update the hostdb
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/

# Get list of homepages for each host
bin/nutch readhostdb crawl/hostdb/ output -dumpHomepages

Then use freegen on the dumped homepage list to generate a segment from it.
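
As a rough sketch of what that last step could look like, assuming the usual crawl/ layout from the commands above and that you continue with the normal fetch/parse/updatedb cycle afterwards. The function name, the NUTCH variable and the paths are only illustrative, so adjust them to your setup:

```shell
#!/bin/sh
# Sketch only: NUTCH and the directory layout are assumptions, not fixed names.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: recrawl the homepages dumped by readhostdb.
#   $1 = crawl directory (contains crawldb/ and segments/)
#   $2 = directory written by "readhostdb ... -dumpHomepages"
recrawl_homepages() {
    crawl_dir="$1"
    homepages_dir="$2"

    # Generate a segment straight from the URL list, bypassing the
    # CrawlDb fetch schedule
    $NUTCH freegen "$homepages_dir" "$crawl_dir/segments" || return 1

    # Segment names are timestamps, so the newest one sorts last
    segment=$(ls -d "$crawl_dir"/segments/* | sort | tail -n 1)

    # Usual cycle on that segment; newly found links end up in the CrawlDb
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
}
```

With the commands above you would call it as: recrawl_homepages crawl output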

Markus
 
 
-----Original message-----
> From:harsh <[email protected]>
> Sent: Wednesday 24th February 2016 12:49
> To: [email protected]
> Subject: recrawling of specific URLS
> 
> Hi All
> 
> Nutch is made to update ALL the URLs after a certain period of time.
> But I want to recrawl only the home page of each seed URL so that I can get 
> new links from the home page to crawl.
> Currently I am using the bug "Inject command re-inject seed URLS." for 
> recrawling my seed URLs. But this is not the standard way.
> Please give a suggestion. I have read articles/discussions on 
> re-crawling but could not find the solution.
> Lewis, Tejas, please help!
> 
> Thanks
> 
