Re: Nutch Incremental Crawl

David Philip Mon, 04 Feb 2013 01:39:05 -0800

Hi Sebastian,

   Thank you for the reply. I executed the following steps; please correct me
if I am wrong. I do not see the changes reflected.
I ran:


   - org.apache.nutch.tools.FreeGenerator   arguments: urls2 crawl/segments
     [urls2/seed.txt contains the URL of the page that was modified]
   - org.apache.nutch.fetcher.Fetcher       arguments: crawl/segments/*
   - org.apache.nutch.parse.ParseSegment    arguments: crawl/segments/*
   - org.apache.nutch.crawl.CrawlDb         arguments: crawlDBOld/crawldb crawl/segments/*

   After this I ran the crawl command:
   - org.apache.nutch.crawl.Crawl           arguments: urls -dir crawlDB_Old -depth 10 -solr http://localhost:8080/solrnutch


But I don't see the changes for that URL in the Solr index.
Can you please tell me which step was skipped or not executed properly? I
want to see the changes reflected in the Solr index for that document.
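
The steps listed above can be sketched as a dry run that only builds the command lines. This is a minimal sketch, assuming the Nutch 1.x `bin/nutch` launcher and its `freegen`, `fetch`, `parse`, and `updatedb` subcommands (check the usage output of your own version); the function name and the segment name are hypothetical, and nothing is executed.

```python
# Dry-run sketch: build the command lines for one targeted re-fetch cycle.
# Assumption: Nutch 1.x `bin/nutch` with the subcommands named below.
# All paths are placeholders taken from the steps in this message.

def refetch_commands(url_dir, segments_dir, segment, crawldb, nutch="bin/nutch"):
    """Return the commands, in order, as argument lists."""
    return [
        # FreeGenerator: create a new segment from the URL list
        [nutch, "freegen", url_dir, segments_dir],
        # Fetcher: download the pages listed in that segment
        [nutch, "fetch", segment],
        # ParseSegment: parse the fetched content
        [nutch, "parse", segment],
        # CrawlDb: merge the segment's results into the CrawlDb
        [nutch, "updatedb", crawldb, segment],
    ]


if __name__ == "__main__":
    # Hypothetical segment name; the generator creates a timestamped directory.
    for cmd in refetch_commands("urls2", "crawl/segments",
                                "crawl/segments/20130204000000",
                                "crawlDBOld/crawldb"):
        print(" ".join(cmd))
```

The sketch passes one concrete segment directory rather than `crawl/segments/*`, so each step works on the segment that was just generated. Note that these four steps only touch the segment and the CrawlDb; none of them sends documents to Solr.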


Thanks - David









On Sat, Feb 2, 2013 at 5:27 AM, Sebastian Nagel
<[email protected]> wrote:

> Hi David,
>
> > So even if there is any modification made on a fetched
> > page before this interval and the crawl job is run, it will still not be
> > re-fetched/updated unless this interval is crossed.
> Yes. That's correct.
>
> > is there any way to do immediate update?
> Yes, provided that you know which documents have been changed, of course.
> Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x). Start a segment
> for a list of URLs, fetch and parse it, and update CrawlDb.
>
> Sebastian
>
>
> On 02/01/2013 10:32 AM, David Philip wrote:
> > Hi Team,
> >
> >    I have a question on Nutch re-crawl. Please let me know if there are any
> > links explaining this.
> > The question is: does Nutch re-crawl sites/pages by checking the last
> > modified date?
> >
> > I understand from this website
> > <http://nutch.wordpress.com/category/recrawling/> that at the
> > time of a fresh crawl, the
> > db.default.fetch.interval parameter is applied to each URL. This
> > property decides whether or not the page should be re-fetched in
> > subsequent crawls. So even if there is any modification made on a fetched
> > page before this interval and the crawl job is run, it will still not be
> > re-fetched/updated unless this interval is crossed. If the above is the case,
> > is there any way to do an immediate update?
> >
> > Example case:
> > Say we crawled a blog site that had x blogs, and the fetch interval was set to
> > 7 days.
> > But the blog user added a new blog and also modified an existing blog within
> > 2 days. In such a case, when we want to immediately update our crawl data
> > and include the newly added changes [without waiting 7 days], what needs to be
> > done? [Nutch merge concept?]
> >
> >
> >
> >
> > Awaiting reply,
> > Thanks -David
> >
>
>
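
The fetch-interval behaviour discussed in the quoted messages can be illustrated with a small model. This is a simplified sketch, not Nutch's actual scheduling code: the function name `is_due` is hypothetical, and the dates reuse the 7-day interval and 2-day change from the example case.

```python
from datetime import datetime, timedelta


def is_due(last_fetch, interval, now):
    """Return True once the fetch interval has elapsed since the last fetch."""
    return now >= last_fetch + interval


last_fetch = datetime(2013, 2, 1)
interval = timedelta(days=7)  # fetch interval from the example case

# Page modified and crawl re-run after 2 days: not yet due, so not re-fetched.
print(is_due(last_fetch, interval, datetime(2013, 2, 3)))  # False

# Crawl re-run after the full 7 days: the page is due again.
print(is_due(last_fetch, interval, datetime(2013, 2, 8)))  # True
```

In this model a change made inside the interval stays invisible to a regular crawl run, which is why a separate segment built from an explicit URL list is suggested for immediate updates.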
