Hi David,

The first steps are right, but it may be easier to run the Java classes via
bin/nutch:

bin/nutch freegen  urls2/  freegen_segments/
# generated: freegen_segments/123
bin/nutch fetch  freegen_segments/123
bin/nutch parse  freegen_segments/123    # only needed if fetcher.parse == false
bin/nutch updatedb  ...  freegen_segments/123  ...
bin/nutch invertlinks  ...  freegen_segments/123  ...
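
With concrete paths filled in (crawlDBOld/crawldb is taken from your mail
below; the linkdb path is only an assumption for illustration):

bin/nutch updatedb  crawlDBOld/crawldb  freegen_segments/123
bin/nutch invertlinks  crawlDBOld/linkdb  freegen_segments/123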

Have a look at the command-line help for more details on
the arguments; there is also a wiki page.

But now, instead of the crawl command, you run just:

bin/nutch solrindex ... freegen_segments/123
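
For example, with the Solr URL and CrawlDb from your setup (the -linkdb
flag and the exact argument order differ between 1.x versions, so check
the usage printed by bin/nutch solrindex without arguments):

bin/nutch solrindex http://localhost:8080/solrnutch crawlDBOld/crawldb \
  -linkdb crawlDBOld/linkdb freegen_segments/123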

The central point is to process only the segment you just created
via FreeGenerator. Either place it in a dedicated directory, or capture its
path in a shell variable, as sketched below.
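
A minimal sketch of the variable approach (segment names are timestamps,
so the newest one sorts last; paths are illustrative):

SEGMENT=$(ls -d freegen_segments/* | sort | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
# then updatedb, invertlinks, and solrindex with "$SEGMENT" as above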

Cheers,
Sebastian

On 02/04/2013 10:38 AM, David Philip wrote:
> Hi Sebastian,
> 
>    Thank you for the reply. I executed the following steps; please correct me
> if I am wrong, because I do not see the changes updated.
> Run:
> 
>    - org.apache.nutch.tools.FreeGenerator  arguments: urls2
>      crawl/segments  [urls2/seed.txt - URL of the page that was modified]
>    - org.apache.nutch.fetcher.Fetcher  arguments: crawl/segments/*
>    - org.apache.nutch.parse.ParseSegment  arguments: crawl/segments/*
>    - org.apache.nutch.crawl.CrawlDb  arguments: crawlDBOld/crawldb crawl/segments/*
> 
>    After this I ran the crawl command:
>    - org.apache.nutch.crawl.Crawl  arguments: urls
>      -dir crawlDB_Old -depth 10 -solr http://localhost:8080/solrnutch
> 
> 
>  But I don't see the changes for that URL reflected in the Solr index.
> Can you please tell me which step was skipped or not executed properly? I
> want to see the changes reflected in the Solr index for that document.
> 
> 
> Thanks - David
> 
> On Sat, Feb 2, 2013 at 5:27 AM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
> 
>> Hi David,
>>
>>> So even if there is any modification made to a fetched
>>> page before this interval and the crawl job is run, it will still not be
>>> re-fetched/updated unless this interval is crossed.
>> Yes. That's correct.
>>
>>> is there any way to do immediate update?
>> Yes, provided that you know which documents have been changed, of course.
>> Have a look at o.a.n.tools.FreeGenerator (Nutch 1.x): generate a segment
>> from a list of URLs, fetch and parse it, and update the CrawlDb.
>>
>> Sebastian
>>
>>
>> On 02/01/2013 10:32 AM, David Philip wrote:
>>> Hi Team,
>>>
>>>    I have a question on Nutch re-crawl. Please let me know if there are
>>> any links explaining this.
>>> The question is: does Nutch re-crawl sites/pages by checking the last
>>> modified date?
>>>
>>> I understand from this website
>>> <http://nutch.wordpress.com/category/recrawling/> that at the time of a
>>> fresh crawl, the db.default.fetch.interval parameter will be set for
>>> each URL. This property decides whether or not the page should be
>>> re-fetched in subsequent crawls. So even if there is any modification
>>> made to a fetched page before this interval and the crawl job is run, it
>>> will still not be re-fetched/updated unless this interval is crossed. If
>>> that is the case, is there any way to do an immediate update?
>>>
>>> Example case:
>>> Say we crawled a blog site that had x blogs, and the fetch interval was
>>> set to 7 days. But the blog user added a new blog post and also modified
>>> an existing one within 2 days. In such a case, when we want to update
>>> our crawl data immediately and include the new changes [not waiting for
>>> 7 days], what needs to be done? [Nutch merge concept?]
>>>
>>> Awaiting reply,
>>> Thanks -David
>>>
>>
>>
> 
