Thanks, Markus.

I cannot use freegen, as that tool is not available via the REST API.

I achieved what I needed by combining the -adddays and -expr options of the 
generator.
Here is what I did:
1. Inject the URLs with some metadata, e.g. pageId=<unique value> (the inject 
command itself is sketched below, after step 2).
        The seed file contains the entry below:
        http://localhost:9090/nutchsite/html/page1.html pageId=<unique value>

2. Now issue the generate command with the -adddays option (to make all the 
URLs due for fetch) and the -expr option (to filter the URLs), so that only 
the URLs to be fetched again are selected, as below:
        $ bin/nutch generate examplesite/crawldb examplesite/segments -expr "(pageId == '<unique value>')" -adddays 30
        
Please comment if you see any issues with this approach.

Thanks
Sujan

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, October 06, 2016 7:32 PM
To: user@nutch.apache.org
Subject: RE: nutch 1.12 How can I force a URL to get re-indexed

Hi

You can use -adddays N in the generator job to fool it, or just use a lower 
interval. Or, use the freegen tool to immediately crawl a set of URLs.
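
For reference, these options would roughly look like the following (a sketch 
only, assuming the standard 1.x command layout; crawl/ and urls/ are 
placeholder paths):

        # shift the reference time forward so URLs become due for fetching
        $ bin/nutch generate crawl/crawldb crawl/segments -adddays 30

        # or lower the global re-fetch interval in conf/nutch-site.xml:
        # db.fetch.interval.default is in seconds (2592000 = 30 days)

        # or build a segment directly from a plain list of URLs,
        # bypassing the CrawlDb's scheduling entirely
        $ bin/nutch freegen urls/ crawl/segments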

Markus

 
 
-----Original message-----
> From:Sujan Suppala <ssupp...@opentext.com>
> Sent: Thursday 6th October 2016 15:56
> To: user@nutch.apache.org
> Subject: nutch 1.12 How can I force a URL to get re-indexed
> 
> Hi,
> 
> By default, Nutch fetches a URL based on the already set next-fetch 
> interval (30 days). If a page is updated before this interval (30 days) 
> elapses, how can I force it to be re-indexed?
> 
> How can I just 're-inject' the URLs to set the next fetch date to 
> 'immediately'?
> 
> FYI, I am using the Nutch REST API client to index the URLs.
> 
> Thanks
> Sujan
> 
