Thanks, Markus.
I cannot use freegen as that tool is not available via the REST API.
With the combination of the -adddays and -expr options of the generator I
achieved my requirement.
Here is what I did:
1. Inject the URLs with some metadata, say pageId=<unique value> (see the
inject sketch after this list).
The seed file contains the entry below:
http://localhost:9090/nutchsite/html/page1.html pageId=<unique value>
2. Now issue the generate command with -adddays (to make all the URLs due for
fetch) and -expr (to filter the URLs), so that only the URLs to be fetched
again are selected:
$ bin/nutch generate examplesite/crawldb examplesite/segments \
    -expr "(pageId == '<unique value>')" -adddays 30
Please comment if you see any issues with this approach.
Thanks
Sujan
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, October 06, 2016 7:32 PM
To: [email protected]
Subject: RE: nutch 1.12 How can I force a URL to get re-indexed
Hi
You can use -adddays N in the generator job to fool it, or just use a lower
interval. Or, use the freegen tool to immediately crawl a set of URLs.
Markus
-----Original message-----
> From:Sujan Suppala <[email protected]>
> Sent: Thursday 6th October 2016 15:56
> To: [email protected]
> Subject: nutch 1.12 How can I force a URL to get re-indexed
>
> Hi,
>
> By default Nutch fetches a URL based on the already-set next fetch interval
> (30 days). If a page is updated before this interval elapses, how can I
> force it to be re-indexed?
>
> How can I just 're-inject' the URLs to set the next fetch date to
> 'immediately'?
>
> FYI, I am using the Nutch REST API client to index the URLs.
>
> Thanks
> Sujan
>