Thanks.
I will try to patch CrawlDBFilter for both problems.
On 10/14/2011 05:35 PM, Markus Jelsma wrote:
On Friday 14 October 2011 15:30:25 Sergey A Volkov wrote:
Thanks for your quick reply.
I will try to use scoreupdater next time=)
Keep in mind that it relies on the WebGraph program. Another quick fix would
be to patch CrawlDBFilter to reset score based on the presence of some
configuration setting.
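To make the suggestion concrete: the idea is that during a CrawlDb map pass, every record's score gets overwritten when a configuration flag is present. The sketch below is self-contained and illustrative only — the `Datum` class and the `resetScore` flag are stand-ins, not Nutch's actual CrawlDatum or CrawlDbFilter API.

```java
// Minimal sketch of the score-reset idea: during a map pass over the
// CrawlDb, reset each record's score when a (hypothetical) config flag
// is set. Stand-in types, not Nutch's real classes.
import java.util.ArrayList;
import java.util.List;

public class ScoreResetSketch {
    // Stand-in for a CrawlDb record.
    static class Datum {
        String url;
        float score;
        Datum(String url, float score) { this.url = url; this.score = score; }
    }

    // Map-side transform: if resetScore is true (as read from the job
    // configuration), overwrite the record's score with the default.
    static Datum map(Datum d, boolean resetScore, float defaultScore) {
        if (resetScore) {
            d.score = defaultScore;
        }
        return d;
    }

    public static void main(String[] args) {
        List<Datum> db = new ArrayList<>();
        db.add(new Datum("http://example.com/", 3.7f));
        db.add(new Datum("http://example.org/", 0.2f));
        for (Datum d : db) {
            map(d, true, 1.0f); // flag set -> every score becomes 1.0
        }
        for (Datum d : db) {
            System.out.println(d.url + " " + d.score);
        }
    }
}
```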
Unfortunately -addDays would not work for me because I want to refetch
only specified domains, not the whole db (my first question was not correct).
Another problem with -addDays and FetchSchedule is that I have to keep
generate.topN lower than the size of the part to refetch (there are time
restrictions on the index update), so I can't determine when to stop using -addDays.
If you only want to generate fetch lists for specific domains you can use a
custom domain URL filter with the generator.
Take care when using a filter with the generator if you also update the DB, as
you'll lose all filtered URLs then.
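The core of such a domain filter is simple: pass a URL through only if its host falls under an allowed domain, otherwise drop it (Nutch URL filters signal a drop by returning null). Below is a stand-alone sketch of that idea; the allowed-domain list is an assumption and this is not the actual urlfilter-domain plugin code.

```java
// Sketch of the idea behind a domain URL filter: keep a URL only if its
// host is, or is under, one of the allowed domains; return null to drop
// it, in the style of Nutch URL filters. Illustrative only.
import java.net.URI;

public class DomainFilterSketch {
    // Assumed whitelist of domains to refetch.
    static final String[] ALLOWED = { "example.com", "example.org" };

    // Returns the URL unchanged if its host is in an allowed domain, else null.
    static String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) return null;
            for (String domain : ALLOWED) {
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return url;
                }
            }
        } catch (Exception e) {
            // unparsable URL: drop it
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://www.example.com/page")); // kept
        System.out.println(filter("http://other.net/page"));       // dropped -> null
    }
}
```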
On Fri 14 Oct 2011 04:52:33 PM MSK, Markus Jelsma wrote:
There are no tools for resetting the score but it would not be hard to
modify an existing tool for that, e.g. WebGraph's scoreupdater tool. You
can force refetch by using the -addDays switch with the generator tool.
It'll add numDays to the current time to generate records that are not
yet due for fetch.
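The due check being described can be sketched in a few lines: a record is selected for the fetch list when its scheduled fetch time is at or before "now + addDays". The method and field names below are illustrative, not the generator's actual internals.

```java
// Self-contained sketch of the generator's due check with -addDays:
// a record qualifies when fetchTime <= now + addDays. Names are
// illustrative, not Nutch's internal API.
import java.util.concurrent.TimeUnit;

public class AddDaysSketch {
    // True if the record would be generated under the given -addDays value.
    static boolean isDue(long fetchTimeMs, long nowMs, int addDays) {
        long horizon = nowMs + TimeUnit.DAYS.toMillis(addDays);
        return fetchTimeMs <= horizon;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long inTenDays = now + TimeUnit.DAYS.toMillis(10);
        System.out.println(isDue(inTenDays, now, 0));  // false: not yet due
        System.out.println(isDue(inTenDays, now, 30)); // true: -addDays 30 pulls it in
    }
}
```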
On Friday 14 October 2011 14:48:47 Sergey A Volkov wrote:
Hi!
Is there any good way to modify all CrawlDb records (e.g. drop the score or
force a refetch)?
I'm currently using Nutch 1.2 and, as far as I can see, the only way to do
this is to write my own MapReduce task for every modification, or to change
the CrawlDb updater and write my own extension point.
Sergey Volkov.