Hi Michael,
I think one way to do it is to run
`readdb <crawldb> -dump new_crawldb -format crawldb -expr "score>0.03"`,
which should write only the entries scoring above 0.03 to new_crawldb, in
crawldb format.
You would then need to use hdfs commands to replace the existing
<crawldb>/current with new_crawldb.
Of course, I strongly recommend backing up the current crawldb before replacing
it...
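
Roughly, the whole sequence could look like the sketch below. I have not run
this exact script: the three paths are placeholders, `run` and
`purge_low_scores` are just helper names I made up, and I am assuming the dump
output has the same layout as <crawldb>/current (check with `hdfs dfs -ls`
before swapping anything). It only prints the commands unless you set
DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch only: paths are placeholders, adjust them to your own crawl.
set -euo pipefail

CRAWLDB="${CRAWLDB:-crawl/crawldb}"        # assumed crawldb location
NEWDB="${NEWDB:-crawl/new_crawldb}"        # assumed output dir for the filtered dump
BACKUP="${BACKUP:-crawl/crawldb_backup}"   # assumed backup location
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print the commands, 0 = execute them

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

purge_low_scores() {
  # 1. Back up the whole crawldb first.
  run hdfs dfs -cp "$CRAWLDB" "$BACKUP"

  # 2. Dump only the entries with score > 0.03, in crawldb format.
  run bin/nutch readdb "$CRAWLDB" -dump "$NEWDB" -format crawldb -expr "score>0.03"

  # 3. Swap the filtered dump in for current.
  #    Assumption: $NEWDB has the same layout as $CRAWLDB/current.
  run hdfs dfs -mv "$CRAWLDB/current" "$CRAWLDB/current.old"
  run hdfs dfs -mv "$NEWDB" "$CRAWLDB/current"
}

purge_low_scores
```

Moving the old current aside rather than deleting it gives you a second way
back, on top of the backup, if the new crawldb turns out to be wrong.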
Yossi.
> -----Original Message-----
> From: Michael Coffey [mailto:[email protected]]
> Sent: 04 December 2017 20:38
> To: User <[email protected]>
> Subject: purging low-scoring urls
>
> Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> many thousands of zero-scoring urls and also many thousands of urls with
> scores less than 0.03. These urls will never be fetched because they will
> never make it into the generator's topN by score. So, all they do is make
> the process slower.
>
> It seems like something an urlfilter could do, but I have not found any
> documentation for any urlfilter that does it.