Forgot to say: a urlfilter can't do that, since its input is just the URL, without any metadata such as the score.
> -----Original Message-----
> From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey' <mcof...@yahoo.com>
> Subject: RE: purging low-scoring urls
>
> Hi Michael,
>
> I think one way you can do it is using
> `readdb <crawldb> -dump new_crawldb -format crawldb -expr "score>0.03"`.
> You would then need to use hdfs commands to replace the existing
> <crawldb>/current with new_crawldb.
> Of course, I strongly recommend backing up the current crawldb before
> replacing it...
>
> Yossi.
>
> > -----Original Message-----
> > From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> > Sent: 04 December 2017 20:38
> > To: User <user@nutch.apache.org>
> > Subject: purging low-scoring urls
> >
> > Is it possible to purge low-scoring urls from the crawldb? My news crawl
> > has many thousands of zero-scoring urls and also many thousands of urls
> > with scores less than 0.03. These urls will never be fetched, because they
> > will never make it into the generator's topN by score. So all they do is
> > make the process slower.
> >
> > It seems like something an urlfilter could do, but I have not found any
> > documentation for an urlfilter that does it.
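For reference, the procedure Yossi describes could look roughly like the following. This is only a sketch: the crawldb path `crawl/crawldb`, the `bin/nutch` invocation, and running on HDFS via `hdfs dfs` are assumptions; adjust paths and commands for your own deployment (use plain `mv`/`cp` if the crawldb is on the local filesystem).

```shell
# 1. Back up the current crawldb before touching it (strongly recommended).
hdfs dfs -cp crawl/crawldb/current crawl/crawldb/current.bak

# 2. Dump a filtered copy of the crawldb in crawldb format, keeping only
#    entries whose score exceeds 0.03 (Jexl expression).
bin/nutch readdb crawl/crawldb -dump new_crawldb -format crawldb -expr "score>0.03"

# 3. Swap the filtered copy in as the live crawldb.
hdfs dfs -rm -r crawl/crawldb/current
hdfs dfs -mv new_crawldb crawl/crawldb/current
```

Note that this drops low-scoring entries entirely, so Nutch will treat those URLs as unseen if it encounters them again; keep the backup until you are sure the crawl behaves as expected.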