Forgot to say: a urlfilter can't do that, since its input is just the URL, 
without any metadata such as the score.
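The readdb approach from the message below can be sketched as follows. This is a sketch, not a tested recipe: the paths (`crawl/crawldb`, `crawl/new_crawldb`) and the 0.03 threshold are examples, and it assumes a Nutch 1.x build whose readdb supports the Jexl `-expr` option. Back up first, as noted below.

```shell
# Dump a filtered copy of the crawldb in crawldb format,
# keeping only entries whose score exceeds 0.03.
# Example paths; run from the Nutch home directory.
bin/nutch readdb crawl/crawldb -dump crawl/new_crawldb -format crawldb -expr "score>0.03"

# Keep a backup of the current crawldb segment before replacing it.
hdfs dfs -mv crawl/crawldb/current crawl/crawldb/current.bak

# Swap in the filtered copy.
hdfs dfs -mv crawl/new_crawldb crawl/crawldb/current
```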

> -----Original Message-----
> From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey' <mcof...@yahoo.com>
> Subject: RE: purging low-scoring urls
> 
> Hi Michael,
> 
> I think one way you can do it is using `readdb <crawldb> -dump new_crawldb
> -format crawldb -expr "score>0.03"`.
> You would then need to use hdfs commands to replace the existing
> <crawldb>/current with new_crawldb.
> Of course, I strongly recommend backing up the current crawldb before
> replacing it...
> 
>       Yossi.
> 
> > -----Original Message-----
> > From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> > Sent: 04 December 2017 20:38
> > To: User <user@nutch.apache.org>
> > Subject: purging low-scoring urls
> >
> > Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> > many thousands of zero-scoring urls and also many thousands of urls with
> > scores less than 0.03. These urls will never be fetched because they will
> > never make it into the generator's topN by score. So, all they do is make
> > the process slower.
> >
> > It seems like something an urlfilter could do, but I have not found any
> > documentation for any urlfilter that does it.

