date:20171204

RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari

Forgot to say: a urlfilter can't do that, since its input is just the URL, without any metadata such as the score.

> -Original Message-
> From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey'
>
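The point about URL filters can be illustrated with a small sketch. It is not Nutch code: `CrawlRecord`, `url_filter`, `record_filter`, the example URLs, and the `.jpg` rule are all hypothetical, and the 0.03 threshold is borrowed from the question in this thread. It only shows that a filter whose sole input is the URL string has nothing to compare a score against, while a filter over whole records does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRecord:
    """Hypothetical stand-in for a crawldb entry: a URL plus its metadata."""
    url: str
    score: float


def url_filter(url: str) -> Optional[str]:
    """Sees only the URL string; returns it to keep it, or None to drop it."""
    # Illustrative rule only: the decision can depend on the URL text and nothing else.
    return None if url.endswith(".jpg") else url


def record_filter(record: CrawlRecord, min_score: float = 0.03) -> bool:
    """Sees the whole record, so it can keep or drop based on the score."""
    return record.score > min_score


records = [
    CrawlRecord("http://example.com/a", 0.0),
    CrawlRecord("http://example.com/b", 0.5),
]

# The URL-only filter cannot tell the two records apart by score.
kept_by_url = [r.url for r in records if url_filter(r.url) is not None]
# The record-level filter drops the zero-scoring entry.
kept_by_score = [r.url for r in records if record_filter(r)]
```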

RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari

Hi Michael, I think one way you can do it is using `readdb <crawldb> -dump new_crawldb -format crawldb -expr "score>0.03"`. You would then need to use hdfs commands to replace the existing `<crawldb>/current` with new_crawldb. Of course, I strongly recommend backing up the current crawldb before replacing it...
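The steps in this reply (dump with a score filter, back up, swap in the new data) could be planned roughly as follows. This is a dry-run sketch under several assumptions: the paths `crawl/crawldb` and `crawl/crawldb_backup` are hypothetical, `bin/nutch` is assumed to be the launcher, the backup is done by moving `current` aside with `hdfs dfs -mv`, and the dump directory is assumed to be usable as `current` as the reply suggests. The function only builds the command lines and does not run anything, so the plan can be reviewed first.

```python
import shlex


def plan_purge(crawldb, filtered_dir, backup_dir, min_score=0.03):
    """Return the purge steps as argv lists. Nothing is executed here."""
    current = f"{crawldb}/current"
    return [
        # 1. Dump only the records whose score passes the expression,
        #    written in crawldb format (command taken from the reply above).
        ["bin/nutch", "readdb", crawldb, "-dump", filtered_dir,
         "-format", "crawldb", "-expr", f"score>{min_score}"],
        # 2. Back up the live data by moving it aside (assumed backup strategy).
        ["hdfs", "dfs", "-mv", current, backup_dir],
        # 3. Move the filtered dump into place as the new "current".
        ["hdfs", "dfs", "-mv", filtered_dir, current],
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the commands for review instead of running them.
    for argv in plan_purge("crawl/crawldb", "new_crawldb", "crawl/crawldb_backup"):
        print(shlex.join(argv))
```

Keeping the backup as a plain move means a rollback is just the reverse move, provided no crawl job touches the crawldb while the swap is in progress.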

crawlcomplete

2017-12-04 Thread Yossi Tamari

Hi, I'm trying to understand some of the design decisions behind the crawlcomplete tool. I find the concept itself very useful, but there are a couple of behaviors that I don't understand: 1. URLs that resulted in a redirect (even a permanent one) are counted as unfetched. That means that if I

purging low-scoring urls

2017-12-04 Thread Michael Coffey

Is it possible to purge low-scoring urls from the crawldb? My news crawl has many thousands of zero-scoring urls and also many thousands of urls with scores less than 0.03. These urls will never be fetched because they will never make it into the generator's topN by score. So, all they do is

4 matches
