Hi there,

Several URLs in my crawldb are not being fetched, and I'm baffled as to why.
It seems that regex-urlfilter.txt is not blocking them, but as I understand it that filter only stops URLs from getting /into/ the crawldb and is not used by the generate step for selection. I keep a log of the fetch operation, and the missing URLs are never mentioned in it, so it appears they are not being selected by the generate step.

Does anyone have recommendations for how I can debug this further? Or: what can cause the generate step to consistently skip a URL?

Here is an example entry from a readdb -dump:

  http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/   Version: 7
  Status: 1 (db_unfetched)
  Fetch time: Wed Oct 06 17:55:00 BST 2010
  Modified time: Thu Jan 01 01:00:00 GMT 1970
  Retries since fetch: 0
  Retry interval: 7200 seconds (0 days)
  Score: 8.278514E-11
  Signature: null
  Metadata:

Thanks,
Mike
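P.S. In case it helps, here is what I am planning to try next from the command line, if someone can confirm it is the right approach. The crawl/crawldb and crawl/segments paths are just my local layout, and I am assuming the URLFilterChecker class and the generate -noFilter switch are available in my Nutch version:

  # Inspect the CrawlDatum for the one problem URL directly
  bin/nutch readdb crawl/crawldb -url \
    http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/

  # Check whether the URL passes all configured URL filters
  # (reads URLs from stdin, prints + for accepted, - for rejected)
  echo "http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

  # Run a generate with URL filtering disabled, to rule the filters
  # out entirely, then check whether the URL shows up in the segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -noFilter

If the URL passes the filter check but still never appears in a generated segment, I assume the problem is in the selection itself rather than the filters.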

