Hi there,

Several URLs in my crawldb are not being fetched, and I'm baffled as to why.
It seems that regex-urlfilter.txt is not blocking them, but as I understand it that filter only stops URLs from getting /into/ the crawldb and is not used by the generate step for selection. I keep a log of the fetch operation, and the missing URLs are never mentioned in it, so it appears they are not being selected by the generate step.

Does anyone have recommendations for how I can debug this further? Or: what can cause the generate step to consistently skip a URL?

Here is an example entry from a readdb -dump:

  http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/   Version: 7
  Status: 1 (db_unfetched)
  Fetch time: Wed Oct 06 17:55:00 BST 2010
  Modified time: Thu Jan 01 01:00:00 GMT 1970
  Retries since fetch: 0
  Retry interval: 7200 seconds (0 days)
  Score: 8.278514E-11
  Signature: null
  Metadata:

Thanks,
Mike
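P.S. In case it helps, here is what I am planning to try next from the command line, if someone can confirm it is the right approach. The crawl/crawldb and crawl/segments paths are just my local layout, and I am assuming the URLFilterChecker class and the generate -noFilter switch are available in my Nutch version:

  # Inspect the CrawlDatum for the one problem URL directly
  bin/nutch readdb crawl/crawldb -url \
    http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/

  # Check whether the URL passes all configured URL filters
  # (reads URLs from stdin, prints + for accepted, - for rejected)
  echo "http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

  # Run a generate with URL filtering disabled, to rule the filters
  # out entirely, then check whether the URL shows up in the segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -noFilter

If the URL passes the filter check but still never appears in a generated segment, I assume the problem is in the selection itself rather than the filters.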

