Mike, the fetch will be based on the score of the URL. Higher-scoring URLs are selected first by the generate step.
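One way to check where a URL ranks is to sort a `readdb -dump` by score; URLs below the generate `-topN` cutoff will sit at the bottom of the list. A minimal sketch, assuming a plain-text dump where the URL appears on the `Version:` line and the score on a `Score:` line (the sample records and paths below are made up for illustration):

```shell
# Hypothetical input; a real dump would come from something like:
#   bin/nutch readdb crawl/crawldb -dump dumpdir
# Two sample records are inlined so this sketch is self-contained.
cat > /tmp/crawldb_dump.txt <<'EOF'
http://example.com/high	Version: 7
Status: 1 (db_unfetched)
Score: 1.5
http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/	Version: 7
Status: 1 (db_unfetched)
Score: 8.278514E-11
EOF

# Pair each URL with its Score, then sort descending by score.
# sort -g understands scientific notation like 8.278514E-11.
awk '/Version:/ { url = $1 }
     /^Score:/  { printf "%s\t%s\n", $2, url }' /tmp/crawldb_dump.txt |
  sort -g -r
```

With a very small score such as 8.278514E-11, the URL lands at the end of the ranking, so it will only be generated once the higher-scoring URLs are exhausted or `-topN` is raised.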
Thanks and Regards,
Sonal
Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal

On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney <[email protected]> wrote:
>
> Hi there,
>
> Several URLs in my crawldb are not being fetched, and I'm baffled as to
> why.
>
> It seems like the regex-urlfilter.txt is not blocking them, but as I
> understand it this filter stops URLs from getting /into/ the crawldb, and
> is not used by the generate step for selection.
>
> I keep a log of the fetch operation, and the missing URLs are never
> mentioned, so it appears that they are not being selected by the generate
> step.
>
> Does anyone have any recommendations for how I can debug this further? Or
> what can cause the generate step to consistently not select a URL?
>
> An example is this readdb -dump fragment:
>
> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 06 17:55:00 BST 2010
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 7200 seconds (0 days)
> Score: 8.278514E-11
> Signature: null
> Metadata:
>
> Thanks,
>
> Mike

