Sonal was correct in my case (thanks!), though I had been sure something else was awry.
Running a generate --topN 10000, to ensure all documents in the crawldb (of 9600 URLs) were selected, did indeed pick up the missing URLs. Now I just need to work out why they were scored so poorly - they were quite sensible documents.

On 12 Oct 2010, at 15:11, McGibbney, Lewis John wrote:

> Hello list,
>
> Has this issue been solved?
>
> I am having identical problems with URLs that are not being fetched, even
> though they are specified in the urls list and allowed by
> regex-urlfilter.txt. My log does not mention anything about the URLs that
> are consistently being ignored by the crawling process.
>
> These are a selection of the URLs which are not being fetched:
>
> http://www.scotland.gov.uk/Home
> http://www.scotland.gov.uk/Topics
>
> Thank you
>
> Lewis
>
> -----Original Message-----
> From: Sonal Goyal [mailto:[email protected]]
> Sent: 12 October 2010 11:17
> To: [email protected]
> Subject: Re: Issues with certain URLs not being fetched.
>
> Mike, the fetch will be based on the score of the URL. Higher-scoring URLs
> are selected first.
>
> Thanks and Regards,
> Sonal
>
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>
> On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney <[email protected]> wrote:
>
>> Hi there,
>>
>> Several URLs in my crawldb are not being fetched, and I'm baffled as to
>> why.
>>
>> It seems that regex-urlfilter.txt is not blocking them, but as I
>> understand it this filter stops URLs from getting /into/ the crawldb; it
>> is not used by the generate step for selection.
>>
>> I keep a log of the fetch operation, and the missing URLs are never
>> mentioned, so it appears that they are not being selected by the generate
>> step.
>>
>> Does anyone have any recommendations for how I can debug this further? Or
>> what can cause the generate step to consistently skip a URL?
>>
>> An example is this readdb -dump fragment:
>>
>> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Oct 06 17:55:00 BST 2010
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 7200 seconds (0 days)
>> Score: 8.278514E-11
>> Signature: null
>> Metadata:
>>
>> Thanks,
>>
>> Mike

--
Mike Pountney
Information Systems Manager, Semantico Limited
<mailto:[email protected]> <tel:+44 1273 358 209>
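The behaviour Sonal describes can be illustrated with a minimal sketch: the generate step ranks crawldb entries by score and selects only the N highest for fetching, so an entry with a score like 8.278514E-11 falls off the end of the list unless topN is at least the size of the crawldb (as with Mike's --topN 10000 on 9600 URLs). The selection function and the scores below (apart from the dumped URL's) are invented for illustration, not Nutch's actual code:

```python
def select_top_n(crawldb, top_n):
    """Return the top_n highest-scoring URLs, as the generate step would."""
    ranked = sorted(crawldb, key=crawldb.get, reverse=True)
    return set(ranked[:top_n])

# A toy crawldb of url -> score; only the last score comes from the
# readdb -dump fragment above, the rest are made up.
crawldb = {
    "http://dom.semantico.net/blog/": 1.0,
    "http://dom.semantico.net/blog/archives/": 0.02,
    "http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/": 8.278514e-11,
}

low = "http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/"

# With a small topN, the low-scoring URL is never selected...
print(low in select_top_n(crawldb, 2))      # False
# ...but with topN >= the crawldb size, it is picked up.
print(low in select_top_n(crawldb, 10000))  # True
```

This is why such a URL never appears in the fetch log: it is silently cut off by the topN threshold rather than rejected by any filter.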

