Hello list,

Has this issue been solved?

I am having identical problems: URLs which are specified in my urls list and allowed by regex-urlfilter.txt are not being fetched. My log does not mention the URLs which are consistently being ignored by the crawling process.

These are a selection of the URLs which are not being fetched:

http://www.scotland.gov.uk/Home
http://www.scotland.gov.uk/Topics

Thank you
Lewis

-----Original Message-----
From: Sonal Goyal [mailto:[email protected]]
Sent: 12 October 2010 11:17
To: [email protected]
Subject: Re: Issues with certain URLs not being fetched.

Mike, the fetch will be based on the score of the url. Higher scoring urls are selected first.

Thanks and Regards,
Sonal

Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal

On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney <[email protected]> wrote:

>
> Hi there,
>
> Several URLs in my crawldb are not being fetched, and I'm baffled as to
> why.
>
> It seems like the regex-urlfilter.txt is not blocking them, but as I
> understand it this filter stops URLs from getting /into/ the crawldb, and
> not used by the generate step for selection.
>
> I keep a log of the fetch operation, and the missing URLs are never
> mentioned, so it appears that they are not being selected by the generate
> step.
>
> Does anyone have any recommendations for how I can debug this further? Or
> what can prevent the generate step from consistently not selecting a URL?
>
> An example is this readdb -dump fragment:
>
> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 06 17:55:00 BST 2010
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 7200 seconds (0 days)
> Score: 8.278514E-11
> Signature: null
> Metadata:
>
> Thanks,
>
> Mike
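
For anyone debugging the same symptom, here is a minimal sketch of how score-ordered selection with a cut-off could leave a due URL unselected, using the record from Mike's readdb -dump fragment. It is not Nutch's generator code: select_top_n, the top_n cut-off, the two example.com competitors and their scores are all made up for illustration, the record layout (URL on its own line) is assumed from the fragment, and the time zone in "Fetch time" is simply ignored.

```python
from datetime import datetime

# Record copied from the readdb -dump fragment in this thread, one field per line.
DUMP_RECORD = """\
http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 06 17:55:00 BST 2010
Retries since fetch: 0
Retry interval: 7200 seconds (0 days)
Score: 8.278514E-11
"""


def parse_record(text):
    """Turn one dump record into a dict keyed by field name."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = {"url": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, so clock times stay intact.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def fetch_time(record):
    """Parse 'Fetch time', dropping the zone token (assumption: zones ignored)."""
    _, month, day, clock, _, year = record["Fetch time"].split()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def select_top_n(records, now, top_n):
    """Hypothetical selection: keep due records, highest score first, cut at top_n."""
    due = [r for r in records if fetch_time(r) <= now]
    due.sort(key=lambda r: float(r["Score"]), reverse=True)
    return due[:top_n]


mike = parse_record(DUMP_RECORD)
# Two made-up competitors with higher scores, purely for illustration.
others = [
    dict(mike, url="http://example.com/a", Score="1.0"),
    dict(mike, url="http://example.com/b", Score="0.5"),
]
now = datetime(2010, 10, 12, 15, 40)
selected = select_top_n([mike] + others, now, top_n=2)
print([r["url"] for r in selected])
```

In this sketch Mike's record is due (its fetch time is before "now"), yet with a score of 8.278514E-11 it sorts last and falls outside a cut-off of 2. Raising the cut-off to 3 lets it through, which would be consistent with Sonal's point that higher scoring urls are selected first.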

