Sonal was correct in my case (Thanks!), though I had been sure something else
was awry.

Running a generate with -topN 10000, to ensure all documents in the crawldb
(of 9600 URLs) were selected, did indeed pick up the missing URLs.
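
For reference, the invocation was roughly as follows (the crawldb and
segments paths are placeholders for my local layout):

    # Select up to 10,000 top-scoring URLs - more than the 9600 in the
    # crawldb, so every unfetched URL is eligible for the fetch list
    bin/nutch generate crawl/crawldb crawl/segments -topN 10000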

Now I just need to work out why they were scored poorly - they were quite 
sensible documents.
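
If anyone else hits this, the score of an individual entry can be checked
directly with readdb (the crawldb path is again a placeholder):

    # Print the CrawlDatum for a single URL, including its score
    bin/nutch readdb crawl/crawldb -url \
        http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/

    # Or dump the N highest-scoring URLs to a directory, to see what is
    # winning selection ahead of the missing pages
    bin/nutch readdb crawl/crawldb -topN 20 topurls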


On 12 Oct 2010, at 15:11, McGibbney, Lewis John wrote:

> Hello list,
> 
> Has this issue been solved?
> 
> I am having an identical problem: URLs that are specified in my urls list
> and allowed by regex-urlfilter.txt are not being fetched. My log does not
> mention the URLs that are consistently being ignored by the crawling process.
> 
> Here is a selection of the URLs that are not being fetched:
> 
> http://www.scotland.gov.uk/Home
> 
> http://www.scotland.gov.uk/Topics
> 
> Thank you
> 
> Lewis
> 
> 
> 
> 
> -----Original Message-----
> From: Sonal Goyal [mailto:[email protected]]
> Sent: 12 October 2010 11:17
> To: [email protected]
> Subject: Re: Issues with certain URLs not being fetched.
> 
> Mike, the fetch will be based on the score of the URL. Higher-scoring URLs
> are selected first.
> 
> Thanks and Regards,
> Sonal
> 
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
> 
> 
> 
> 
> 
> On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney
> <[email protected]> wrote:
> 
>> 
>> Hi there,
>> 
>> Several URLs in my crawldb are not being fetched, and I'm baffled as to
>> why.
>> 
>> It seems that regex-urlfilter.txt is not blocking them, but as I
>> understand it, this filter stops URLs from getting /into/ the crawldb and
>> is not used by the generate step for selection.
>> 
>> I keep a log of the fetch operation, and the missing URLs are never
>> mentioned, so it appears that they are not being selected by the generate
>> step.
>> 
>> Does anyone have any recommendations for how I can debug this further? Or
>> for what could cause the generate step to consistently skip a URL?
>> 
>> An example is this readdb -dump fragment:
>> 
>> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Oct 06 17:55:00 BST 2010
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 7200 seconds (0 days)
>> Score: 8.278514E-11
>> Signature: null
>> Metadata:
>> 
>> Thanks,
>> 
>> Mike
>> 
>> 
> 

--
Mike Pountney

Information Systems Manager, Semantico Limited
<mailto:[email protected]> <tel:+44 1273 358 209>
Registered in England and Wales no. 03841410, VAT no. GB-744614334.
Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.

Check out all our latest news and thinking on our blog:
- http://blogs.semantico.com/discovery-blog/

Follow Semantico on Twitter:
- http://twitter.com/semantico
