Hello list,

Has this issue been solved?

I am seeing the same problem: URLs that are listed in my urls list and 
allowed by regex-urlfilter.txt are not being fetched. My log never mentions 
these URLs, and the crawl consistently ignores them.

Here is a selection of the URLs that are not being fetched:

http://www.scotland.gov.uk/Home

http://www.scotland.gov.uk/Topics

Thank you

Lewis




-----Original Message-----
From: Sonal Goyal [mailto:[email protected]]
Sent: 12 October 2010 11:17
To: [email protected]
Subject: Re: Issues with certain URLs not being fetched.

Mike, which URLs get fetched depends on each URL's score. Higher-scoring
URLs are selected first.
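
As a rough illustration of the point above (a minimal sketch, not Nutch's
actual generator code): assuming generate is limited to a fixed number of
URLs per round, for example through a topN setting, selection by score looks
roughly like this. The helper name select_top_n, the example.com URLs and
their scores are made up for illustration; only the last URL and its score
come from Mike's dump.

```python
import heapq


def select_top_n(crawldb, top_n):
    """Return the top_n URLs with the highest scores (hypothetical helper)."""
    # Iterating a dict yields its keys (the URLs); crawldb.get maps each
    # URL to its score, so nlargest orders URLs by score, highest first.
    return heapq.nlargest(top_n, crawldb, key=crawldb.get)


# Illustrative crawldb: the first two entries are invented,
# the third uses the URL and score from the readdb dump in this thread.
crawldb = {
    "http://example.com/a": 1.0,
    "http://example.com/b": 0.25,
    "http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/": 8.278514e-11,
}

selected = select_top_n(crawldb, 2)
for url in selected:
    print(url, crawldb[url])
```

With a cap of 2, the very low-scoring URL is never picked as long as two
higher-scoring URLs are waiting, which would match the symptom of a URL
being skipped round after round.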

Thanks and Regards,
Sonal

Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney
<[email protected]> wrote:

>
> Hi there,
>
> Several URLs in my crawldb are not being fetched, and I'm baffled as to
> why.
>
> It seems that regex-urlfilter.txt is not blocking them, but as I
> understand it this filter stops URLs from getting /into/ the crawldb, and
> is not used by the generate step for selection.
>
> I keep a log of the fetch operation, and the missing URLs are never
> mentioned, so it appears that they are not being selected by the generate
> step.
>
> Does anyone have any recommendations for how I can debug this further? Or
> what could cause the generate step to consistently skip a URL?
>
> An example is this readdb -dump fragment:
>
> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
>  Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 06 17:55:00 BST 2010
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 7200 seconds (0 days)
> Score: 8.278514E-11
> Signature: null
> Metadata:
>
> Thanks,
>
> Mike
>
>
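
One way to check many such records at once is to parse the readdb -dump
text and look at the status and score fields. The following is a minimal
sketch that assumes the dump layout is exactly as in the fragment quoted
above (a URL line followed by "Key: value" lines); parse_record is a
hypothetical helper, not part of Nutch.

```python
DUMP = """\
http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 06 17:55:00 BST 2010
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 7200 seconds (0 days)
Score: 8.278514E-11
Signature: null
Metadata:
"""


def parse_record(text):
    """Parse one dump record into a dict (hypothetical helper)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    record = {"url": lines[0]}  # first line is assumed to be the URL
    for ln in lines[1:]:
        # Split on the first colon only, so timestamps keep their colons.
        key, _, value = ln.partition(":")
        record[key.strip()] = value.strip()
    return record


record = parse_record(DUMP)
score = float(record["Score"])
is_unfetched = record["Status"].endswith("(db_unfetched)")
print(record["url"], "unfetched:", is_unfetched, "score:", score)
```

Sorting all unfetched records by score this way shows where a given URL
sits relative to the others that are competing for selection.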
