Mike, the fetch will be based on the score of the URL. Higher-scoring URLs are selected first by the generate step.
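One way to check where a URL ranks is to sort a `readdb -dump` by score; URLs below the generate `-topN` cutoff will sit at the bottom of the list. A minimal sketch, assuming a plain-text dump where the URL appears on the `Version:` line and the score on a `Score:` line (the sample records and paths below are made up for illustration):

```shell
# Hypothetical input; a real dump would come from something like:
#   bin/nutch readdb crawl/crawldb -dump dumpdir
# Two sample records are inlined so this sketch is self-contained.
cat > /tmp/crawldb_dump.txt <<'EOF'
http://example.com/high	Version: 7
Status: 1 (db_unfetched)
Score: 1.5
http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/	Version: 7
Status: 1 (db_unfetched)
Score: 8.278514E-11
EOF

# Pair each URL with its Score, then sort descending by score.
# sort -g understands scientific notation like 8.278514E-11.
awk '/Version:/ { url = $1 }
     /^Score:/  { printf "%s\t%s\n", $2, url }' /tmp/crawldb_dump.txt |
  sort -g -r
```

With a very small score such as 8.278514E-11, the URL lands at the end of the ranking, so it will only be generated once the higher-scoring URLs are exhausted or `-topN` is raised.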
Thanks and Regards,
Sonal
Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal

On Tue, Oct 12, 2010 at 3:40 PM, Mike Pountney <[email protected]> wrote:
>
> Hi there,
>
> Several URLs in my crawldb are not being fetched, and I'm baffled as to
> why.
>
> It seems like the regex-urlfilter.txt is not blocking them, but as I
> understand it this filter stops URLs from getting /into/ the crawldb, and
> is not used by the generate step for selection.
>
> I keep a log of the fetch operation, and the missing URLs are never
> mentioned, so it appears that they are not being selected by the generate
> step.
>
> Does anyone have any recommendations for how I can debug this further? Or
> what can cause the generate step to consistently not select a URL?
>
> An example is this readdb -dump fragment:
>
> http://dom.semantico.net/blog/archives/2008/12/10/sql-downgrade/
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 06 17:55:00 BST 2010
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 7200 seconds (0 days)
> Score: 8.278514E-11
> Signature: null
> Metadata:
>
> Thanks,
>
> Mike

