Hi Markus,
This is a tricky one. I have personally had terrible headaches with
similar problems, where an update to a piece of legislation completely
changes its URL, which makes the task of provenance hellishly
complex... We addressed this by ensuring that legislation URIs stay
consistent regardless of changes to textual content within any given
artifact.
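
As a rough sketch of that idea (the function names and the URI layout
below are made up for illustration, not our actual scheme): the
identifier is built only from metadata that never changes, and the text
is fingerprinted separately, so an amendment produces a new version
rather than a new URI.

```python
import hashlib


def stable_uri(jurisdiction: str, year: int, number: int) -> str:
    """Build the identifier from stable metadata only (hypothetical layout)."""
    return f"/id/{jurisdiction}/{year}/{number}"


def content_version(text: str) -> str:
    """Fingerprint the text separately from the URI (truncated SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


# The URI stays the same across an amendment; only the version differs.
uri = stable_uri("xx", 2012, 7)
v1 = content_version("Original text of the act.")
v2 = content_version("Amended text of the act.")
```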

W.r.t. your specific problem, it is really outwith your control how and
when the URLs change (as you've already described), and for that I am
struggling to provide you with any reasonable input... sorry.
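
For anyone else following the thread, the effect can be illustrated
with a toy model of an adaptive schedule. This is only a sketch: the
function name, the rates and the interval bounds are assumptions chosen
for illustration, not the settings of any particular crawler. A page
that looks modified on every fetch, as the archive pages do, is driven
down to the minimum interval and stays there.

```python
def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=3600.0, max_interval=30 * 86400.0):
    """Shrink the re-fetch interval if the page changed, grow it otherwise."""
    if modified:
        interval *= (1.0 - dec_rate)
    else:
        interval *= (1.0 + inc_rate)
    # Keep the result within the configured bounds (in seconds).
    return max(min_interval, min(max_interval, interval))


# An archive page that appears modified on every single fetch.
interval = 7 * 86400.0  # start at one week
for _ in range(30):
    interval = next_interval(interval, modified=True)
# interval has now collapsed to min_interval
```
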
Lewis

On Thu, Jul 5, 2012 at 8:51 AM, Markus Jelsma
<[email protected]> wrote:
> Any ideas?
>
>
>
> -----Original message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: Mon 02-Jul-2012 23:05
>> To: [email protected]
>> Subject: Adaptive scheduling, but different
>>
>> Hi,
>>
>> We use an adaptive scheduler for our crawl. This works fine for most cases,
>> but a specific type of page is crawled more often than it should be. These are
>> usually news or article archives such as news/archive/12345. Most websites
>> generate these pages dynamically. The problem is that whenever a new item is
>> posted, all news/archive/* pages become modified: every article or item
>> shifts one position, and thousands of URLs change as a result.
>>
>> The problem of adaptive scheduling for these pages should be obvious by now.
>> I have given it some thought over the past few weeks but I haven't figured out a
>> generic solution just yet, so any advice or out-of-the-box ideas are very much
>> appreciated!
>>
>> Thanks
>> Markus
>>



-- 
Lewis
