Brion,

We are having to resort to crawling en.wikipedia.org while we wait for
regular dumps.
What is the minimum crawl delay we can get away with? I figure that with
a 1-second delay we'd be able to crawl the 2+ million articles in about
a month.
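
Back-of-the-envelope, in case my arithmetic is off (the ~2.1 million
article count is a rough assumption on my part):

    # rough crawl-time estimate at one request per second
    articles = 2100000         # assumed current article count
    delay_s = 1.0              # seconds between requests
    days = articles * delay_s / 86400
    print(round(days, 1))      # ~24.3 days, i.e. roughly a month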

I know crawling is discouraged, but judging from robots.txt it seems a
lot of parties still do it.
I have to assume that is how Google et al. are able to keep up to date.
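
For what it's worth, here's roughly how I spot-checked robots.txt (a
minimal sketch using Python's standard robots.txt parser; the
"StormCrawler" user-agent string is just a placeholder):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()
    # is an arbitrary article fetchable for this (placeholder) user agent?
    print(rp.can_fetch("StormCrawler", "https://en.wikipedia.org/wiki/Main_Page"))
    # any Crawl-delay directive for it (None if not specified)
    print(rp.crawl_delay("StormCrawler"))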

Are there private data feeds? I noticed a wg_enwiki dump listed.

Christian

On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:

> That would be great.  I second this notion wholeheartedly.
>
>
> On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
>
>> "Brion Vibber" <[email protected]> wrote in message
>> news:[email protected]...
>>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber<[email protected]>
>>>> wrote:
>>>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>>>>> The way I see it, what we need is to get a really powerful server
>>>>> Nope, it's a software architecture issue. We'll restart it with
>>>>> the new
>>>>> arch when it's ready to go.
>>>> The simplest solution is just to kill the current dump job if you
>>>> have
>>>> faith that a new architecture can be put in place in less than a
>>>> year.
>>>
>>> We'll probably do that.
>>>
>>> -- brion
>>
>> FWIW, I'll add my vote for aborting the current dump *now* if we don't
>> expect it ever to actually be finished, so we can at least get a fresh
>> dump of the current pages.
>>
>> Russ
>


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
