Brion,

We are having to resort to crawling en.wikipedia.org while we wait for regular dumps. What is the minimum crawl delay we can get away with? I figure that with a 1-second delay we could fetch the 2+ million articles in about a month (2,000,000 seconds is roughly 23 days).
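For concreteness, here is a minimal sketch of the kind of throttled crawl I have in mind, assuming Python 3's standard library only; the User-Agent string, contact address, and article titles are hypothetical placeholders:

import time
import urllib.robotparser
import urllib.request

BASE = "https://en.wikipedia.org"
# Hypothetical identifying User-Agent; substitute real contact info.
USER_AGENT = "ExampleResearchCrawler/0.1 (contact: [email protected])"

# Honor robots.txt; fall back to a 1-second delay if no Crawl-delay is given.
rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
rp.read()
delay = rp.crawl_delay(USER_AGENT) or 1.0

def fetch_article(title):
    # Fetch one article, skipping anything robots.txt disallows.
    url = BASE + "/wiki/" + title
    if not rp.can_fetch(USER_AGENT, url):
        raise PermissionError(url)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

for title in ["Example", "Example_2"]:  # placeholder article titles
    page = fetch_article(title)
    time.sleep(delay)  # 2,000,000 articles * 1 s ~= 23 days of wall time

At one request per second the sleep alone dominates the schedule, which is where the "about a month" figure comes from.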
I know crawling is discouraged, but a lot of parties still seem to do it; after looking at robots.txt I have to assume that is how Google et al. are able to keep up to date. Are there private data feeds? I noticed a wg_enwiki dump listed.

Christian

On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> That would be great. I second this notion wholeheartedly.
>
> On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
>
>> "Brion Vibber" <[email protected]> wrote in message
>> news:[email protected]...
>>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber <[email protected]> wrote:
>>>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>>>>> The way I see it, what we need is to get a really powerful server
>>>>> Nope, it's a software architecture issue. We'll restart it with the
>>>>> new arch when it's ready to go.
>>>> The simplest solution is just to kill the current dump job if you
>>>> have faith that a new architecture can be put in place in less than
>>>> a year.
>>>
>>> We'll probably do that.
>>>
>>> -- brion
>>
>> FWIW, I'll add my vote for aborting the current dump *now* if we don't
>> expect it ever to actually be finished, so we can at least get a fresh
>> dump of the current pages.
>>
>> Russ

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
