I'm copying the sugar list, since having an updated English Wikipedia slice would be awesome, and others may want to get involved.
SJ

ps - The 2% stat referenced below is that Andrew Cates finds that at any
given moment there is a 2% chance of some sort of vandalism/error/flaw in
an article (and a 0.5% chance of same when looking at the last trusted
editor's contribs), making it helpful to have specific revisionids for
articles in a snapshot.

On Sun, Sep 7, 2008 at 7:11 PM, Samuel Klein <[EMAIL PROTECTED]> wrote:
> Where is the code for this? Lede-detection code is a priority for me,
> and I'd like to work on it. It should be easy to sense the start of
> the first H2 and drop the rest of the article.
>
> Is there some way to estimate the size impact on the whole of adding
> one template (given how often it is referenced)? If we could rank
> templates by their footprint, it would be easier to "fill up" a space
> allocation for them, as we do for images.
>
> SJ
>
> On Sun, Sep 7, 2008 at 7:02 PM, Chris Ball <[EMAIL PROTECTED]> wrote:
>> Hi SJ,
>>
>> > To Andrew -- thank you. The 2% vandalism stat is very valuable!
>> > CJB, would it be possible to grab revision ids from this page,
>> > wherever there is a simple newline/title/oldid= ?
>>
>> Possible, yeah, but I'm not sure it'll be the best use of the time I
>> have remaining to work on this once the work-week starts up again and I
>> get back to blockers for the release. We'd have to switch over from the
>> "current versions" archive to the "all versions" archive, and then write
>> scripts to create a new archive with the versions we want.
>>
>> > Other replies inline: I am working on an article list here:
>>
>> > http://en.wikipedia.org/wiki/User:Sj/en-g1g1#D
>>
>> > Agreed. It seems that removing extraneous references to Harry
>> > Potter frees up another thousand articles or so...
>>
>> Can't tell whether this was humor. ;-)
>>
>> > en:wp articles tend to grow without shrinking. Like you, I'm
>> > worried about not having enough articles to make a valuable
>> > reference work, especially in the sense of having a solid network
>> > of internal links. I also see in this snapshot a lot of articles
>> > that are interesting but don't need to be nearly so detailed for
>> > our audience (and may simply bore).
>>
>> > Can we try 6000 articles + 21000 ledes, to include every article in
>> > Martin's list?
>>
>> In principle, yeah, but like the revisions work it requires new work
>> for detecting leads and putting them into their own articles. My
>> gut feeling is that this work just isn't important enough for this
>> particular snapshot where our users have access to the net if they
>> need it. (Given time constraints.)
>>
>> > I'm also happy with making this larger than 100MB for g1g1, perhaps
>> > even 150MB. In the future our goal can be to expand coverage while
>> > reducing size... with less time pressure.
>>
>> Absolutely.
>>
>> > We definitely need a template blacklist again. How about the top
>> > 5000, excluding certain template categories?
>>
>> Another 5000 (small) articles is going to have a big impact on disk
>> space, I think. We'll see how it looks.
>>
>> Oh, Mad reminded me that you wanted to see a list of the 2k articles
>> that are in the 10k slice and not the 8k slice. Here it is:
>>
>> http://dev.laptop.org/~cjb/enwiki/8k-10k-diff
>>
>> - Chris.
>> --
>> Chris Ball <[EMAIL PROTECTED]>
>>
>
_______________________________________________
Sugar mailing list
[email protected]
http://lists.laptop.org/listinfo/sugar
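
For anyone picking up the lede-detection idea from the thread (sense the start of the first H2 and drop the rest of the article), here is a minimal sketch in Python. It assumes the articles are available as raw wikitext, where a level-2 heading is a line of the form "== Title ==". The function name extract_lede and the regular expression are illustrative only and are not taken from the existing snapshot scripts.

```python
import re

# Assumption: input is raw wikitext, where an H2 is a line like "== Title ==".
# Lines starting with "===" (H3 and deeper) are deliberately not matched.
H2_RE = re.compile(r"^==[^=\n][^\n]*==[ \t]*$", re.MULTILINE)


def extract_lede(wikitext: str) -> str:
    """Return the text before the first level-2 heading.

    If the article has no level-2 heading, the whole text is returned.
    Trailing whitespace is stripped in both cases.
    """
    match = H2_RE.search(wikitext)
    lede = wikitext if match is None else wikitext[:match.start()]
    return lede.rstrip()


if __name__ == "__main__":
    sample = "Intro paragraph.\n\n== History ==\nLong body text.\n"
    print(extract_lede(sample))
```

This only handles the plain "== Heading ==" case; templates or infoboxes that appear before the first heading would stay in the lede and would need separate handling.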

