Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
I'd like to add that the md5 of the *uncompressed* file is cd4eee6d3d745ce716db2931c160ee35. That's what I got from both the uncompressed 7z and the uncompressed bz2. They matched, whew.

Uncompressing and md5ing the bz2 took well over a week. Uncompressing and md5ing the 7z took less than a day.

On Mon, Mar 29, 2010 at 8:16 PM, Tomasz Finc tf...@wikimedia.org wrote:
> You can find all the md5sums at
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
>
> --tomasz
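For anyone repeating this comparison, it can be done as two streaming pipelines so the multi-terabyte XML never has to be written to disk. This is only a sketch, not the exact commands used above; it assumes the p7zip command-line tool (7z), bzcat, and GNU md5sum are available:

    # Checksum the uncompressed contents of each archive without storing it.
    7z e -so enwiki-20100130-pages-meta-history.xml.7z | md5sum
    bzcat enwiki-20100130-pages-meta-history.xml.bz2   | md5sum
    # If both pipelines print the same digest, the 7z and bz2 archives
    # contain byte-identical XML.

The 7z pipeline is also by far the faster of the two, for the reason reported above: LZMA decompresses much more quickly than bzip2.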
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
On Thu, Apr 8, 2010 at 7:34 PM, Q overlo...@gmail.com wrote:
> On 4/8/2010 4:28 PM, Anthony wrote:
>> I'd like to add that the md5 of the *uncompressed* file is
>> cd4eee6d3d745ce716db2931c160ee35. That's what I got from both the
>> uncompressed 7z and the uncompressed bz2. They matched, whew.
>> Uncompressing and md5ing the bz2 took well over a week. Uncompressing and
>> md5ing the 7z took less than a day.
>
> Dumping and parsing large XML files came up at work today which made me
> think of this: how big exactly is the uncompressed file?

5.34 terabytes was the figure I got. "7z l enwiki-20100130-pages-meta-history.xml.7z" reports an uncompressed size of 5873134833455. I assume that's bytes; converting 5873134833455 bytes to terabytes (binary terabytes, i.e. tebibytes) gives 5.34158501.
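The figure is easy to re-derive yourself. A small sketch (it assumes the p7zip 7z binary and the published 20100130 file name):

    # List the archive; the size column is the uncompressed size in bytes.
    7z l enwiki-20100130-pages-meta-history.xml.7z
    # 5873134833455 / 1024^4 ≈ 5.34 TiB  (the "5.34 terabytes" above)
    # 5873134833455 / 1000^4 ≈ 5.87 TB   (decimal terabytes)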
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Got an md5sum?

On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc tf...@wikimedia.org wrote:
> I love lzma compression.
>
> enwiki-20100130-pages-meta-history.xml.bz2  280.3 GB
> enwiki-20100130-pages-meta-history.xml.7z    31.9 GB
>
> Download at http://tinyurl.com/yeelbse
>
> Enjoy!
>
> --tomasz
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
You can find all the md5sums at
http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt

--tomasz

Anthony wrote:
> Got an md5sum?
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
On Mar 19 2010, Platonides wrote:
> Zeyi wrote:
>> Hi, Firstly, congratulations on this! As I know, it has taken a long time!
>> May I ask a small question: what is the difference between the current
>> dump and the history dump? I know the current one only includes current
>> edits, and the history one has all edits, as the introduction said.
>
> You have explained the difference perfectly :)
>
>> More specifically, how does the difference show up for one article? Can
>> anyone explain it in detail, please?
>
> It doesn't show the article. It's just a really, really large bunch of
> wikitext separated by XML tags. It is shown by a tool. If you just want to
> read the articles, you don't need histories.

What I mean is: if the current dump shows there are 30 edits under a particular article name, and the history dump shows there are 100 edits under the same article, what is the difference between those 30 and 100? If I say that the current dump can explain how the current articles were built up from different edits, is that correct?

>> Additionally, why do all the Wikipedia statistics only use the history
>> dump for analysis?
>
> Because they study things like changes made to articles, number of edits
> over time...
>
>> Thanks very much!
>
> You're welcome.
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
On 03/19/2010 11:02 AM, zh...@york.ac.uk wrote:
> What I mean is: if the current dump shows there are 30 edits under a
> particular article name, and the history dump shows there are 100 edits
> under the same article, what is the difference between those 30 and 100?

The current dump shows 1 edit for each article, only the most recent at the time that article was processed. The history dump shows all edits for all articles.

Conrad
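For concreteness, one way to see this difference for yourself is to count <revision> elements in each file. This is only an illustrative sketch: it assumes the standard file names of the 20100130 dump run and that each revision opening tag sits on its own line, as in the usual dump layout.

    # "Current" dump: at most one <revision> per <page>, so the count is
    # roughly the number of pages.
    bzcat enwiki-20100130-pages-articles.xml.bz2 | grep -c '<revision>'

    # Full-history dump: every stored revision of every page, so the count
    # is roughly the total number of edits ever made.
    bzcat enwiki-20100130-pages-meta-history.xml.bz2 | grep -c '<revision>'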
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
On Mar 19 2010, Conrad Irwin wrote:
> On 03/19/2010 11:02 AM, zh...@york.ac.uk wrote:
>> What I mean is: if the current dump shows there are 30 edits under a
>> particular article name, and the history dump shows there are 100 edits
>> under the same article, what is the difference between those 30 and 100?
>
> The current dump shows 1 edit for each article, only the most recent at the
> time that article was processed. The history dump shows all edits for all
> articles.

Wow, can you confirm that only the latest edit is included in the current dump? So the current dump isn't meaningful in terms of statistics?

thanks,
Zeyi
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Hi,

Firstly, congratulations on this! As I know, it has taken a long time!

May I ask a small question: what is the difference between the current dump and the history dump? I know the current one only includes current edits, and the history one has all edits, as the introduction said. More specifically, how does the difference show up for one article? Can anyone explain it in detail, please?

Additionally, why do all the Wikipedia statistics only use the history dump for analysis?

Thanks very much!

Zeyi
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Zeyi wrote:
> Hi, Firstly, congratulations on this! As I know, it has taken a long time!
> May I ask a small question: what is the difference between the current dump
> and the history dump? I know the current one only includes current edits,
> and the history one has all edits, as the introduction said.

You have explained the difference perfectly :)

> More specifically, how does the difference show up for one article? Can
> anyone explain it in detail, please?

It doesn't show the article. It's just a really, really large bunch of wikitext separated by XML tags. It is shown by a tool. If you just want to read the articles, you don't need histories.

> Additionally, why do all the Wikipedia statistics only use the history dump
> for analysis?

Because they study things like changes made to articles, number of edits over time...

> Thanks very much!

You're welcome.
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
On Wed, 17 Mar 2010, Platonides platoni...@gmail.com wrote:
> Jamie Morken wrote:
>> Also I wonder if it is possible to convert from 7z to bz2 without having
>> to make the 5469GB file first? If this can be done then having only 7z
>> files would be fine, as the bz2 file could be created with a normal PC
>> (i.e. one without a 6TB+ hard drive). This would be a good solution, but
>> I'm not sure if it can be done. If it could, though, we might as well get
>> rid of all the large wikis' bz2 pages-meta-history files!
>
> Sure.
>
> 7z e -so DatabaseDump.7z | bzip2 -9 > DatabaseDump.xml.bz2

Hi,

Thanks for the info, I think 7z is the way to go :)

cheers,
Jamie
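Building on that one-liner, the conversion and the uncompressed checksum can even happen in a single pass, so nothing close to the 5469GB intermediate file ever touches disk. This is a sketch only, not something from the thread; it assumes bash (for process substitution), p7zip, bzip2, and GNU md5sum:

    # Re-create the bz2 from the 7z and checksum the uncompressed stream in
    # the same pass, without writing the intermediate XML to disk.
    7z e -so enwiki-20100130-pages-meta-history.xml.7z \
      | tee >(md5sum > uncompressed.md5) \
      | bzip2 -9 > enwiki-20100130-pages-meta-history.xml.bz2

Note that a bz2 rebuilt this way may not be byte-identical to the one Wikimedia published (block boundaries and compressor settings can differ), so the uncompressed md5 is the safer thing to compare against the published checksums.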
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
--- On Tue, 16 Mar 2010, Kevin Webb kpw...@gmail.com wrote:
> I just managed to finish decompression. That took about 54 hours on an EC2
> 2.5x unit CPU. The final data size is 5469GB.
>
> As the process just finished I haven't been able to check the integrity of
> the XML; however, the bzip stream itself appears to be good.
>
> As was mentioned previously, it would be great if you could compress future
> archives using pbzip2 to allow for parallel decompression. As I understand
> it, the pbzip2 files are backward compatible with all existing bzip2
> utilities.

Yes, they are :-).

Regards,
F.

> Thanks again for all your work on this!
>
> Kevin
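On the pbzip2 point, a quick sketch of what that change would buy (hypothetical commands, assuming pbzip2 is installed; -p sets the number of cores):

    # Compress with pbzip2 so the archive is split into many independent
    # bzip2 streams (still readable by any ordinary bzip2/bzcat).
    pbzip2 -9 -p8 -c pages-meta-history.xml > pages-meta-history.xml.bz2

    # Decompression only runs in parallel for archives that were *made* with
    # pbzip2; a single-stream bzip2 file still decompresses on one core.
    pbzip2 -d -p8 pages-meta-history.xml.bz2

The independent blocks are also what makes pbzip2 output slightly larger than single-stream bzip2 output, which is the trade-off mentioned later in the thread.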
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Let alone that, for some of us outside the USA (even with a good connection to the EU research network) the download process takes, so to say, slightly more time than expected (and is prone to errors as the file gets larger). So another +1 to replace bzip2 with 7zip.

F.

--- On Tue, 16 Mar 2010, Kevin Webb kpw...@gmail.com wrote:
> Yeah, same here. I'm totally fine with replacing bzip with 7zip as the
> primary format for the dumps. Seems like it solves the space and speed
> problems together...
>
> I just did a quick benchmark and got a 7x improvement on decompression
> speed using 7zip over bzip on a single core, based on actual dump data.
>
> kpw
>
> On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik levmuch...@gmail.com wrote:
>> I am entirely for 7z. In fact, once released, I'll be able to test the XML
>> integrity right away - I process the data on the fly, without unpacking it
>> first.
>>
>> On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc tf...@wikimedia.org wrote:
>>> Kevin Webb wrote:
>>>> As was mentioned previously, it would be great if you could compress
>>>> future archives using pbzip2 to allow for parallel decompression. As I
>>>> understand it, the pbzip2 files are backward compatible with all
>>>> existing bzip2 utilities.
>>>
>>> Looks like the trade-off is slightly larger files due to pbzip2's
>>> algorithm for individual chunking. We'd have to change the buildFilters
>>> function in http://tinyurl.com/yjun6n5 and install the new binary. Ubuntu
>>> already has it in 8.04 LTS, making it easy. Any takers for the change?
>>>
>>> I'd also like to gauge everyone's opinion on moving away from the large
>>> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
>>> space doing it at a slightly larger cost during compression.
>>> Decompression of 7z these days is wicked fast.
>>>
>>> Let us know.
>>>
>>> --tomasz
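The kind of on-the-fly integrity check Lev describes — validating the XML while decompressing, without ever unpacking to disk — can be sketched roughly like this. This is not his actual tooling, just an illustration assuming p7zip and libxml2's xmllint, whose streaming reader keeps memory use flat even on a multi-terabyte document:

    # Stream the 7z straight into a streaming well-formedness check; nothing
    # is written to disk and memory use stays roughly constant.
    7z e -so enwiki-20100130-pages-meta-history.xml.7z \
      | xmllint --stream --noout -
    # An exit status of 0 means the XML is well formed end to end.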
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Tomasz Finc wrote:
> New full history en wiki snapshot is hot off the presses! It's currently
> being checksummed, which will take a while for 280GB+ of compressed data,
> but for those brave souls willing to test, please grab it from
>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>
> and give us feedback about its quality. This run took just over a month and
> gained a huge speed-up after Tim's work on re-compressing ES. If we see no
> hiccups with this data snapshot, I'll start mirroring it to other locations
> (Internet Archive, Amazon public data sets, etc).
>
> For those not familiar, the last successful run that we've seen of this
> data goes all the way back to 2008-10-03. That's over 1.5 years of people
> waiting to get access to these data bits. I'm excited to say that we seem
> to have it :)

So now that we've had it for a couple of days... can I get a status report from someone about its quality? Even if you had no issues, please let us know so that we can start mirroring.

--tomasz
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
--- On Thu, 11 Mar 2010, Tomasz Finc tf...@wikimedia.org wrote:
> New full history en wiki snapshot is hot off the presses! It's currently
> being checksummed, which will take a while for 280GB+ of compressed data,
> but for those brave souls willing to test, please grab it from
>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>
> and give us feedback about its quality. This run took just over a month and
> gained a huge speed-up after Tim's work on re-compressing ES. If we see no
> hiccups with this data snapshot, I'll start mirroring it to other locations
> (Internet Archive, Amazon public data sets, etc).

Really good news :-)

> For those not familiar, the last successful run that we've seen of this
> data goes all the way back to 2008-10-03. That's over 1.5 years of people
> waiting to get access to these data bits.

In fact, something went wrong with that one as well. The last valid full dump (afaik) was 2008-03-03, containing data up to early January 2008.

> I'm excited to say that we seem to have it :)

Let's cross our fingers.

Congrats for the great job, guys!!

Felipe

> --tomasz
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Tomasz Finc wrote:
> Brian J Mingus wrote:
>> On Wed, Mar 10, 2010 at 8:54 PM, Tomasz Finc tf...@wikimedia.org wrote:
>>> Yup, that's the one. If you have a fast upload pipe then I'm more than
>>> happy to set up space for it. Otherwise it should be arriving in our
>>> snail mail after a couple of days.
>>>
>>> -tomasz
>>
>> Anyone may download the file from me here:
>> http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
>>
>> The md5sum is: 20a201afc05a4e5f2f6c3b9b7afa225c enwiki-20080103-pages-meta-history.xml.7z
>> The file size is: 18522193111 (~18 gigabytes)
>>
>> I'm sure you will find my pipe fat enough.. ;-)
>
> That seems way too tiny to be the real thing.
>
> --tomasz

I also have a copy of it. The md5sum and file size are the right ones for the file that was published on downloads.wikimedia.org. I have the .sql.gz files too, if you want them.
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Thankfully, due to an awesome volunteer we'll be able to get that 2008 snapshot into our archive. I'll mail out when it shows up in our snail mail.

--tomasz

Erik Zachte wrote:
> I'm thrilled. Big thanks to Tim and Tomasz for pulling this off. For the
> record, the 2008-10-03 dump existed for a short while only. It evaporated
> before wikistats and many others could parse it, so now we can finally
> catch up on 3.5 (!) years of backlog.
>
> Erik Zachte
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Yup, that's the one. If you have a fast upload pipe then I'm more than happy to set up space for it. Otherwise it should be arriving in our snail mail after a couple of days.

-tomasz

Kevin Webb wrote:
> Many thanks to everyone involved.
>
> Also, in case it's of use to anyone, I have a copy of the
> enwiki-20080103-pages-meta-history.xml dump in 7z form. Is that the backup
> that's being referred to, or is it in fact 20081003?
>
> kpw
Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Brian J Mingus wrote:
> On Wed, Mar 10, 2010 at 8:54 PM, Tomasz Finc tf...@wikimedia.org wrote:
>> Yup, that's the one. If you have a fast upload pipe then I'm more than
>> happy to set up space for it. Otherwise it should be arriving in our snail
>> mail after a couple of days.
>>
>> -tomasz
>
> Anyone may download the file from me here:
> http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
>
> The md5sum is: 20a201afc05a4e5f2f6c3b9b7afa225c enwiki-20080103-pages-meta-history.xml.7z
> The file size is: 18522193111 (~18 gigabytes)
>
> I'm sure you will find my pipe fat enough.. ;-)

That seems way too tiny to be the real thing.

--tomasz