Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-04-08 Thread Anthony
I'd like to add that the md5 of the *uncompressed* file is
cd4eee6d3d745ce716db2931c160ee35 .  That's what I got from both the
uncompressed 7z and the uncompressed bz2.  They matched, whew.
Uncompressing and md5ing the bz2 took well over a week.  Uncompressing and
md5ing the 7z took less than a day.
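(Side note for anyone repeating this: the uncompressed checksum can be computed
without ever writing the 5+ TB file to disk. A minimal sketch, assuming the
stock bzip2, p7zip and GNU coreutils tools:

  bzip2 -dc enwiki-20100130-pages-meta-history.xml.bz2 | md5sum
  7z e -so enwiki-20100130-pages-meta-history.xml.7z | md5sum

Both pipelines stream the decompressed XML straight into md5sum, so the only
cost is CPU time plus one read of the compressed file.)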

On Mon, Mar 29, 2010 at 8:16 PM, Tomasz Finc tf...@wikimedia.org wrote:

 You can find all the md5sums at

 http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt

 --tomasz

 Anthony wrote:

 Got an md5sum?


 On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc tf...@wikimedia.org wrote:

I love lzma compression.

enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB

enwiki-20100130-pages-meta-history.xml.7z 31.9 GB

Download at http://tinyurl.com/yeelbse

Enjoy!

--tomasz

Tomasz Finc wrote:
  Tomasz Finc wrote:
  New full history en wiki snapshot is hot off the presses!
 
   It's currently being checksummed which will take a while for 280GB+ of
   compressed data but for those brave souls willing to test please grab it
   from
  
   http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
  
   and give us feedback about its quality. This run took just over a month
   and gained a huge speed up after Tim's work on re-compressing ES. If we
   see no hiccups with this data snapshot, I'll start mirroring it to other
   locations (internet archive, amazon public data sets, etc).
  
   For those not familiar, the last successful run that we've seen of this
   data goes all the way back to 2008-10-03. That's over 1.5 years of
   people waiting to get access to these data bits.
 
  I'm excited to say that we seem to have it :)
 
  --tomasz
 
  We now have an md5sum for
 enwiki-20100130-pages-meta-history.xml.bz2.
 
  65677bc275442c7579857cc26b355ded
 
  Please verify against it before filing issues.
 
  --tomasz
 
 


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-04-08 Thread Anthony
On Thu, Apr 8, 2010 at 7:34 PM, Q overlo...@gmail.com wrote:


 On 4/8/2010 4:28 PM, Anthony wrote:
  I'd like to add that the md5 of the *uncompressed* file is
  cd4eee6d3d745ce716db2931c160ee35 .  That's what I got from both the
  uncompressed 7z and the uncompressed bz2.  They matched, whew.
  Uncompressing and md5ing the bz2 took well over a week.  Uncompressing
 and
  md5ing the 7z took less than a day.
 

 Dumping and parsing large XML files came up at work today, which made me
 think of this: how big, exactly, is the uncompressed file?


5.34 terabytes was the figure I got.

7z l enwiki-20100130-pages-meta-history.xml.7z reports an uncompressed size of
5873134833455. I assume that's bytes; converting, that works out to about 5.34
binary terabytes (TiB), or roughly 5.87 TB decimal.
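For anyone who wants to redo that conversion locally, a quick sketch with bc
(assuming GNU bc; dividing by 1024^4 gives binary terabytes, by 1000^4 decimal):

  echo "scale=2; 5873134833455 / 1024^4" | bc   # about 5.34 TiB
  echo "scale=2; 5873134833455 / 1000^4" | bc   # about 5.87 TB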


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-29 Thread Anthony
Got an md5sum?

On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc tf...@wikimedia.org wrote:

 I love lzma compression.

 enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB

 enwiki-20100130-pages-meta-history.xml.7z 31.9 GB

 Download at http://tinyurl.com/yeelbse

 Enjoy!

 --tomasz

 Tomasz Finc wrote:
  Tomasz Finc wrote:
  New full history en wiki snapshot is hot off the presses!
 
  It's currently being checksummed which will take a while for 280GB+ of
  compressed data but for those brave souls willing to test please grab it
  from
 
 
 http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
 
 
  and give us feedback about its quality. This run took just over a month
  and gained a huge speed up after Tim's work on re-compressing ES. If we
  see no hiccups with this data snapshot, I'll start mirroring it to other
  locations (internet archive, amazon public data sets, etc).
 
  For those not familiar, the last successful run that we've seen of this
  data goes all the way back to 2008-10-03. That's over 1.5 years of
  people waiting to get access to these data bits.
 
  I'm excited to say that we seem to have it :)
 
  --tomasz
 
  We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
 
  65677bc275442c7579857cc26b355ded
 
  Please verify against it before filing issues.
 
  --tomasz
 
 


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-29 Thread Tomasz Finc
You can find all the md5sums at

http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt

--tomasz
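(For anyone verifying a download against that list, a small sketch, assuming
GNU md5sum and that the md5sums file uses the usual "checksum  filename" layout:

  wget http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
  grep pages-meta-history.xml.bz2 enwiki-20100130-md5sums.txt | md5sum -c -

Run it in the directory holding the downloaded dump; md5sum -c reads the
filtered line from stdin and checks the named file against its checksum.)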

Anthony wrote:
 Got an md5sum?
 
 On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc tf...@wikimedia.org wrote:
 
 I love lzma compression.
 
 enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB
 
 enwiki-20100130-pages-meta-history.xml.7z 31.9 GB
 
 Download at http://tinyurl.com/yeelbse
 
 Enjoy!
 
 --tomasz
 
 Tomasz Finc wrote:
   Tomasz Finc wrote:
   New full history en wiki snapshot is hot off the presses!
  
   It's currently being checksummed which will take a while for 280GB+ of
   compressed data but for those brave souls willing to test please grab it
   from
  
   http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
  
   and give us feedback about its quality. This run took just over a month
   and gained a huge speed up after Tim's work on re-compressing ES. If we
   see no hiccups with this data snapshot, I'll start mirroring it to other
   locations (internet archive, amazon public data sets, etc).
  
   For those not familiar, the last successful run that we've seen of this
   data goes all the way back to 2008-10-03. That's over 1.5 years of
   people waiting to get access to these data bits.
  
   I'm excited to say that we seem to have it :)
  
   --tomasz
  
   We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
  
   65677bc275442c7579857cc26b355ded
  
   Please verify against it before filing issues.
  
   --tomasz
  
  


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-19 Thread zh509
On Mar 19 2010, Platonides wrote:

Zeyi wrote:
 Hi,

 Firstly, congratulations on this! I know it has taken a long time!

 And may I ask a small question: what is the difference between the current dump
 and the history dump? I know the current one only includes current edits, and
 the history one has all edits, as the introduction said.

You have explained the difference perfectly :)

 More specifically, how does the difference show up for one article? Can anyone
 explain it in detail, please?

It doesn't show the article. It's just a really, really large bunch of
wikitext separated by XML tags.
It is shown by a tool. If you just want to read the articles, you don't
need histories.

What I mean is: if the current dump shows there are 30 edits under a
particular article name, and the history dump shows there are 100 edits under
the same article, what is the difference between these 30 and 100?

If I say that the current dump can explain how the current articles were
established from different edits, is that correct?

 Additionally, why do all the statistics of Wikipedia only use the history dump
 for analysis?

Because they study things like changes made to articles, number of edits
over time...

 Thanks very much!

You're welcome.





Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-19 Thread Conrad Irwin

On 03/19/2010 11:02 AM, zh...@york.ac.uk wrote:

 What I mean is: if the current dump shows there are 30 edits under a
 particular article name, and the history dump shows there are 100 edits under
 the same article, what is the difference between these 30 and 100?

The current dump contains one revision for each article: only the most recent
at the time that article was processed. The history dump contains all
revisions of all articles.

Conrad
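To make that concrete, here is a rough sketch of one page as it appears in the
dump XML (element names from the MediaWiki export format; the ID, timestamp and
text are invented for illustration):

  <page>
    <title>Example article</title>
    <revision>
      <id>123456789</id>
      <timestamp>2010-01-28T12:00:00Z</timestamp>
      <text>...wikitext of this revision...</text>
    </revision>
    <!-- pages-meta-current: only the single latest <revision> per page -->
    <!-- pages-meta-history: one <revision> element for every edit ever made -->
  </page>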



Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-19 Thread zh509
On Mar 19 2010, Conrad Irwin wrote:


On 03/19/2010 11:02 AM, zh...@york.ac.uk wrote:

 What I mean is: if the current dump shows there are 30 edits under a
 particular article name, and the history dump shows there are 100 edits
 under the same article, what is the difference between these 30 and 100?

The current dump contains one revision for each article: only the most recent
at the time that article was processed. The history dump contains all
revisions of all articles.

Wow, can you confirm that only the latest edit is included in the current
dump? So the current dump isn't meaningful in terms of statistics?


Conrad
thanks,
Zeyi


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-18 Thread zh509
Hi, 

Firstly, congratulations on this! I know it has taken a long time!

And may I ask a small question: what is the difference between the current dump
and the history dump? I know the current one only includes current edits, and
the history one has all edits, as the introduction said. More specifically, how
does the difference show up for one article? Can anyone explain it in detail,
please?

Additionally, why do all the statistics of Wikipedia only use the history dump
for analysis?

Thanks very much!

Zeyi




Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-18 Thread Platonides
Zeyi wrote:
 Hi,

 Firstly, congratulations on this! I know it has taken a long time!

 And may I ask a small question: what is the difference between the current dump
 and the history dump? I know the current one only includes current edits, and
 the history one has all edits, as the introduction said.

You have explained the difference perfectly :)

 More specifically, how does the difference show up for one article? Can anyone
 explain it in detail, please?

It doesn't show the article. It's just a really, really large bunch of
wikitext separated by XML tags.
It is shown by a tool. If you just want to read the articles, you don't
need histories.

 Additionally, why do all the statistics of Wikipedia only use the history dump
 for analysis?

Because they study things like changes made to articles, number of edits
over time...

 Thanks very much!

You're welcome.





Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-17 Thread Jamie Morken



Date: Wed, 17 Mar 2010 15:15:24 +0100
From: Platonides platoni...@gmail.com
Subject: Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: wikitech-l@lists.wikimedia.org
Message-ID: hnqo49$it...@dough.gmane.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Jamie Morken wrote:
 Also I wonder if it is possible to convert from 7z to bz2 without having
 to make the 5469GB file first?  If this can be done then having only 7z
 files would be fine, as the bz2 file could be created with a normal
 PC (i.e. one without a 6TB+ hard drive).  This would be a good solution,
 but I'm not sure if it can be done.  If it could, though, we might as well
 get rid of all the large wikis' bz2 pages-meta-history files!

Sure.
7z e -so DatabaseDump.7z | bzip2 -9 > DatabaseDump.bz2


Hi,

Thanks for the info, I think 7z is the way to go :)

cheers,
Jamie
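For anyone doing that conversion at scale, a minimal sketch (assuming bash, GNU
coreutils and a single-file .7z archive; the file names are just examples) that
also records the checksum of the uncompressed stream in the same pass:

  7z e -so enwiki-20100130-pages-meta-history.xml.7z \
    | tee >(md5sum > uncompressed.md5) \
    | bzip2 -9 > enwiki-20100130-pages-meta-history.xml.bz2

The >( ... ) process substitution is bash-specific; drop the tee stage if you
only need the bz2 output.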








Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-17 Thread Felipe Ortega


--- On Tue, 16/3/10, Kevin Webb kpw...@gmail.com wrote:

 From: Kevin Webb kpw...@gmail.com
 Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
 To: Tomasz Finc tf...@wikimedia.org
 CC: Wikimedia developers wikitech-l@lists.wikimedia.org, xmldatadumps-admi...@lists.wikimedia.org, xmldatadump...@lists.wikimedia.org
 Date: Tuesday, 16 March 2010, 21:10
 I just managed to finish decompression. That took about 54 hours on an
 EC2 2.5x unit CPU. The final data size is 5469GB.
 
 As the process just finished I haven't been able to check the
 integrity of the XML; however, the bzip stream itself appears to be
 good.
 
 As was mentioned previously, it would be great if you could compress
 future archives using pbzip2 to allow for parallel decompression. As I
 understand it, pbzip2 files are backward compatible with all
 existing bzip2 utilities.
 

Yes, they are :-).

Regards,
F.
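For reference, a minimal sketch of the parallel route suggested above (assuming
pbzip2 is installed; -p sets how many processor cores to use):

  pbzip2 -d -c -p8 enwiki-20100130-pages-meta-history.xml.bz2 | md5sum

Note that pbzip2 only decompresses in parallel when the archive was itself
written by pbzip2; a file produced by plain bzip2 is handled as a single
stream, which is why the request is to change the compression side of the
dump process.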

 Thanks again for all your work on this!
 Kevin
 
 
 On Tue, Mar 16, 2010 at 4:05 PM, Tomasz Finc tf...@wikimedia.org wrote:
  Tomasz Finc wrote:
  New full history en wiki snapshot is hot off the presses!
 
  It's currently being checksummed which will take a while for 280GB+ of
  compressed data but for those brave souls willing to test please grab it
  from
 
  http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
 
  and give us feedback about its quality. This run took just over a month
  and gained a huge speed up after Tim's work on re-compressing ES. If we
  see no hiccups with this data snapshot, I'll start mirroring it to other
  locations (internet archive, amazon public data sets, etc).
 
  For those not familiar, the last successful run that we've seen of this
  data goes all the way back to 2008-10-03. That's over 1.5 years of
  people waiting to get access to these data bits.
 
  I'm excited to say that we seem to have it :)
 
  So now that we've had it for a couple of days .. can I get a status
  report from someone about its quality?
 
  Even if you had no issues please let us know so that we start mirroring.
 
  --tomasz
 


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-17 Thread Felipe Ortega
Let alone that, for some of us outside the USA (and even with a good connection
to the EU research network) the download process takes, so to say, slightly more
time than expected (and is prone to errors as the file gets larger).

So another +1 to replace bzip2 with 7zip.

F. 

--- On Tue, 16/3/10, Kevin Webb kpw...@gmail.com wrote:

 From: Kevin Webb kpw...@gmail.com
 Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
 To: Lev Muchnik levmuch...@gmail.com
 CC: Wikimedia developers wikitech-l@lists.wikimedia.org, xmldatadumps-admi...@lists.wikimedia.org, xmldatadump...@lists.wikimedia.org
 Date: Tuesday, 16 March 2010, 22:35
 Yeah, same here. I'm totally fine with replacing bzip with 7zip as the
 primary format for the dumps. Seems like it solves the space and speed
 problems together...
 
 I just did a quick benchmark and got a 7x improvement on decompression
 speed using 7zip over bzip using a single core, based on actual dump
 data.
 
 kpw
 
 
 
 On Tue, Mar 16, 2010 at 4:54 PM, Lev Muchnik levmuch...@gmail.com wrote:
 
  I am entirely for 7z. In fact, once released, I'll be able to test the XML
  integrity right away - I process the data on the fly, without unpacking it
  first.
 
 
  On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc tf...@wikimedia.org wrote:
 
  Kevin Webb wrote:
   I just managed to finish decompression. That took about 54 hours on an
   EC2 2.5x unit CPU. The final data size is 5469GB.
  
   As the process just finished I haven't been able to check the
   integrity of the XML; however, the bzip stream itself appears to be
   good.
  
   As was mentioned previously, it would be great if you could compress
   future archives using pbzip2 to allow for parallel decompression. As I
   understand it, pbzip2 files are backward compatible with all
   existing bzip2 utilities.
 
  Looks like the trade-off is slightly larger files due to pbzip2's
  algorithm for individual chunking. We'd have to change the buildFilters
  function in http://tinyurl.com/yjun6n5 and install the new binary.
  Ubuntu already has it in 8.04 LTS, making it easy.
 
  Any takers for the change?
 
  I'd also like to gauge everyone's opinion on moving away from the large
  file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
  space doing it at a slightly larger cost during compression.
  Decompression of 7z these days is wicked fast.
 
  Let us know.
 
  --tomasz
 
 
 
 
 
 


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-16 Thread Tomasz Finc
Tomasz Finc wrote:
 New full history en wiki snapshot is hot off the presses!
 
 It's currently being checksummed which will take a while for 280GB+ of 
 compressed data but for those brave souls willing to test please grab it 
 from
 
 http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
 
 and give us feedback about its quality. This run took just over a month 
 and gained a huge speed up after Tim's work on re-compressing ES. If we 
 see no hiccups with this data snapshot, I'll start mirroring it to other 
 locations (internet archive, amazon public data sets, etc).
 
 For those not familiar, the last successful run that we've seen of this 
 data goes all the way back to 2008-10-03. That's over 1.5 years of 
 people waiting to get access to these data bits.
 
 I'm excited to say that we seem to have it :)

So now that we've had it for a couple of days .. can I get a status 
report from someone about its quality?

Even if you had no issues please let us know so that we start mirroring.

--tomasz



Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-11 Thread Felipe Ortega


--- On Thu, 11/3/10, Tomasz Finc tf...@wikimedia.org wrote:

 From: Tomasz Finc tf...@wikimedia.org
 Subject: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
 To: Wikimedia developers wikitech-l@lists.wikimedia.org, xmldatadumps-admi...@lists.wikimedia.org, xmldatadu...@lists.wikimedia.org
 Date: Thursday, 11 March 2010, 04:10
 New full history en wiki snapshot is hot off the presses!
 
 It's currently being checksummed which will take a while for 280GB+ of
 compressed data but for those brave souls willing to test please grab it
 from
 
 http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
 
 and give us feedback about its quality. This run took just over a month
 and gained a huge speed up after Tim's work on re-compressing ES. If we
 see no hiccups with this data snapshot, I'll start mirroring it to other
 locations (internet archive, amazon public data sets, etc).

Really good news :-)

 
 For those not familiar, the last successful run that we've seen of this
 data goes all the way back to 2008-10-03. That's over 1.5 years of
 people waiting to get access to these data bits.
 

In fact, something went wrong with that one, as well. The last valid full dump 
(afaik) was 2008-03-03, containing data up to early January 2008.

 I'm excited to say that we seem to have it :)
 

Let's cross our fingers. Congrats for the great job, guys!!

Felipe

 --tomasz
 


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-11 Thread Platonides
Tomasz Finc wrote:
 Brian J Mingus wrote:

  On Wed, Mar 10, 2010 at 8:54 PM, Tomasz Finc tf...@wikimedia.org wrote:

   Yup, that's the one. If you have a fast upload pipe then I'm more than
   happy to set up space for it. Otherwise it should be arriving in our
   snail mail after a couple of days.

  -tomasz


 Anyone may download the file from me here:

 http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z

 The md5sum is:

 20a201afc05a4e5f2f6c3b9b7afa225c  enwiki-20080103-pages-meta-history.xml.7z

 The file size is:

 18522193111 (~18 gigabytes)

 I'm sure you will find my pipe fat enough..;-)

  That seems way too tiny to be the real thing.

 --tomasz

I also have a copy of it. The md5sum and file size are the right ones for
the file that was published on downloads.wikimedia.org.

I have the .sql.gz files too, if you want them.



Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-10 Thread Tomasz Finc
Thankfully due to an awesome volunteer we'll be able to get that 2008 
snapshot in our archive. I'll mail out when it shows up in our snail mail.

--tomasz

Erik Zachte wrote:
 I'm thrilled. Big thanks to Tim and Tomasz for pulling this off.
 For the record the 2008-10-03 dump existed for a short while only.
  It evaporated before wikistats and many others could parse it,
  so now we can finally catch up on a 3.5 (!) year backlog.
 
 Erik Zachte
 
 -Original Message-
 From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-
 boun...@lists.wikimedia.org] On Behalf Of Tomasz Finc
 Sent: Thursday, March 11, 2010 4:11
 To: Wikimedia developers; xmldatadumps-admi...@lists.wikimedia.org;
 xmldatadu...@lists.wikimedia.org
 Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-
 meta-history.xml.bz2 :D

 New full history en wiki snapshot is hot off the presses!

  It's currently being checksummed which will take a while for 280GB+ of
  compressed data but for those brave souls willing to test please grab it
  from
 
  http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
 
  and give us feedback about its quality. This run took just over a month
  and gained a huge speed up after Tim's work on re-compressing ES. If we
  see no hiccups with this data snapshot, I'll start mirroring it to other
  locations (internet archive, amazon public data sets, etc).

 For those not familiar, the last successful run that we've seen of this
 data goes all the way back to 2008-10-03. That's over 1.5 years of
 people waiting to get access to these data bits.

 I'm excited to say that we seem to have it :)

 --tomasz



Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-10 Thread Tomasz Finc
Yup, that's the one. If you have a fast upload pipe then I'm more than
happy to set up space for it. Otherwise it should be arriving in our
snail mail after a couple of days.

-tomasz

Kevin Webb wrote:
 Many thanks to everyone involved.
 
 Also, in case it's of use to anyone I have a copy of the
 enwiki-20080103-pages-meta-history.xml dump in 7z form. Is that the
 backup that's being referred to, or is it in fact 20081003?
 
 kpw
 
 On Wed, Mar 10, 2010 at 10:20 PM, Tomasz Finc tf...@wikimedia.org wrote:
 Thankfully due to an awesome volunteer we'll be able to get that 2008
 snapshot in our archive. I'll mail out when it shows up in our snail mail.

 --tomasz

 Erik Zachte wrote:
 I'm thrilled. Big thanks to Tim and Tomasz for pulling this off.
 For the record the 2008-10-03 dump existed for a short while only.
 It evaporated before wikistats and many others could parse it,
 so now we can finally catch up on a 3.5 (!) year backlog.

 Erik Zachte

 -Original Message-
 From: wikitech-l-boun...@lists.wikimedia.org [mailto:wikitech-l-
 boun...@lists.wikimedia.org] On Behalf Of Tomasz Finc
 Sent: Thursday, March 11, 2010 4:11
 To: Wikimedia developers; xmldatadumps-admi...@lists.wikimedia.org;
 xmldatadu...@lists.wikimedia.org
 Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-
 meta-history.xml.bz2 :D

 New full history en wiki snapshot is hot off the presses!

 It's currently being checksummed which will take a while for 280GB+ of
 compressed data but for those brave souls willing to test please grab it
 from

 http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2

 and give us feedback about its quality. This run took just over a month
 and gained a huge speed up after Tim's work on re-compressing ES. If we
 see no hiccups with this data snapshot, I'll start mirroring it to other
 locations (internet archive, amazon public data sets, etc).

 For those not familiar, the last successful run that we've seen of this
 data goes all the way back to 2008-10-03. That's over 1.5 years of
 people waiting to get access to these data bits.

 I'm excited to say that we seem to have it :)

 --tomasz



Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-10 Thread Tomasz Finc
Brian J Mingus wrote:
 
 On Wed, Mar 10, 2010 at 8:54 PM, Tomasz Finc tf...@wikimedia.org wrote:
 
  Yup, that's the one. If you have a fast upload pipe then I'm more than
  happy to set up space for it. Otherwise it should be arriving in our
  snail mail after a couple of days.
 
 -tomasz
 
 
 Anyone may download the file from me here:
 
 http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
 
 The md5sum is:
 
 20a201afc05a4e5f2f6c3b9b7afa225c  enwiki-20080103-pages-meta-history.xml.7z
 
 The file size is:
 
 18522193111 (~18 gigabytes)
 
 I'm sure you will find my pipe fat enough..;-)
 
 
 
 

That seems way too tiny to be the real thing.

--tomasz
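For anyone mirroring or checking the old file, a quick sanity check (assuming
the standard p7zip and GNU coreutils tools; the expected md5sum is the one
posted above):

  md5sum enwiki-20080103-pages-meta-history.xml.7z
  7z l enwiki-20080103-pages-meta-history.xml.7z | tail -n 3

The first command verifies the download; the tail of the 7z listing shows the
total uncompressed size, which is the quickest way to judge whether an 18 GB
archive really holds a full history dump.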
