FYI
--- Begin Message ---
Hi all,
Congrats Ariel! :) The sum of the pages-meta-history files for the last two
enwiki dumps is 342.7GB for the 20110115 dump and 353.5GB for the
20110317 dump, which shows that the overall dump size grew over those 2
months. Seven of the individually numbered pages-meta-history files
shrank while eight grew from 20110115 to
20110317. By far the biggest decrease was the
pages-meta-history10.xml.bz2 file, which dropped from 18.7GB down to
1.9GB. I think there are probably missing revisions in that page ID
range.
Here are some historical dump sizes for comparison to show the growth of these
files:
enwiki-20060816-pages-meta-history.xml.7z 5.08GB
enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z (7z compression in progress)
Here's a graph of this data, showing that the dump file size growth seems to be
pretty linear:
(chart x-axis starts from 20060816 dump and ends at 20110115 dump)
"http://nekrom.com/wikipedia/enwiki%20history%20dump%20file%20size%20over%20time.png"
cheers,
Jamie
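As a rough check of the "pretty linear" reading, a least-squares line can be fitted to the five 7z sizes listed in this message. This is only a sketch: the helper names (linear_fit, days_since_first) are made up for illustration, and five data points are too few to rule out other growth curves.

```python
from datetime import date

# (dump date, 7z size in GB) copied from the list in this message
DUMPS_7Z = [
    (date(2006, 8, 16), 5.08),
    (date(2007, 4, 2), 11.3),
    (date(2008, 1, 3), 17.2),
    (date(2010, 1, 30), 31.8),
    (date(2011, 1, 15), 38.0),
]


def days_since_first(dumps):
    """Convert (date, size) pairs to (days since the first dump, size)."""
    first = dumps[0][0]
    return [((d - first).days, size) for d, size in dumps]


def linear_fit(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


points = days_since_first(DUMPS_7Z)
slope, intercept = linear_fit(points)
print(f"growth: {slope:.4f} GB/day (about {slope * 365.25:.1f} GB/year)")

# Residuals show how far each dump sits from the fitted line.
for days, size in points:
    print(f"day {days:5d}: actual {size:6.2f} GB, fitted {slope * days + intercept:6.2f} GB")
```

Small residuals relative to the sizes would support the linear reading; a systematic curve in them would argue against it.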
----- Original Message -----
From: "Ariel T. Glenn" <[email protected]>
Date: Tuesday, March 29, 2011 3:24 pm
Subject: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: [email protected]
Cc: [email protected]
> Well, that used up all my good luck for the year, but the bz2s are ready
> for download. The md5sums are still calculating, give them a couple
> hours to show up. If all continues to go well we'll have the 7z files
> in 4-5 days.
>
> As before I do not plan to provide a single 350gb file of the bz2, nor a
> single 7z file for download.
>
> Happy trails,
>
> Ariel
>
>
> _______________________________________________
> Xmldatadumps-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
--- End Message ---
--- Begin Message ---
----- Original Message -----
From: Brian J Mingus <[email protected]>
Date: Tuesday, March 29, 2011 7:15 pm
Subject: Re: [Wikitech-l] [Xmldatadumps-l] March 17 en wikipedia history bz2
files ready
To: Wikimedia developers <[email protected]>
Cc: Jamie Morken <[email protected]>, "Ariel T. Glenn" <[email protected]>,
[email protected]
> >
> According to this data the 7z dump for enwp will reach 1
> terabyte on Jan 2,
> 2145.
>
> =)
Hi,
I made a graph of the uncompressed XML file size for the enwiki
pages-meta-history files over time. I thought that these files would be growing
exponentially, but they appear to grow linearly. For comparison, in 2145 the raw
XML should be about 178 TB I think, so the raw XML is growing linearly about
180x faster than the 7z files.
"http://nekrom.com/wikipedia/enwiki%20history%20uncompressed%20XML%20dump%20file%20size%20over%20time.png"
(data below)
cheers,
Jamie
enwiki-20060816-pages-meta-history.xml 782741875000 (728.99 GB)
enwiki-20070402-pages-meta-history.xml 1763048493749 (1641.97 GB) (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml 2807444044080 (2614.64 GB) (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml 5873134833455 (5469.78 GB) (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml 7218617857754 (6722.86 GB) (350 days since previous dump)
enwiki-20110115-pages-meta-history1.xml 1 080 719 385 129
enwiki-20110115-pages-meta-history2.xml 677 956 948 289
enwiki-20110115-pages-meta-history3.xml 550 889 319 423
enwiki-20110115-pages-meta-history4.xml 447 001 611 247
enwiki-20110115-pages-meta-history5.xml 453 700 983 270
enwiki-20110115-pages-meta-history6.xml 540 208 590 115
enwiki-20110115-pages-meta-history7.xml 458 817 000 243
enwiki-20110115-pages-meta-history8.xml 649 710 293 818
enwiki-20110115-pages-meta-history9.xml 471 183 250 318
enwiki-20110115-pages-meta-history10.xml 406 115 459 739
enwiki-20110115-pages-meta-history11.xml 342 840 308 580
enwiki-20110115-pages-meta-history12.xml 310 507 626 798
enwiki-20110115-pages-meta-history13.xml 362 264 384 002
enwiki-20110115-pages-meta-history14.xml 269 988 897 698
enwiki-20110115-pages-meta-history15.xml 196 713 799 085
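The "about 180x" figure can be sanity-checked by fitting a least-squares line to both series and comparing the slopes. This is a sketch under one loud assumption: the "GB" of the 7z sizes is taken to be the same unit as the "GB" in the list above, which works out to bytes divided by 2^30. The slope helper is illustrative, not anything used to produce the graphs.

```python
from datetime import date

GIB = 2 ** 30  # the "GB" figures in the raw XML list equal bytes / 2**30

# (dump date, 7z size in GB, uncompressed XML size in bytes), from the thread
DUMPS = [
    (date(2006, 8, 16), 5.08, 782741875000),
    (date(2007, 4, 2), 11.3, 1763048493749),
    (date(2008, 1, 3), 17.2, 2807444044080),
    (date(2010, 1, 30), 31.8, 5873134833455),
    (date(2011, 1, 15), 38.0, 7218617857754),
]


def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


days = [(d - DUMPS[0][0]).days for d, _, _ in DUMPS]
slope_7z = slope(days, [gb for _, gb, _ in DUMPS])
slope_xml = slope(days, [size / GIB for _, _, size in DUMPS])
ratio = slope_xml / slope_7z

print(f"7z : {slope_7z:.4f} GB/day")
print(f"XML: {slope_xml:.4f} GB/day")
print(f"raw XML grows roughly {ratio:.0f}x faster than the 7z files")
```

If the 7z sizes were actually reported in units of 10^9 bytes, the ratio would shift by about 7%, which does not change the order of magnitude.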
>
> --
> Brian Mingus
> Graduate student
> Computational Cognitive Neuroscience Lab
> University of Colorado at Boulder
>
--- End Message ---
--- Begin Message ---
Hi,
Thanks for the info. While I was at it, I did some more checking of the history
dump file sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z 434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z 289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z 248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z 216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z 198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z 176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z 161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z 208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z 126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z 112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z 117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z 118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z 133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z 107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z 83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it
has the highest compression ratio (most revisions of established articles
contain only minor changes).
The pages-meta-history15 file contains the most recently created articles, which
have the fewest revisions but tend to have larger changes relative to the overall
article size, and thus it has the lowest 7z compression ratio.
enwiki-20110115-pages-meta-history8.xml is the clearest exception to the pattern
of decreasing compression ratios (pieces 11 to 13 also rise slightly).
That's all I can report without actually looking inside these files! :)
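For anyone who wants to check the pattern mechanically, here is a small sketch that lists every piece whose ratio is higher than that of the piece before it. The function name rising_pieces is just an illustrative choice; the numbers are the ones reported above.

```python
# 7z compression ratios for pieces 1..15 of the 20110115 dump, as listed above
RATIOS = [434.99, 289.46, 248.72, 216.29, 198.67, 176.94, 161.42, 208.59,
          126.86, 112.10, 117.27, 118.88, 133.07, 107.10, 83.24]


def rising_pieces(ratios):
    """Return 1-based piece numbers whose ratio exceeds the previous piece's."""
    return [i + 1 for i in range(1, len(ratios)) if ratios[i] > ratios[i - 1]]


def jump_sizes(ratios):
    """Map each rising piece number to how much its ratio rose."""
    return {n: round(ratios[n - 1] - ratios[n - 2], 2) for n in rising_pieces(ratios)}


print(rising_pieces(RATIOS))
print(jump_sizes(RATIOS))
```

A strict comparison like this flags small rises as well as large ones, so a tolerance may be worth adding if only big deviations such as piece 8 are of interest.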
cheers,
Jamie
----- Original Message -----
From: "Ariel T. Glenn" <[email protected]>
Date: Tuesday, March 29, 2011 11:43 pm
Subject: Re: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Jamie Morken <[email protected]>
Cc: [email protected], [email protected]
> The individually numbered files change sizes radically because I'm
> moving around start and end points. You can ignore that.
>
> I am looking at piece 10 however to see why it's smaller: ah. I have a
> typo in the size for that one, I asked for only 200000 pages to go in it
> instead of the 240000 I intended :-D And so that's all that went in
> (minus deleted pages). Nothing's missing though; anything "extra"
> winds up in the last piece (15). You can look at the stub files to
> verify that.
>
> FWIW we'll be juggling the number of pages per chunk on a regular basis.
>
> Ariel
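The effect Ariel describes can be illustrated with a toy splitter. This is a hypothetical sketch, not the actual dump scripts: split_pages and its arguments are invented here, and it simply assumes each piece except the last is given a requested page count while the last piece takes whatever remains.

```python
def split_pages(page_ids, pages_per_piece):
    """Split an ordered list of page IDs into pieces.

    pages_per_piece holds the requested page count for every piece except
    the last one; the last piece receives all remaining pages.
    (Hypothetical helper, for illustration only.)
    """
    pieces, start = [], 0
    for count in pages_per_piece:
        pieces.append(page_ids[start:start + count])
        start += count
    pieces.append(page_ids[start:])  # leftovers end up in the last piece
    return pieces


# Toy example: 10 pages, where piece 2 is asked for 2 pages instead of 4.
pages = list(range(1, 11))
intended = split_pages(pages, [3, 4])
with_typo = split_pages(pages, [3, 2])

print(intended)
print(with_typo)
```

In both runs every page appears exactly once, so a too-small count shrinks one piece and enlarges the last one without losing anything.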
>
> On 29-03-2011, Tue, at 17:08 -0700, Jamie Morken wrote:
--- End Message ---
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l