Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-18 Thread Mike Dupont
Hello People,
I have completed my first set of uploads of the OSM/FOSM dataset (350 GB
unpacked) to archive.org:
http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

We could do something similar with Wikipedia. The bucket size on
archive.org is 10 GB, so we need to split the data up in a way that keeps
it useful. I did this by putting each object on one line; each file
contains its own full data records plus the records that belong to the
previous and next blocks, so the blocks can be processed almost standalone.
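
A minimal sketch of that kind of split, assuming a line-oriented dump where
every record is one line (the file naming, the 10 GB target and the
single-record overlap are illustrative assumptions, not the exact script used
for the OSM upload):

# Illustrative sketch only: split a line-oriented dump into numbered parts of
# roughly max_bytes each, copying the last record of the previous part and the
# first record of the next part into each file so every part can be processed
# almost standalone.
def split_with_overlap(src_path, prefix, max_bytes=10 * 1024**3):
    part, size, prev_tail = 0, 0, None
    out = open("%s-%04d" % (prefix, part), "wb")
    with open(src_path, "rb") as src:
        for line in src:
            if size and size + len(line) > max_bytes:
                out.write(line)           # copy of the next part's first record
                out.close()
                part += 1
                size = 0
                out = open("%s-%04d" % (prefix, part), "wb")
                if prev_tail is not None:
                    out.write(prev_tail)  # copy of the previous part's last record
                    size += len(prev_tail)
            out.write(line)
            size += len(line)
            prev_tail = line
    out.close()

# e.g. split_with_overlap("fosm-objects.ldjson", "fosm-part")  # hypothetical names

The overlap costs one duplicated record per boundary, but it lets a consumer
process any single part without having to open its neighbours.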

mike



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-18 Thread emijrp
There is no such 10 GB limit; see
http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (a 238 GB example).

ArchiveTeam/WikiTeam is uploading some dumps to the Internet Archive. If you
want to join the effort, use the mailing list
https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.

2012/5/18 Mike Dupont jamesmikedup...@googlemail.com

 Hello People,
 I have completed my first set of uploads of the OSM/FOSM dataset (350 GB
 unpacked) to archive.org:
 http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

 We could do something similar with Wikipedia. The bucket size on
 archive.org is 10 GB, so we need to split the data up in a way that keeps
 it useful. I did this by putting each object on one line; each file
 contains its own full data records plus the records that belong to the
 previous and next blocks, so the blocks can be processed almost standalone.

 mike





-- 
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT http://code.google.com/p/avbot/ |
StatMediaWiki http://statmediawiki.forja.rediris.es |
WikiEvidens http://code.google.com/p/wikievidens/ |
WikiPapers http://wikipapers.referata.com |
WikiTeam http://code.google.com/p/wikiteam/
Personal website: https://sites.google.com/site/emijrp/


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Mike Dupont
On Thu, May 17, 2012 at 6:06 AM, John phoenixoverr...@gmail.com wrote:
 If you're willing to foot the bill for the new hardware
 I'll gladly prove my point

Given the millions of dollars that Wikipedia has, it should not be a
problem to provide such resources for a good cause like that.

-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread J Alexandr Ledbury-Romanov
I'd like to point out that this increasingly technical conversation
probably belongs either on wikitech-l or off-list, and that the strident
tone of the comments is fast becoming inappropriate.

Alex
Wikimedia-l list administrator


2012/5/17 Anthony wikim...@inbox.org

 On Thu, May 17, 2012 at 2:06 AM, John phoenixoverr...@gmail.com wrote:
  On Thu, May 17, 2012 at 1:52 AM, Anthony wikim...@inbox.org wrote:
  On Thu, May 17, 2012 at 1:22 AM, John phoenixoverr...@gmail.com wrote:
   Anthony, the process is linear: you have a PHP script inserting X number
   of rows per Y time frame.
 
  Amazing.  I need to switch all my databases to MySQL.  It can insert X
  rows per Y time frame, regardless of whether the database is 20
  gigabytes or 20 terabytes in size, regardless of whether the average
  row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
  RAID array or a cluster of servers, etc.
 
  When referring to X over Y time, it's an average of, say, 1000 revisions
  per 1 minute; any X over Y period must be considered with averages in
  mind, or getting a count wouldn't be possible.

 The *average* en.wikipedia revision is more than twice the size of the
 *average* simple.wikipedia revision.  The *average* performance of a
 20 gig database is faster than the *average* performance of a 20
 terabyte database.  The *average* performance of your laptop's thumb
 drive is different from the *average* performance of a(n array of)
 drive(s) which can handle 20 terabytes of data.

  If you set up your server/hardware correctly it will compress the text
  information during insertion into the database

 Is this how you set up your simple.wikipedia test?  How long does it
 take to import the data if you're using the same compression mechanism as
 WMF (which, you didn't answer, but I assume is concatenation and
 compression).  How exactly does this work during insertion anyway?
 Does it intelligently group sets of revisions together to avoid
 decompressing and recompressing the same revision several times?  I
 suppose it's possible, but that would introduce quite a lot of
 complication into the import script, slowing things down dramatically.
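
A rough sketch of the "group and compress" idea being discussed here:
compress each fixed-size batch of consecutive revision texts exactly once
during import. This is only an illustration of the concept (using zlib and an
ad hoc record separator), not WMF's actual storage scheme or compressOld.php:

import zlib

# Concatenate-and-compress during import: each batch of consecutive revision
# texts is compressed once, so no revision is ever decompressed and
# recompressed along the way. Adjacent revisions of the same page share most
# of their text, so each batch compresses very well.
def pack_revisions(revision_texts, batch_size=20):
    blobs = []
    for i in range(0, len(revision_texts), batch_size):
        batch = revision_texts[i:i + batch_size]
        raw = b"\x00".join(t.encode("utf-8") for t in batch)  # illustrative format
        blobs.append(zlib.compress(raw, 9))
    return blobs

The trade-off is that reading any single revision later means decompressing
its whole batch.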

 What about the answers to my other questions?

  If you want to put your money where your mouth is, import
  en.wikipedia.  It'll only take 5 days, right?
 
  If I actually had a server or the disk space to do it I would, just to
  prove that your smartass comments are as stupid as they actually are.
  However, given my current resource limitations (fairly crappy internet
  connection, older laptops, and lack of HDD), I tried to select something
  that could give reliable benchmarks. If you're willing to foot the bill
  for the new hardware, I'll gladly prove my point

 What you seem to be saying is that you're *not* putting your money
 where your mouth is.

 Anyway, if you want, I'll make a deal with you.  A neutral third party
 rents the hardware at Amazon Web Services (AWS).  We import
 simple.wikipedia full history (concatenating and compressing during
 import).  We take the ratio of the number of revisions in en.wikipedia to
 the number of revisions in simple.wikipedia.  We import en.wikipedia full
 history (concatenating and compressing during import).  If the ratio
 of time it takes to import en.wikipedia vs simple.wikipedia is greater
 than or equal to twice the ratio of revisions, then you reimburse the
 third party.  If the ratio of import time is less than twice the ratio
 of revisions (you claim it is linear, therefore it'll be the same
 ratio), then I reimburse the third party.
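
Spelled out as a small calculation (the function and any inputs fed into it
are placeholders for whatever the neutral third party actually measures):

# Sketch of the proposed decision rule: if import time scales worse than twice
# the revision ratio, John reimburses the third party; otherwise Anthony does.
def who_reimburses(t_en, t_simple, revs_en, revs_simple):
    time_ratio = t_en / float(t_simple)
    rev_ratio = revs_en / float(revs_simple)
    return "John" if time_ratio >= 2 * rev_ratio else "Anthony"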

 Either way, we save the new dump, with the processing already done,
 and send it to archive.org (and WMF if they're willing to host it).
 So we actually get a useful result out of this.  It's not just for the
 purpose of settling an argument.

 Either of us can concede defeat at any point, and stop the experiment.
  At that point if the neutral third party wishes to pay to continue
 the job, s/he would be responsible for the additional costs.

 Shouldn't be too expensive.  If you concede defeat after 5 days, then
 your CPU-time costs are $54 (assuming Extra Large High Memory
 Instance).  Including 4 terabytes of EBS (which should be enough if
 you compress on the fly) for 5 days should be less than $100.
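
As a back-of-the-envelope check of those figures, assuming 2012-era prices of
roughly $0.45/hour for a High-Memory Extra Large (m2.xlarge) instance and
about $0.10 per GB-month for standard EBS (both rates are assumptions, not
stated in the thread):

days = 5
cpu_cost = days * 24 * 0.45              # ~ $54 for 120 instance-hours
ebs_cost = 4096 * 0.10 * (days / 30.0)   # ~ $68 for 4 TB of EBS over 5 days
print("CPU: $%.0f, EBS: $%.0f" % (cpu_cost, ebs_cost))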

 I'm tempted to do it even if you don't take the bet.




Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Neil Harris

On 17/05/12 12:49, Anthony wrote:

Please have someone at WMF coordinate this so that there aren't
multiple requests made.  In my opinion, it should preferably be made
by a WMF employee.

Fill out the form at
https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry

Tell them you want to create a public data set which is a snapshot of
the English Wikipedia.  We can coordinate any questions, and any
implementation details, on a separate list.



That's a fantastic idea, and would give en: Wikipedia yet another public 
replica for very little effort. I would imagine that if they are willing 
to host enwiki, they may also be willing to host most, or all, of the 
other projects.


It will also mean that running Wikipedia data-munching experiments on 
EC2 will become much easier.


Neil




Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Kim Bruning
On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
 
 In fact, I think someone at WMF should contact Amazon and see if
 they'll let us conduct the experiment for free, in exchange for us
 creating the dump for them to host as a public data set
 (http://aws.amazon.com/publicdatasets/).


That sounds like an excellent plan. At the same time, it might be useful to
get Archive Team involved.

* They have warm bodies. (always useful, one can never have enough volunteers ;)
* They have experience with very large datasets
* They'd be very happy to help (it's their mission)
* Some of them may be able to provide Sufficient Storage(tm) and server
  capacity. Saves us the Amazon AWS bill.
* We might set a precedent where others might provide their data to AT
  directly too.

AT's mission dovetails nicely with ours. We provide the sum of all human
knowledge to people. AT ensures that the sum of all human knowledge is not
subtracted from.


sincerely,
Kim Bruning



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
I'll run a quick benchmark and import the full history of simple.wikipedia
into my wiki-on-a-stick on my laptop, and give an exact duration.


On Thu, May 17, 2012 at 12:26 AM, John phoenixoverr...@gmail.com wrote:

 Toolserver is a clone of the WMF servers minus files; they run database
 replication of all wikis. These times are dependent on available hardware
 and may vary, but should provide a decent estimate.



 On Thu, May 17, 2012 at 12:23 AM, Anthony wikim...@inbox.org wrote:

 On Thu, May 17, 2012 at 12:18 AM, John phoenixoverr...@gmail.com wrote:
  take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
  exactly how to import an existing dump. I know the process of re-importing
  a cluster for the toolserver normally takes just a few days when they have
  the needed dumps.

 Toolserver doesn't have full history, does it?





Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 12:30 AM, John phoenixoverr...@gmail.com wrote:
 I'll run a quick benchmark and import the full history of simple.wikipedia
 into my wiki-on-a-stick on my laptop, and give an exact duration.

Simple.wikipedia is nothing like en.wikipedia.  For one thing, there's
no need to turn on $wgCompressRevisions with simple.wikipedia.

Is $wgCompressRevisions still used?  I haven't followed this in quite a while.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Mike Dupont
Well, to be honest, I am still upset about how much data is deleted from
Wikipedia because it is not notable; there are so many articles that I
might be interested in that are lost in the same garbage as spam and other
things. We should make non-notable but non-harmful articles available in
the backups as well.
mike

On Thu, May 17, 2012 at 2:28 AM, Kim Bruning k...@bruning.xs4all.nl wrote:
 On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
 I know from experience that a wiki can be rebuilt from any one of the
 dumps that are provided; (pages-meta-current), for example, contains
 everything needed to reboot a site except its user database
 (names/passwords etc.). See
 http://www.mediawiki.org/wiki/Manual:Moving_a_wiki


 Sure. Does this include all images, including Commons images, eventually
 converted to operate locally?

 I'm thinking about full snapshot-and-later-restore, say 25 or 50 years
 from now, or in an academic setting (or, FSM forbid, in a worst-case
 scenario, knock on wood). That's what the AT folks are most interested in.

 ==Fire Drill==
 Has anyone recently set up a full external duplicate of (for instance) en.wp?
 This includes all images, all discussions, and all page history (excepting
 the user accounts and deleted pages).

 This would be a useful and important exercise, possibly to be repeated once
 per year.

 I get a sneaky feeling that the first few iterations won't go so well.

 I'm sure AT would be glad to help out with the running of these fire drills,
 as it seems to be in line with their mission.

 sincerely,
        Kim Bruning




-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 1:22 AM, John phoenixoverr...@gmail.com wrote:
 Anthony, the process is linear: you have a PHP script inserting X number of
 rows per Y time frame.

Amazing.  I need to switch all my databases to MySQL.  It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.

 Yes, rebuilding the externallinks, links, and langlinks tables
 will take some additional time and won't scale.

And this is part of the process too, right?

 However, I have been working
 with the toolserver since 2007 and I've lost count of the number of times
 that the TS has needed to re-import a cluster (s1-s7), and even enwiki can
 be done in a semi-reasonable timeframe.

Re-importing how?  From the compressed XML full history dumps?

 The WMF actually compresses all text
 blobs, not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
WMF using gzip or object?

 Complete download and decompression of simple
 only took 20 minutes on my two-year-old consumer-grade laptop with a standard
 home cable internet connection; the same download on the toolserver (minus
 decompression) took 88 s. Yeah, importing will take a little longer but
 shouldn't be that big of a deal.

For the full-history English Wikipedia it *is* a big deal.

If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror of en.wikipedia up and running.

Do you plan to run compressOld.php?  Are you going to import
everything in plain text first, and *then* start compressing?  Seems
like an awful lot of wasted hard drive space.

 There will also be some needed cleanup tasks.
 However, the main point is that archiving and restoring WMF wikis isn't an
 issue, and with moderately recent hardware is no big deal. I'm putting my
 money where my mouth is and getting actual, valid stats and figures. Yes,
 it may not be an exactly 1:1 ratio when scaling up, but given the basics of
 how importing a dump functions it should remain close to the same ratio.

If you want to put your money where your mouth is, import
en.wikipedia.  It'll only take 5 days, right?
