Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-18 Thread Mike Dupont
There is no 10 GB limit, but that is the recommended bucket size if you
want to split up the file, according to my recent discussion with the
archive.org team, who have been helping me optimize the storage.
My idea is to make smaller blocks that can be fetched quickly, so that
someone reading an article, for example, could load just the data
needed to display it, available via JSON(P) or XML/text from a file.
We could make Wikipedia, in a read-only mode, hosted entirely on
archive.org without a database server, by encoding the search binary
trees as JSON data also stored on archive.org; the clients can perform
the searches themselves.
That is my current research on fosm.org, and I hope it can apply to
Wikipedia as well.
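
A minimal sketch of that client-side lookup, assuming a hypothetical layout in which a
sorted title index is split into flat JSON block files hosted on archive.org (the item
name, file names, and index format below are illustrative, not the actual fosm tooling):

import bisect
import json
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# Hypothetical static layout on archive.org (names are made up for illustration):
#   index.json        -> ["Aardvark", "Badger", ...]  first title of each block
#   block-00000.json  -> {"Aardvark": "articles-00000.json#0", ...}
BASE = "https://archive.org/download/EXAMPLE-wiki-readonly/"

def fetch_json(name):
    return json.loads(urlopen(BASE + name).read().decode("utf-8"))

def lookup(title):
    """Client-side search: binary-search the small index, then fetch one block."""
    firsts = fetch_json("index.json")            # small and cacheable
    i = max(bisect.bisect_right(firsts, title) - 1, 0)
    block = fetch_json("block-%05d.json" % i)    # only the block that can hold the title
    return block.get(title)                      # pointer to the article data, or None

The point is that a static file host plus a little client-side code can replace the
database server for read-only lookups.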
mike

On Fri, May 18, 2012 at 9:41 AM, emijrp  wrote:
> There is no such 10GB limit,
> http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (238 GB example)
>
> ArchiveTeam/WikiTeam is uploading some dumps to Internet Archive, if you
> want to join the effort use the mailing list
> https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.
>
> 2012/5/18 Mike Dupont 
>
>> Hello People,
>> I have completed my first set in uploading the osm/fosm dataset (350gb
>> unpacked) to archive.org
>> http://osmopenlayers.blogspot.de/2012/05/upload-finished.html
>>
>> We can do something similar with wikipedia, the bucket size of
>> archive.org is 10gb, we need to split up the data in a way that it is
>> useful. I have done this by putting each object on one line and each
>> file contains the full data records and the parts that belong to the
>> previous block and next block, so you are able to process the blocks
>> almost stand alone.
>>
>> mike
>>
>> ___
>> Wikimedia-l mailing list
>> Wikimedia-l@lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>>
>
>
>
> --
> Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
> Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT | StatMediaWiki | WikiEvidens | WikiPapers | WikiTeam
> Personal website: https://sites.google.com/site/emijrp/
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
Contributor FOSM, the CC-BY-SA map of the world http://fosm.org
Mozilla Rep https://reps.mozilla.org/u/h4ck3rm1k3



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-18 Thread emijrp
There is no such 10 GB limit; see
http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (a 238 GB example).

ArchiveTeam/WikiTeam is uploading some dumps to the Internet Archive; if you
want to join the effort, use the mailing list
https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.

2012/5/18 Mike Dupont 

> Hello People,
> I have completed my first set in uploading the osm/fosm dataset (350gb
> unpacked) to archive.org
> http://osmopenlayers.blogspot.de/2012/05/upload-finished.html
>
> We can do something similar with wikipedia, the bucket size of
> archive.org is 10gb, we need to split up the data in a way that it is
> useful. I have done this by putting each object on one line and each
> file contains the full data records and the parts that belong to the
> previous block and next block, so you are able to process the blocks
> almost stand alone.
>
> mike
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>



-- 
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT | StatMediaWiki | WikiEvidens | WikiPapers | WikiTeam
Personal website: https://sites.google.com/site/emijrp/


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-18 Thread Mike Dupont
Hello people,
I have completed my first set of uploads of the osm/fosm dataset (350 GB
unpacked) to archive.org:
http://osmopenlayers.blogspot.de/2012/05/upload-finished.html

We can do something similar with Wikipedia. The bucket size of
archive.org is 10 GB, so we need to split up the data in a way that is
still useful. I have done this by putting each object on one line; each
file contains its full data records plus the parts that belong to the
previous and next blocks, so the blocks can be processed almost
stand-alone.
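
A minimal sketch of that splitting scheme, assuming a line-oriented input file with one
record per line; the block size and overlap counts here are illustrative, not the exact
fosm tooling:

TARGET_BLOCK_BYTES = 10 * 1024 ** 3   # ~10 GB buckets, as discussed above
OVERLAP_LINES = 100                   # records shared with the neighbouring block

def split_blocks(lines, target_bytes=TARGET_BLOCK_BYTES, overlap=OVERLAP_LINES):
    """Yield lists of lines; consecutive blocks share `overlap` trailing records."""
    block, size = [], 0
    for line in lines:
        block.append(line)
        size += len(line)
        if size >= target_bytes:
            yield block
            tail = block[-overlap:]           # carry the tail into the next block
            block, size = list(tail), sum(len(l) for l in tail)
    if block:
        yield block

def write_blocks(src_path, prefix):
    with open(src_path, "rb") as src:
        for i, block in enumerate(split_blocks(src)):
            with open("%s-%05d.txt" % (prefix, i), "wb") as out:
                out.writelines(block)

Because each file carries a little of its neighbours, a reader can process any one block
almost stand-alone, without fetching the whole dataset.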

mike



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Kim Bruning
On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
> 
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).


That sounds like an excellent plan. At the same time, it might be useful to get
Archive Team involved.

* They have warm bodies (always useful; one can never have enough volunteers ;)
* They have experience with very large datasets.
* They'd be very happy to help (it's their mission).
* Some of them may be able to provide Sufficient Storage(tm) and server
capacity, saving us the Amazon AWS bill.
* We might set a precedent where others provide their data to AT directly too.

AT's mission dovetails nicely with ours. We provide the sum of all human
knowledge to people. AT ensures that the sum of all human knowledge is not
subtracted from.


sincerely,
Kim Bruning



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Neil Harris

On 17/05/12 12:49, Anthony wrote:

> Please have someone at WMF coordinate this so that there aren't
> multiple requests made.  In my opinion, it should preferably be made
> by a WMF employee.
>
> Fill out the form at
> https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry
>
> Tell them you want to create a public data set which is a snapshot of
> the English Wikipedia.  We can coordinate any questions, and any
> implementation details, on a separate list.



That's a fantastic idea, and would give en: Wikipedia yet another public
replica for very little effort. I would imagine that if they are willing
to host enwiki, they may also be willing to host most, or all, of the
other projects.


It will also mean that running Wikipedia data-munching experiments on 
EC2 will become much easier.


Neil




Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Anthony
On Thu, May 17, 2012 at 8:11 AM, Thomas Dalton  wrote:
> On 17 May 2012 12:43, Anthony  wrote:
>> In fact, I think someone at WMF should contact Amazon and see if
>> they'll let us conduct the experiment for free, in exchange for us
>> creating the dump for them to host as a public data set
>> (http://aws.amazon.com/publicdatasets/).
>
> What dump are you going to create? You are starting from a dump, why
> can't Amazon just host that?

Because the XML dump is semi-useless - it's compressed in all the
wrong places to use for an actual running system.

Anyway, looking at how the AWS Public Data Sets work, it probably
would be best not to even create a dump, but just put up the running
(object compressed) database.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Thomas Dalton
On 17 May 2012 12:43, Anthony  wrote:
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).

What dump are you going to create? You are starting from a dump, why
can't Amazon just host that?



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Anthony
Please have someone at WMF coordinate this so that there aren't
multiple requests made.  In my opinion, it should preferably be made
by a WMF employee.

Fill out the form at
https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inquiry

Tell them you want to create a public data set which is a snapshot of
the English Wikipedia.  We can coordinate any questions, and any
implementation details, on a separate list.

On Thu, May 17, 2012 at 7:43 AM, Anthony  wrote:
> On Thu, May 17, 2012 at 7:27 AM, J Alexandr Ledbury-Romanov
>  wrote:
>> I'd like to point out that the increasingly technical nature of this
>> conversation probably belongs either on wikitech-l, or off-list, and that
>> the strident nature of the comments is fast approaching inappropriate.
>
> Really?  I think we're really getting somewhere.
>
> In fact, I think someone at WMF should contact Amazon and see if
> they'll let us conduct the experiment for free, in exchange for us
> creating the dump for them to host as a public data set
> (http://aws.amazon.com/publicdatasets/).
>
> In case you got lost in the technical details, the original post was
> asking "Has anyone recently set up a full-external-duplicate of (for
> instance) en.wp?" and suggesting that we should do this on a yearly
> basis as a fire drill.
>
> My latest post was a concrete proposal for doing exactly that.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Anthony
On Thu, May 17, 2012 at 7:27 AM, J Alexandr Ledbury-Romanov
 wrote:
> I'd like to point out that the increasingly technical nature of this
> conversation probably belongs either on wikitech-l, or off-list, and that
> the strident nature of the comments is fast approaching inappropriate.

Really?  I think we're really getting somewhere.

In fact, I think someone at WMF should contact Amazon and see if
they'll let us conduct the experiment for free, in exchange for us
creating the dump for them to host as a public data set
(http://aws.amazon.com/publicdatasets/).

In case you got lost in the technical details, the original post was
asking "Has anyone recently set up a full-external-duplicate of (for
instance) en.wp?" and suggesting that we should do this on a yearly
basis as a fire drill.

My latest post was a concrete proposal for doing exactly that.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread J Alexandr Ledbury-Romanov
I'd like to point out that the increasingly technical nature of this
conversation probably belongs either on wikitech-l, or off-list, and that
the strident nature of the comments is fast approaching inappropriate.

Alex
Wikimedia-l list administrator


2012/5/17 Anthony 

> On Thu, May 17, 2012 at 2:06 AM, John  wrote:
> > On Thu, May 17, 2012 at 1:52 AM, Anthony  wrote:
> >> On Thu, May 17, 2012 at 1:22 AM, John 
> wrote:
> >> > Anthony the process is linear, you have a php inserting X number of
> rows
> >> > per
> >> > Y time frame.
> >>
> >> Amazing.  I need to switch all my databases to MySQL.  It can insert X
> >> rows per Y time frame, regardless of whether the database is 20
> >> gigabytes or 20 terabytes in size, regardless of whether the average
> >> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> >> RAID array or a cluster of servers, etc.
> >
> > When refering to X over Y time, its an average of a of say 1000 revisions
> > per 1 minute, any X over Y period must be considered with averages in
> mind,
> > or getting a count wouldnt be possible.
>
> The *average* en.wikipedia revision is more than twice the size of the
> *average* simple.wikipedia revision.  The *average* performance of a
> 20 gig database is faster than the *average* performance of a 20
> terabyte database.  The *average* performance of your laptop's thumb
> drive is different from the *average* performance of a(n array of)
> drive(s) which can handle 20 terabytes of data.
>
> > If you setup your sever/hardware correctly it will compress the text
> > information during insertion into the database
>
> Is this how you set up your simple.wikipedia test?  How long does it
> take import the data if you're using the same compression mechanism as
> WMF (which, you didn't answer, but I assume is concatenation and
> compression).  How exactly does this work "during insertion" anyway?
> Does it intelligently group sets of revisions together to avoid
> decompressing and recompressing the same revision several times?  I
> suppose it's possible, but that would introduce quite a lot of
> complication into the import script, slowing things down dramatically.
>
> What about the answers to my other questions?
>
> >> If you want to put your money where your mouth is, import
> >> en.wikipedia.  It'll only take 5 days, right?
> >
> > If I actually had a server or the disc space to do it I would, just to
> prove
> > your smartass comments as stupid as they actually are. However given my
> > current resource limitations (fairly crappy internet connection, older
> > laptops, and lack of HDD) I tried to select something that could give
> > reliable benchmarks. If your willing to foot the bill for the new
> hardware
> > Ill gladly prove my point
>
> What you seem to be saying is that you're *not* putting your money
> where your mouth is.
>
> Anyway, if you want, I'll make a deal with you.  A neutral third party
> rents the hardware at Amazon Web Services (AWS).  We import
> simple.wikipedia full history (concatenating and compressing during
> import).  We take the ratio of revisions in simple.wikipedia to the
> ratio of revisions in en.wikipedia.  We import en.wikipedia full
> history (concatenating and compressing during import).  If the ratio
> of time it takes to import en.wikipedia vs simple.wikipedia is greater
> than or equal to twice the ratio of revisions, then you reimburse the
> third party.  If the ratio of import time is less than twice the ratio
> of revisions (you claim it is linear, therefore it'll be the same
> ratio), then I reimburse the third party.
>
> Either way, we save the new dump, with the processing already done,
> and send it to archive.org (and WMF if they're willing to host it).
> So we actually get a useful result out of this.  It's not just for the
> purpose of settling an argument.
>
> Either of us can concede defeat at any point, and stop the experiment.
>  At that point if the neutral third party wishes to pay to continue
> the job, s/he would be responsible for the additional costs.
>
> Shouldn't be too expensive.  If you concede defeat after 5 days, then
> your CPU-time costs are $54 (assuming Extra Large High Memory
> Instance).  Including 4 terabytes of EBS (which should be enough if
> you compress on the fly) for 5 days should be less than $100.
>
> I'm tempted to do it even if you don't take the bet.
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-17 Thread Anthony
On Thu, May 17, 2012 at 2:06 AM, John  wrote:
> On Thu, May 17, 2012 at 1:52 AM, Anthony  wrote:
>> On Thu, May 17, 2012 at 1:22 AM, John  wrote:
>> > Anthony the process is linear, you have a php inserting X number of rows
>> > per
>> > Y time frame.
>>
>> Amazing.  I need to switch all my databases to MySQL.  It can insert X
>> rows per Y time frame, regardless of whether the database is 20
>> gigabytes or 20 terabytes in size, regardless of whether the average
>> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
>> RAID array or a cluster of servers, etc.
>
> When refering to X over Y time, its an average of a of say 1000 revisions
> per 1 minute, any X over Y period must be considered with averages in mind,
> or getting a count wouldnt be possible.

The *average* en.wikipedia revision is more than twice the size of the
*average* simple.wikipedia revision.  The *average* performance of a
20 gig database is faster than the *average* performance of a 20
terabyte database.  The *average* performance of your laptop's thumb
drive is different from the *average* performance of a(n array of)
drive(s) which can handle 20 terabytes of data.

> If you setup your sever/hardware correctly it will compress the text
> information during insertion into the database

Is this how you set up your simple.wikipedia test?  How long does it
take to import the data if you're using the same compression mechanism as
WMF (which you didn't answer, but I assume is concatenation and
compression)?  How exactly does this work "during insertion", anyway?
Does it intelligently group sets of revisions together to avoid
decompressing and recompressing the same revision several times?  I
suppose it's possible, but that would introduce quite a lot of
complication into the import script, slowing things down dramatically.

What about the answers to my other questions?

>> If you want to put your money where your mouth is, import
>> en.wikipedia.  It'll only take 5 days, right?
>
> If I actually had a server or the disc space to do it I would, just to prove
> your smartass comments as stupid as they actually are. However given my
> current resource limitations (fairly crappy internet connection, older
> laptops, and lack of HDD) I tried to select something that could give
> reliable benchmarks. If your willing to foot the bill for the new hardware
> Ill gladly prove my point

What you seem to be saying is that you're *not* putting your money
where your mouth is.

Anyway, if you want, I'll make a deal with you.  A neutral third party
rents the hardware at Amazon Web Services (AWS).  We import the
simple.wikipedia full history (concatenating and compressing during
import).  We take the ratio of revisions in en.wikipedia to revisions
in simple.wikipedia.  We import the en.wikipedia full history
(concatenating and compressing during import).  If the ratio of the
time it takes to import en.wikipedia vs. simple.wikipedia is greater
than or equal to twice the revision ratio, then you reimburse the
third party.  If the import-time ratio is less than twice the revision
ratio (you claim the process is linear, therefore it'll be the same
ratio), then I reimburse the third party.
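
For concreteness, a sketch of how the bet would be settled; the revision counts and
import times below are placeholders, not measurements:

def settle_bet(simple_revisions, en_revisions, simple_hours, en_hours):
    """Decide who reimburses the third party under the terms proposed above."""
    revision_ratio = float(en_revisions) / simple_revisions
    time_ratio = float(en_hours) / simple_hours
    if time_ratio >= 2 * revision_ratio:
        return "import did not scale linearly: you reimburse"
    return "import scaled (close to) linearly: I reimburse"

# e.g. settle_bet(simple_revisions=4.5e6, en_revisions=500e6,
#                 simple_hours=2.0, en_hours=300.0)   # placeholder numbers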

Either way, we save the new dump, with the processing already done,
and send it to archive.org (and WMF if they're willing to host it).
So we actually get a useful result out of this.  It's not just for the
purpose of settling an argument.

Either of us can concede defeat at any point and stop the experiment.
At that point, if the neutral third party wishes to pay to continue
the job, s/he would be responsible for the additional costs.

Shouldn't be too expensive.  If you concede defeat after 5 days, then
your CPU-time costs are $54 (assuming Extra Large High Memory
Instance).  Including 4 terabytes of EBS (which should be enough if
you compress on the fly) for 5 days should be less than $100.

I'm tempted to do it even if you don't take the bet.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Mike Dupont
On Thu, May 17, 2012 at 6:06 AM, John  wrote:
> If your willing to foot the bill for the new hardware
> Ill gladly prove my point

Given the millions of dollars that Wikipedia has, it should not be a
problem to provide such resources for a good cause like that.

-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
On Thu, May 17, 2012 at 1:52 AM, Anthony  wrote:

> On Thu, May 17, 2012 at 1:22 AM, John  wrote:
> > Anthony the process is linear, you have a php inserting X number of rows
> per
> > Y time frame.
>
> Amazing.  I need to switch all my databases to MySQL.  It can insert X
> rows per Y time frame, regardless of whether the database is 20
> gigabytes or 20 terabytes in size, regardless of whether the average
> row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
> RAID array or a cluster of servers, etc.
>

When referring to X over Y time, it's an average of, say, 1000 revisions
per minute; any X-over-Y figure must be considered with averages in mind,
or getting a count wouldn't be possible.



> > Yes rebuilding the externallinks, links, and langlinks tables
> > will take some additional time and wont scale.
>
> And this is part of the process too, right?

That does not need to be completed prior to the site going live; it can be
done after making it public.

> That part isnt
> > However I have been working
> > with the toolserver since 2007 and Ive lost count of the number of times
> > that the TS has needed to re-import a cluster, (s1-s7) and even enwiki
> can
> > be done in a semi-reasonable timeframe.
>
> Re-importing how?  From the compressed XML full history dumps?


> > The WMF actually compresses all text
> > blobs not just old versions.
>
> Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
> WMF using gzip or object?
>
> > complete download and decompression of simple
> > only took 20 minutes on my 2 year old consumer grade laptop with a
> standard
> > home cable internet connection, same download on the toolserver (minus
> > decompression) was 88s. Yeah Importing will take a little longer but
> > shouldnt be that big of a deal.
>
> For the full history English Wikipedia it *is* a big deal.
>
> If you think it isn't, stop playing with simple.wikipedia, and tell us
> how long it takes to get a mirror up and running of en.wikipedia.
>
> Do you plan to run compressOld.php?  Are you going to import
> everything in plain text first, and *then* start compressing?  Seems
> like an awful lot of wasted hard drive space.
>

If you set up your server/hardware correctly, it will compress the text
information during insertion into the database; compressOld.php is
actually designed only for cases where you start with an uncompressed
configuration.
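
A toy illustration of the compression trade-off being discussed (pure Python, made-up
revision texts; this is not the MediaWiki code path, just the general idea of
per-revision versus grouped compression):

import zlib

# Fifty near-duplicate revisions, as successive wiki revisions usually are.
revisions = [("Lorem ipsum dolor sit amet. " * 200 + "edit %d" % i).encode("utf-8")
             for i in range(50)]

# Compress each revision on its own.
per_revision = sum(len(zlib.compress(r)) for r in revisions)

# Concatenate a group of revisions and compress them as one blob.
grouped = len(zlib.compress(b"\n".join(revisions)))

print("per-revision: %d bytes, grouped: %d bytes" % (per_revision, grouped))
# The grouped blob is typically much smaller, because text shared between revisions
# is only stored once, but reading or appending a single revision then means
# touching the whole group.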


> > There will also be some need cleanup tasks.
> > However the main issue, archiving and restoring wmf wikis isnt an issue,
> and
> > with moderately recent hardware is no big deal. Im putting my money
> where my
> > mouth is, and getting actual valid stats and figures. Yes it may not be
> an
> > exactly 1:1 ratio when scaling up, however given the basics of how
> importing
> > a dump functions it should remain close to the same ratio
>
> If you want to put your money where your mouth is, import
> en.wikipedia.  It'll only take 5 days, right?
>

If I actually had a server or the disk space to do it I would, just to
prove your smartass comments are as stupid as they actually are. However,
given my current resource limitations (a fairly crappy internet connection,
older laptops, and a lack of HDD space), I tried to select something that
could give reliable benchmarks. If you're willing to foot the bill for the
new hardware, I'll gladly prove my point.


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 1:22 AM, John  wrote:
> Anthony the process is linear, you have a php inserting X number of rows per
> Y time frame.

Amazing.  I need to switch all my databases to MySQL.  It can insert X
rows per Y time frame, regardless of whether the database is 20
gigabytes or 20 terabytes in size, regardless of whether the average
row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a
RAID array or a cluster of servers, etc.

> Yes rebuilding the externallinks, links, and langlinks tables
> will take some additional time and wont scale.

And this is part of the process too, right?

> However I have been working
> with the toolserver since 2007 and Ive lost count of the number of times
> that the TS has needed to re-import a cluster, (s1-s7) and even enwiki can
> be done in a semi-reasonable timeframe.

Re-importing how?  From the compressed XML full history dumps?

> The WMF actually compresses all text
> blobs not just old versions.

Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate?  Is
WMF using gzip or object?

> complete download and decompression of simple
> only took 20 minutes on my 2 year old consumer grade laptop with a standard
> home cable internet connection, same download on the toolserver (minus
> decompression) was 88s. Yeah Importing will take a little longer but
> shouldnt be that big of a deal.

For the full history English Wikipedia it *is* a big deal.

If you think it isn't, stop playing with simple.wikipedia, and tell us
how long it takes to get a mirror up and running of en.wikipedia.

Do you plan to run compressOld.php?  Are you going to import
everything in plain text first, and *then* start compressing?  Seems
like an awful lot of wasted hard drive space.

> There will also be some need cleanup tasks.
> However the main issue, archiving and restoring wmf wikis isnt an issue, and
> with moderately recent hardware is no big deal. Im putting my money where my
> mouth is, and getting actual valid stats and figures. Yes it may not be an
> exactly 1:1 ratio when scaling up, however given the basics of how importing
> a dump functions it should remain close to the same ratio

If you want to put your money where your mouth is, import
en.wikipedia.  It'll only take 5 days, right?



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Mike Dupont
Well, to be honest, I am still upset about how much data is deleted
from Wikipedia because it is not "notable"; there are so many articles
that I might be interested in that are lost in the same garbage as spam
and other things. We should make non-notable but non-harmful articles
available in the backups as well.
mike

On Thu, May 17, 2012 at 2:28 AM, Kim Bruning  wrote:
> On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
>> I know from experience that a wiki can be re-built from any one of the
>> dumps that are provided, (pages-meta-current) for example contains
>> everything needed to reboot a site except its user database
>> (names/passwords ect). see
>> http://www.mediawiki.org/wiki/Manual:Moving_a_wiki
>
>
> Sure. Does this include all images, including commons images, eventually
> converted to operate locally?
>
> I'm thinking about full snapshot-and-later-restore, say 25 or 50 years
> from now, or in an academic setting, (or FSM-forbid in a worst case scenario
> ). That's what the AT folks are most interested in.
>
> ==Fire Drill==
> Has anyone recently set up a full-external-duplicate of (for instance) en.wp?
> This includes all images, all discussions, all page history (excepting the 
> user
> accounts and deleted pages)
>
> This would be a useful and important exercise; possibly to be repeated once 
> per year.
>
> I get a sneaky feeling that the first few iterations won't go so well.
>
> I'm sure AT would be glad to help out with the running of these fire drills, 
> as
> it seems to be in line with their mission.
>
> sincerely,
>        Kim Bruning
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
Anthony, the process is linear: you have a PHP script inserting X number of
rows per Y time frame. Yes, rebuilding the externallinks, links, and langlinks
tables will take some additional time and won't scale. However, I have been
working with the toolserver since 2007 and I've lost count of the number of
times that the TS has needed to re-import a cluster (s1-s7), and even
enwiki can be done in a semi-reasonable timeframe. The WMF actually
compresses all text blobs, not just old versions. A complete download and
decompression of simple only took 20 minutes on my 2-year-old consumer-grade
laptop with a standard home cable internet connection; the same download
on the toolserver (minus decompression) took 88s. Yeah, importing will take a
little longer, but it shouldn't be that big of a deal. There will also be some
needed cleanup tasks. However, the main issue, archiving and restoring WMF
wikis, isn't an issue, and with moderately recent hardware is no big deal. I'm
putting my money where my mouth is, and getting actual, valid stats and
figures. Yes, it may not be exactly a 1:1 ratio when scaling up, however
given the basics of how importing a dump functions it should remain close
to the same ratio.
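
A rough way to get the revisions-per-minute figure being argued about is to stream a
dump and time the parse; a minimal sketch, assuming a pages-meta-history XML dump in
bz2 form (the file name is illustrative, and this measures XML parsing only, not MySQL
insertion, so treat it as an upper bound on any import rate):

import bz2
import time
import xml.etree.ElementTree as ET

DUMP = "simplewiki-latest-pages-meta-history.xml.bz2"   # illustrative file name

def revisions_per_minute(path=DUMP, report_every=100000):
    """Stream a MediaWiki XML dump and print revision parse throughput."""
    start, count = time.time(), 0
    with bz2.BZ2File(path) as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)                     # the <mediawiki> root element
        for event, elem in context:
            if event != "end":
                continue
            tag = elem.tag.rsplit("}", 1)[-1]       # strip the XML namespace
            if tag == "revision":
                count += 1
                if count % report_every == 0:
                    rate = count / ((time.time() - start) / 60.0)
                    print("%d revisions parsed, ~%.0f per minute" % (count, rate))
            elif tag == "page":
                root.clear()                        # drop finished pages to bound memory
    return count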

On Thu, May 17, 2012 at 12:54 AM, Anthony  wrote:

> On Thu, May 17, 2012 at 12:45 AM, John  wrote:
> > Simple.wikipedia is nothing like en.wikipedia I care to dispute that
> > statement, All WMF wikis are setup basically the same (an odd extension
> here
> > or there is different, and different namespace names at times) but for
> the
> > purpose of recovery simplewiki_p is a very standard example. this issue
> isnt
> > just about enwiki_p but *all* wmf wikis. Doing a data recovery for
> enwiki vs
> > simplewiki is just a matter of time, for enwiki a 5 day estimate would be
> > fairly standard (depending on server setup) and lower times for smaller
> > databases. typically you can explain it in a rate of X revisions
> processed
> > per Y time unit, regardless of the project. and that rate should be
> similar
> > for everything given the same hardware setup.
>
> Are you compressing old revisions, or not?  Does the WMF database
> compress old revisions, or not?
>
> In any case, I'm sorry, a 20 gig mysql database does not scale
> linearly to a 20 terabyte mysql database.
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 12:45 AM, John  wrote:
> Simple.wikipedia is nothing like en.wikipedia I care to dispute that
> statement, All WMF wikis are setup basically the same (an odd extension here
> or there is different, and different namespace names at times) but for the
> purpose of recovery simplewiki_p is a very standard example. this issue isnt
> just about enwiki_p but *all* wmf wikis. Doing a data recovery for enwiki vs
> simplewiki is just a matter of time, for enwiki a 5 day estimate would be
> fairly standard (depending on server setup) and lower times for smaller
> databases. typically you can explain it in a rate of X revisions processed
> per Y time unit, regardless of the project. and that rate should be similar
> for everything given the same hardware setup.

Are you compressing old revisions, or not?  Does the WMF database
compress old revisions, or not?

In any case, I'm sorry, but a 20 gigabyte MySQL database does not scale
linearly to a 20 terabyte MySQL database.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
*Simple.wikipedia is nothing like en.wikipedia*: I care to dispute that
statement. All WMF wikis are set up basically the same (an odd extension
here or there is different, and namespace names differ at times), but for
the purpose of recovery simplewiki_p is a very standard example. This issue
isn't just about enwiki_p but about *all* WMF wikis. Doing a data recovery for
enwiki vs. simplewiki is just a matter of time; for enwiki a 5-day estimate
would be fairly standard (depending on server setup), with lower times for
smaller databases. Typically you can express it as a rate of X revisions
processed per Y time unit, regardless of the project, and that rate should
be similar for everything given the same hardware setup.

On Thu, May 17, 2012 at 12:37 AM, Anthony  wrote:

> On Thu, May 17, 2012 at 12:30 AM, John  wrote:
> > Ill run a quick benchmark and import the full history of
> simple.wikipedia to
> > my laptop wiki on a stick, and give an exact duration
>
> Simple.wikipedia is nothing like en.wikipedia.  For one thing, there's
> no need to turn on $wgCompressRevisions with simple.wikipedia.
>
> Is $wgCompressRevisions still used?  I haven't followed this in quite a
> while.
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 12:30 AM, John  wrote:
> Ill run a quick benchmark and import the full history of simple.wikipedia to
> my laptop wiki on a stick, and give an exact duration

Simple.wikipedia is nothing like en.wikipedia.  For one thing, there's
no need to turn on $wgCompressRevisions with simple.wikipedia.

Is $wgCompressRevisions still used?  I haven't followed this in quite a while.



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
I'll run a quick benchmark, import the full history of simple.wikipedia
into my laptop wiki-on-a-stick, and give an exact duration.


On Thu, May 17, 2012 at 12:26 AM, John  wrote:

> Toolserver is a clone of the wmf servers minus files. they run a database
> replication of all wikis. these times are dependent on available hardware
> and may very, but should provide a decent estimate
>
>
>
> On Thu, May 17, 2012 at 12:23 AM, Anthony  wrote:
>
>> On Thu, May 17, 2012 at 12:18 AM, John  wrote:
>> > take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumpsfor
>> > exactly how to import an existing dump, I know the process of
>> re-importing
>> > a cluster for the toolserver is normally just a few days when they have
>> the
>> > needed dumps.
>>
>> Toolserver doesn't have full history, does it?
>>
>
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
Toolserver is a clone of the WMF servers minus files; they run database
replication of all wikis. These times depend on available hardware
and may vary, but should provide a decent estimate.


On Thu, May 17, 2012 at 12:23 AM, Anthony  wrote:

> On Thu, May 17, 2012 at 12:18 AM, John  wrote:
> > take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumpsfor
> > exactly how to import an existing dump, I know the process of
> re-importing
> > a cluster for the toolserver is normally just a few days when they have
> the
> > needed dumps.
>
> Toolserver doesn't have full history, does it?
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 12:18 AM, John  wrote:
> take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
> exactly how to import an existing dump, I know the process of re-importing
> a cluster for the toolserver is normally just a few days when they have the
> needed dumps.

Toolserver doesn't have full history, does it?



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Anthony
On Thu, May 17, 2012 at 12:13 AM, John  wrote:
> that two week estimate was given worst case scenario. Given the best case
> we are talking as little as a few hours for the smaller wikis to 5 days or
> so for a project the size of enwiki. (see
> http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.htmlfor
> progress on image dumps`)

Where are you getting these figures from?

Are you talking about a full history copy?

Also, what about the copyright issues (especially attribution)?



Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for
exactly how to import an existing dump. I know the process of re-importing
a cluster for the toolserver normally takes just a few days when they have
the needed dumps.

On Thu, May 17, 2012 at 12:13 AM, John wrote:

> that two week estimate was given worst case scenario. Given the best case
> we are talking as little as a few hours for the smaller wikis to 5 days or
> so for a project the size of enwiki. (see
> http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.htmlfor 
> progress on image dumps`)
>
>
> On Wed, May 16, 2012 at 11:10 PM, Kim Bruning wrote:
>
>> On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
>> > Except for files, getting a content clone up is relativity easy, and
>> can be
>> > done in a fairly quick order (aka less than two weeks for everything). I
>> > know there is talk about getting a rsync setup for images.
>>
>> Ouch, 2 weeks. We need the images to be replicable too though.
>>
>>
>> sincerely,
>>Kim Bruning
>>
>>
>> ___
>> Wikimedia-l mailing list
>> Wikimedia-l@lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>>
>
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
That two-week estimate was a worst-case scenario. In the best case we are
talking about as little as a few hours for the smaller wikis, up to 5 days
or so for a project the size of enwiki (see
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for
progress on image dumps).

On Wed, May 16, 2012 at 11:10 PM, Kim Bruning  wrote:

> On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
> > Except for files, getting a content clone up is relativity easy, and can
> be
> > done in a fairly quick order (aka less than two weeks for everything). I
> > know there is talk about getting a rsync setup for images.
>
> Ouch, 2 weeks. We need the images to be replicable too though.
>
>
> sincerely,
>Kim Bruning
>
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>


Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Kim Bruning
On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
> Except for files, getting a content clone up is relativity easy, and can be
> done in a fairly quick order (aka less than two weeks for everything). I
> know there is talk about getting a rsync setup for images.

Ouch, 2 weeks. We need the images to be replicable too though. 


sincerely,
Kim Bruning




Re: [Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread John
Except for files, getting a content clone up is relatively easy, and can be
done fairly quickly (i.e., less than two weeks for everything). I know
there is talk about getting an rsync setup for images.


[Wikimedia-l] Fire Drill Re: Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

2012-05-16 Thread Kim Bruning
On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
> I know from experience that a wiki can be re-built from any one of the
> dumps that are provided, (pages-meta-current) for example contains
> everything needed to reboot a site except its user database
> (names/passwords ect). see
> http://www.mediawiki.org/wiki/Manual:Moving_a_wiki


Sure. Does this include all images, including commons images, eventually
converted to operate locally?

I'm thinking about a full snapshot-and-later-restore, say 25 or 50 years
from now, or in an academic setting, or (FSM forbid) in a worst-case
scenario. That's what the AT folks are most interested in.

==Fire Drill==
Has anyone recently set up a full-external-duplicate of (for instance) en.wp?
This includes all images, all discussions, and all page history (excepting the
user accounts and deleted pages).

This would be a useful and important exercise, possibly to be repeated once per
year.

I get a sneaking feeling that the first few iterations won't go so well.

I'm sure AT would be glad to help out with the running of these fire drills, as
it seems to be in line with their mission.

sincerely,
Kim Bruning
