So... this seems to have snuck back in a month ago:
https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
Have we resolved the deployment questions on how to actually do the change?
Just want to make sure ops has plenty of warning.
On 28/11/11 08:29, Brion Vibber wrote:
So... this seems to have snuck back in a month ago:
https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
I don't think it really snuck; Rob has been talking about it for a
while, see e.g.
I have no idea about the schema changes, but choosing a digest for
detecting identity reverts is pretty simple. The really difficult
part is choosing a locality-sensitive hash or fingerprint that works
for very similar revisions with a lot of content.
I would propose that the digest is stored
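Concretely, detecting an identity revert needs nothing more than an exact
digest match. A minimal sketch in PHP (illustrative function names, not
MediaWiki's API; assumes the GMP extension for the hex-to-base-36
re-encoding discussed elsewhere in the thread):

<?php
// Digest a revision's text: SHA-1 as 40 hex chars, re-encoded in base 36
// (at most 31 chars), similar to the scheme MediaWiki already uses for
// stored SHA-1 hashes. Function names here are hypothetical.
function revisionDigest( $text ) {
    return gmp_strval( gmp_init( sha1( $text ), 16 ), 36 );
}

// An identity revert is a new revision whose digest equals some earlier
// revision's digest on the same page.
function isIdentityRevert( $newText, array $priorDigests ) {
    return in_array( revisionDigest( $newText ), $priorDigests, true );
}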
Brion Vibber wrote:
2) Checksums would be of fairly obvious benefit to verifying text storage
integrity within MediaWiki's own databases (though perhaps best sitting on
or keyed to the text table...?) Default installs tend to use simple
plain-text or gzipped storage, but big installs like
On 19/09/11 20:11, Greg Sabino Mullane wrote:
supported by many OSs, libraries, etc. As far as collisions, there are
plenty of easy solutions, such as:
* Check for a collision before allowing a new revision, and do something
if so (to handle the pre-image attack)
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
finally, "we don't need an index on it" becomes "we need an index on it",
and storage efficiency becomes much more interesting (binary packing, yay ;-)
so, what are the use
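A sketch of what Greg's count check could look like, with PDO standing in
for MediaWiki's database layer; the rev_md5 column name is hypothetical
(the thread later leans toward SHA-1), and scoping by page is added here
for illustration:

<?php
// $pdo is an already-open PDO connection (assumed). Count revisions of
// this page that share the new revision's digest.
$stmt = $pdo->prepare(
    'SELECT COUNT(*) FROM revision WHERE rev_page = ? AND rev_md5 = ?'
);
$stmt->execute( array( $pageId, $digest ) );
if ( (int)$stmt->fetchColumn() > 1 ) {
    // More than one match: don't trust the hash alone, fall back to
    // comparing the full revision texts.
}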
On 11-09-19 11:43 PM, Domas Mituzas wrote:
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
finally, "we don't need an index on it" becomes "we need an index on it",
and storage efficiency becomes much more interesting
On Mon, Sep 19, 2011 at 10:39 PM, Daniel Friesen
li...@nadir-seen-fire.com wrote:
On 11-09-19 06:39 PM, Anthony wrote:
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber br...@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very
quick. It may make it slower
Ah, okay. I remember that's what happened in MyISAM but I figured
they had that fixed in InnoDB.
InnoDB has an optimized path for index builds, not for schema changes.
Domas
On Tue, Sep 20, 2011 at 9:34 AM, Domas Mituzas midom.li...@gmail.com wrote:
Ah, okay. I remember that's what happened in MyISAM but I figured
they had that fixed in InnoDB.
InnoDB has an optimized path for index builds, not for schema changes.
No support for built-in function-based indexes,
On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman afeldman [at] wikimedia wrote:
Since the primary use case here seems to be offline analysis and it may
not be of much interest to mediawiki users outside of wmf, can we store the
checksums in new tables (i.e. revision_sha1) instead of running
Domas Mituzas wrote:
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
finally, "we don't need an index on it" becomes "we need an index on it",
and storage efficiency becomes much more interesting (binary packing, yay ;-)
On 11-09-20 02:26 PM, Platonides wrote:
Domas Mituzas wrote:
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
finally, "we don't need an index on it" becomes "we need an index on it",
and storage efficiency becomes much more
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon happy-me...@live.com wrote:
It may or may not be an architecturally-better design to have it as a
separate table, although considering how rapidly MW's 'architecture' changes
I'd say keeping things as simple as possible is probably a virtue. But
On Tue, Sep 20, 2011 at 5:36 PM, Anthony wikim...@inbox.org wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon happy-me...@live.com wrote:
It may or may not be an architecturally-better design to have it as a
separate table, although considering how rapidly MW's 'architecture' changes
Thanks for the explanation. I guess I see what you're getting at now.
Sorry I didn't see it sooner.
On Tue, Sep 20, 2011 at 8:50 PM, Brion Vibber br...@pobox.com wrote:
On Tue, Sep 20, 2011 at 5:36 PM, Anthony wikim...@inbox.org wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon
Some use cases:
* Dump validation (per Ariel)
* Revert detection
* Collapsing reversions in history to hide clutter
* Replacing/augmenting baseRevId hacks in FlaggedRevs
Domas Mituzas wrote:
* When reverting, do a select count(*) where md5=? and then do something
more advanced when
I ran some benchmarks on one of the WMF machines. The input I used is
a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
upload to Commons recently. For each benchmark, I hashed the file 25
times and computed the average running time.
MD5: 393 ms
SHA-1: 404 ms
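For reference, a rough re-creation of that benchmark loop (the file path
is a placeholder):

<?php
// Hash one large file 25 times per algorithm and print the average
// wall-clock time in milliseconds.
foreach ( array( 'md5', 'sha1' ) as $algo ) {
    $start = microtime( true );
    for ( $i = 0; $i < 25; $i++ ) {
        hash_file( $algo, 'large-upload.ogv' );
    }
    $avgMs = ( microtime( true ) - $start ) / 25 * 1000;
    printf( "%s: %.0f ms\n", strtoupper( $algo ), $avgMs );
}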
On Mon, Sep 19, 2011 at 11:11 AM, Greg Sabino Mullane g...@endpoint.com wrote:
I'm also of the opinion that we should just store things as CHAR(32),
unless someone thinks space is really at that much of a premium. The big
advantage of 32 chars (i.e. 0-9a-f, aka hexadecimal) is that it's a
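To make the size trade-off Greg mentions concrete, the same MD5 digest as
a 32-char hex string (a CHAR(32) column) versus raw bytes (BINARY(16)),
sketched in PHP:

<?php
$hex = md5( 'revision text' );        // 32 hex characters, human-readable
$bin = md5( 'revision text', true );  // 16 raw bytes, half the storage
var_dump( strlen( $hex ) );           // int(32)
var_dump( strlen( $bin ) );           // int(16)
var_dump( bin2hex( $bin ) === $hex ); // bool(true)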
On Mon, Sep 19, 2011 at 2:11 PM, Greg Sabino Mullane g...@endpoint.com wrote:
* When reverting, do a select count(*) where md5=? and then do something
more advanced when more than one match is found
That sounds very painful :(
-Chad
[snip]
So just FYI -- the only *actual* controversy that needs to be discussed in
this thread is: how do we make this update applicable in a way that doesn't
disrupt live sites with many millions of pages?
We're pretty fixed on SHA-1 as a checksum sig (already using it elsewhere)
and have no
Since the primary use case here seems to be offline analysis and it may not
be of much interest to mediawiki users outside of wmf, can we store the
checksums in new tables (i.e. revision_sha1) instead of running large
alters, and implement the code to generate checksums on new edits via an
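A hedged sketch of that side-table idea; the table and column names are
illustrative, not what was eventually deployed, and PDO stands in for
MediaWiki's database layer:

<?php
// One row per revision, keyed to revision.rev_id, so the huge revision
// table itself is never ALTERed.
$pdo->exec(
    'CREATE TABLE revision_sha1 (
        rs_rev_id INT UNSIGNED NOT NULL PRIMARY KEY,
        rs_sha1   VARBINARY(32) NOT NULL
    )'
);
// Populate on each new edit; old rows can be backfilled offline.
$ins = $pdo->prepare(
    'INSERT INTO revision_sha1 (rs_rev_id, rs_sha1) VALUES (?, ?)'
);
$ins->execute( array( $revId, $digest ) );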
On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman afeld...@wikimedia.org wrote:
Since the primary use case here seems to be offline analysis and it may not
be of much interest to mediawiki users outside of wmf, can we store the
checksums in new tables (i.e. revision_sha1) instead of running large
On 11-09-19 12:57 PM, Brion Vibber wrote:
On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman afeld...@wikimedia.org wrote:
Since the primary use case here seems to be offline analysis and it may not
be of much interest to mediawiki users outside of wmf, can we store the
checksums in new tables
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber br...@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very
quick. It may make it slower to use the field though, depending on what all
uses/exposes it.
Isn't adding a new column with all NULL values quick
On 11-09-19 06:39 PM, Anthony wrote:
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber br...@pobox.com wrote:
That's probably the simplest solution; adding a new empty table will be very
quick. It may make it slower to use the field though, depending on what all
uses/exposes it.
Isn't adding a
On Sat, 17 Sep 2011 at 22:55 -0700, Robert Rohde wrote:
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikim...@inbox.org wrote:
snip
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Sent: Sun, Sep 18, 2011 05:56:15 GMT+00:00
Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikim...@inbox.org wrote:
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde raro...@gmail.com
On 09/18/2011 08:55 AM, Robert Rohde wrote:
people find ways to improve the attacks on SHA-1. (The existing
attacks usually require the ability to feed arbitrary binary strings
into the hash function. Given that both browsers and Mediawiki will
tend to reject binary data placed in an edit
What is the threat?
Sent from my Verizon Wireless Phone
-Original message-
From: Ilmari Karonen nos...@vyznev.net
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Sent: Sun, Sep 18, 2011 20:20:34 GMT+00:00
Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
rnnel...@clarkson.edu wrote:
It is meaningless to talk about cryptography without a threat model, just as
Robert says. Is anybody actually attacking us?
You mean, like Grawp?
On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn ar...@wikimedia.org wrote:
On Sat, 17 Sep 2011 at 22:55 -0700, Robert Rohde wrote:
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikim...@inbox.org wrote:
snip
For offline analyses, there's no need to change the online database
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
rnnel...@clarkson.edu wrote:
It is meaningless to talk about cryptography without a threat model, just as
Robert says. Is anybody actually attacking us? Or are we worried about
accidental collisions?
I believe it began as
On Sun, Sep 18, 2011 at 5:30 PM, Chad innocentkil...@gmail.com wrote:
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
rnnel...@clarkson.edu wrote:
It is meaningless to talk about cryptography without a threat model, just as
Robert says. Is anybody actually attacking us? Or are
On Sun, Sep 18, 2011 at 5:47 PM, Anthony wikim...@inbox.org wrote:
On Sun, Sep 18, 2011 at 5:30 PM, Chad innocentkil...@gmail.com wrote:
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
rnnel...@clarkson.edu wrote:
It is meaningless to talk about cryptography without a threat
Chad wrote:
For those of us who do not know...what the heck is a Grawp attack?
Does it involve generating hash collisions?
-Chad
It's the name of a wikipedia vandal.
http://en.wikipedia.org/wiki/User:Grawp
On Sun, Sep 18, 2011 at 5:50 PM, Chad innocentkil...@gmail.com wrote:
On Sun, Sep 18, 2011 at 5:47 PM, Anthony wikim...@inbox.org wrote:
On Sun, Sep 18, 2011 at 5:30 PM, Chad innocentkil...@gmail.com wrote:
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
rnnel...@clarkson.edu
On Sun, Sep 18, 2011 at 6:01 PM, Anthony wikim...@inbox.org wrote:
There's also a
description at http://en.wikipedia.org/wiki/User:Grawp , which does
not do justice to the mad hacker skillz of this individual and his
intent on finding bugs in mediawiki and exploiting them.
(and/or the Grawp
Anthony wrote:
It does not involve generating hash collisions, but it involves
finding various bugs in mediawiki and using them to vandalise, often
by injecting javascript. The best description I could find was at
Encyclopedia Dramatica, which seems to be taken down (there's a cache
if you
On Sun, Sep 18, 2011 at 7:20 PM, Anthony wikim...@inbox.org wrote:
On Sun, Sep 18, 2011 at 7:07 PM, bawolff bawolff...@gmail.com wrote:
Anthony wrote:
The pages you link to seem to indicate he's nothing more than a
willy-on-wheels type vandal, who at worst tricked an admin into doing
a delete
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries m...@tgries.de wrote:
Was there a certain reason to choose base 36?
Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base
62 if you can use base 64? Encoders/decoders for
Roan Kattouw wrote:
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Griesm...@tgries.de wrote:
Was there a certain reason to choose base 36?
Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base
62 if you can use base
On Sat, Sep 17, 2011 at 8:26 AM, Roan Kattouw roan.katt...@gmail.com wrote:
Minor detail: I think it's more likely we'll use SHA-1 hashes rather
than MD5 hashes.
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with
considerable effort), but I
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde raro...@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with
considerable effort)
Considerable effort? I can create an MD5 collision in a few minutes
on my home computer. Is
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikim...@inbox.org wrote:
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde raro...@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with
considerable effort)
Considerable effort? I can
RE:
http://www.mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_of_page_text#Field_type
Recently, "Adding MD5 / SHA1 column to revision table (discussing r94289)"
was discussed.
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
On Fri, Sep 16, 2011 at 8:15 AM, Thomas Gries m...@tgries.de wrote:
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
as base-62 character strings
instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar
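To put numbers on the encoding question: a 160-bit SHA-1 digest is 40 hex
characters, at most 31 characters in base 36, and at most 27 in base 62.
A sketch using PHP's GMP extension, which converts between bases 2 and 62:

<?php
$hex = sha1( 'example revision text' );        // 40 chars
$b36 = gmp_strval( gmp_init( $hex, 16 ), 36 ); // <= 31 chars
$b62 = gmp_strval( gmp_init( $hex, 16 ), 62 ); // <= 27 chars
printf( "hex: %d, base36: %d, base62: %d\n",
    strlen( $hex ), strlen( $b36 ), strlen( $b62 ) );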
On 16.09.2011 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
as base-62 character strings
instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar technique, storing SHA-1 hashes
On 9/16/11 9:48 AM, Thomas Gries wrote:
On 16.09.2011 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
as base-62 character strings
instead of hexadecimal (base-16) strings.
MediaWiki already uses
On 11-09-16 09:48 AM, Thomas Gries wrote:
On 16.09.2011 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
as base-62 character strings
instead of hexadecimal (base-16) strings.
MediaWiki already uses a
On Fri, Sep 16, 2011 at 9:48 AM, Thomas Gries m...@tgries.de wrote:
On 16.09.2011 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits
of MD5 or other checksums
as base-62 character strings
instead of hexadecimal (base-16)
2011/9/4 MZMcBride z...@mzmcbride.com
Diederik van Liere wrote:
I've suggested generating bulk checksums as well, but both Brion and Ariel
see the primary purpose of this field as checking the validity of the
dump-generation process, and so they want to generate the checksums straight from
On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier ro...@wikimedia.org wrote:
I generally suspect that a standard index is going to be a waste for
the most urgent uses of this. It will rarely be interesting to search
for common hashes between articles. The far more common case will be
to search
Thanks for moving the page.
Diederik
On 2011-09-04, at 3:29 PM, Krinkle wrote:
2011/9/4 MZMcBride z...@mzmcbride.com
Diederik van Liere wrote:
I've suggested generating bulk checksums as well, but both Brion and Ariel
see the primary purpose of this field as checking the validity of the dump
Hi,
I've suggested generating bulk checksums as well, but both Brion and Ariel
see the primary purpose of this field as checking the validity of the
dump-generation process, and so they want to generate the checksums straight
from the external storage.
In a general sense, there are two use cases
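A minimal sketch of the dump-validation case, assuming an illustrative
iterable of (rev_id, text, stored digest) rows pulled from a dump, and the
base-36 SHA-1 scheme mentioned elsewhere in the thread:

<?php
// Recompute each revision's digest from the dump text and compare it with
// the stored value; a mismatch means dump or storage corruption.
foreach ( $revisions as $row ) {
    list( $revId, $text, $storedSha1 ) = $row;
    $computed = gmp_strval( gmp_init( sha1( $text ), 16 ), 36 );
    if ( $computed !== $storedSha1 ) {
        echo "Mismatch in revision $revId\n";
    }
}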
On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman afeld...@wikimedia.org wrote:
Is code written to populate rev_sha1 on each new edit?
I believe that was part of Aaron's code that got reverted, yes.
Offline generation of hashes is definitely possible, but the only
reason you'd do it is to minimize
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere dvanli...@gmail.com wrote:
Hi!
I am starting this thread because Brion reverted revision r94289 [0],
stating "core schema change with no discussion" [1].
Bumping this: What are the remaining open questions regarding this
schema change?
Would it be possible to generate offline hashes for the bulk of our revision
corpus via dumps and load that into prod to minimize the time and impact of
the backfill?
When using for analysis, will we wish the new columns had partial indexes
(first 6 characters)?
Is code written to populate
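A sketch of what such a partial index could look like in MySQL (names are
illustrative, and whether six bytes is the right prefix length is exactly
the open question):

<?php
// Index only the first 6 bytes of the hash column: far smaller than a
// full-value index, yet selective enough for equality lookups, which
// MySQL resolves via the prefix and then verifies against the full value.
$pdo->exec( 'CREATE INDEX rev_sha1_prefix ON revision ( rev_sha1(6) )' );

$stmt = $pdo->prepare( 'SELECT rev_id FROM revision WHERE rev_sha1 = ?' );
$stmt->execute( array( $digest ) );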
Bug 2939 is one bug relevant to this; it could probably use an index.
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On 11-09-02 05:20 PM, Asher Feldman wrote:
Would it be possible to generate offline hashes for
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
li...@nadir-seen-fire.com wrote:
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes
(first 6 characters)?
Bug 2939 is one bug relevant to this; it could probably use an index.
[1]
On 11-09-02 09:33 PM, Rob Lanphier wrote:
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
li...@nadir-seen-fire.com wrote:
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes
(first 6 characters)?
Bug 2939 is one bug relevant to
Hi!
I am starting this thread because Brion reverted revision r94289 [0],
stating "core schema change with no discussion" [1].
Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash
column (either MD5 or SHA-1) in the revision table. The primary use
case of this column will be to