Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-14 Thread Daniel Kinzler
On 14.07.2016 at 06:54, MZMcBride wrote:
> I just read some chatter about slots and multiplexing(?). It seems vaguely
> interesting, but I don't have enough context or knowledge to understand
> much of the discussion currently. Is there a request for comments page or
> some kind of documentation that defines and explains these concepts?

Currently, there is only .
I plan to move it to a wiki page and update it with the current draft and open
questions from the various discussions.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-13 Thread MZMcBride
Daniel Kinzler wrote:
>What we still need to figure out is how to solve the chicken-and-egg
>situation with Multi-Content-Rev. At the moment, I'm thinking this might
>work:
>
>* introduce content model (and format) registry in the DB, and populate
>  it.
>* leave page and revision table as they are for now.
>* introduce slots table, use the new content_model (and content_format)
>  table.
>* stop using the content model (and format) from the page and revision
>  tables
>* drop the content model (and format) from the page and revision tables
>
>Does that sound like a good plan? Let's for a moment assume we can get
>slots fully rolled out by the end of the year.

I just read some chatter about slots and multiplexing(?). It seems vaguely
interesting, but I don't have enough context or knowledge to understand
much of the discussion currently. Is there a request for comments page or
some kind of documentation that defines and explains these concepts?

MZMcBride




Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-13 Thread Matthew Flaschen



On 07/11/2016 10:10 AM, Brad Jorsch (Anomie) wrote:

On Mon, Jul 11, 2016 at 8:07 AM, Daniel Kinzler 

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Brion Vibber
On Tuesday, July 12, 2016, Daniel Kinzler wrote:

> On 12.07.2016 at 18:00, Rob Lanphier wrote:
> > On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote:
> >> The original design of ContentHandler used integer IDs for content models
> >> and formats in the DB. A mapping to human readable names is only needed
> >> for logging and error messages anyway.
> >
> > This oversimplifies things greatly.  Integer IDs need to be mapped to some
> > well-specified, non-local (global?) identifier for many many purposes
> > (reading exports, writing exports, reading site content, displaying site
> > content for many contexts, etc)
>
> Yea, sorry. That we only need this for logging is what I assumed back then.
> Not exposing the numeric ID at all, and using the canonical name in dumps,
> the API, etc, avoids a lot of trouble (but doesn't come free).


Yes, numeric IDs are internal and ideally never exposed. We should've
done the same with namespaces but got dragged into compat hell. :)


>
> > We need to put a lot of thought into content model management generally.
> > This statement implies managing content models outside of the database is
> > easy.
>
> Well, it's the same as namespaces: they are easy to set up, but also too
> easy to change, so it's easy to create a mess...
>
> As explained in my earlier response, I now realized that content models
> differ from namespaces in that they are not really configured by people,
> but rather registered by extensions. That makes it a lot less awkward to
> have them in the database. We still have to agree on a good trigger for
> the registration, but it doesn't seem to be a tricky issue.


Yeah, an auto-insert if needed is good in theory, though I worry about write
contention on the central mapping table. If no write locks are kept in the
common case where no insertion is needed, then I think the ideas proposed
should work.
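
One way to keep the common case free of writes is to put a small per-process cache in front of the mapping table, so that the table is only written the first time a model name is seen. The following is a minimal sketch under stated assumptions, not how MediaWiki does it: the table name `content_models`, its columns, and the `ModelRegistry` class are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


class ModelRegistry:
    """Hypothetical name -> ID lookup that writes only when a name is new."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}   # per-process cache: model name -> local numeric ID
        self.writes = 0   # counts how often the mapping table was written

    def _select(self, name):
        return self.conn.execute(
            "SELECT model_id FROM content_models WHERE model_name = ?", (name,)
        ).fetchone()

    def id_for(self, name):
        if name in self.cache:
            return self.cache[name]          # no database access at all
        row = self._select(name)             # plain read, no write lock
        if row is None:
            # Rare path: first time this model name is seen on this wiki.
            self.conn.execute(
                "INSERT OR IGNORE INTO content_models (model_name) VALUES (?)",
                (name,),
            )
            self.conn.commit()
            self.writes += 1
            row = self._select(name)
        self.cache[name] = row[0]
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_models ("
    " model_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " model_name TEXT NOT NULL UNIQUE)"
)
registry = ModelRegistry(conn)
ids = [registry.id_for("wikitext") for _ in range(1000)]
```

In this sketch a thousand lookups of the same name cause a single insert; every later lookup is served from the cache or by a read-only query.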


>
> What we still need to figure out is how to solve the chicken-and-egg
> situation with Multi-Content-Rev. At the moment, I'm thinking this might
> work:
>
> * introduce content model (and format) registry in the DB, and populate it.
> * leave page and revision table as they are for now.
> * introduce slots table, use the new content_model (and content_format)
>   table.
> * stop using the content model (and format) from the page and revision
>   tables
> * drop the content model (and format) from the page and revision tables
>
> Does that sound like a good plan? Let's for a moment assume we can get
> slots fully rolled out by the end of the year.


This sounds good to me - lets us introduce a more space efficient model
mapping and drop the extra fields from page and rev later.

-- brion



Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
On 12.07.2016 at 21:02, Rob Lanphier wrote:
> On Tue, Jul 12, 2016 at 8:02 AM, Brad Jorsch (Anomie)
>  wrote:
>> One simple method: assign the numeric IDs by making the numeric ID column
>> auto-increment, and insert the model strings into the table as needed.
>> PageAssessments uses this model for tracking its project tags.[1]
>>
>> The disadvantage is that there wouldn't be any cross-wiki mapping between
>> model names and ids, which can be mitigated somewhat by never exposing the
>> ids externally.
> 
> Could you explain this idea in a way that doesn't require diving into
> the codebase to figure out what you mean?  Cloaking the mapping of
> local ids (e.g. auto incremented in the DB) to global ids ("model
> names") seems to suggest a new way of making our system behave in an
> inscrutable way.

The idea is that in API responses (and requests), in XML dumps, etc, the content
model for wikitext will be represented as the string "wikitext", even if the
internal ID is 1 in the database of one wiki, and 37 on another. Clients have to
know the canonical names; they are not concerned with the internal IDs, which
are considered an internal optimization, an implementation detail.
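
As a rough illustration of that boundary, the sketch below moves a revision between two wikis whose local IDs differ. Everything in it is hypothetical: the dictionaries, the function names, and the `contentmodel` key are made up for the example, and only the values 1 and 37 come from the text above.

```python
# Local ID -> canonical name, as each wiki happened to assign them.
WIKI_A = {1: "wikitext", 2: "json"}
WIKI_B = {37: "wikitext", 5: "json"}


def export_revision(local_models, rev):
    """Replace the internal model ID with the canonical name before exposing it."""
    return {
        "rev_id": rev["rev_id"],
        "contentmodel": local_models[rev["model_id"]],
    }


def import_revision(local_models, exported):
    """Map the canonical name back to whatever ID the importing wiki uses."""
    name_to_id = {name: model_id for model_id, name in local_models.items()}
    return {
        "rev_id": exported["rev_id"],
        "model_id": name_to_id[exported["contentmodel"]],
    }


exported = export_revision(WIKI_A, {"rev_id": 10, "model_id": 1})
imported = import_revision(WIKI_B, exported)
```

The exported record carries only the string "wikitext"; the importing wiki translates it to its own ID without ever seeing the ID used by the exporting wiki.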


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Rob Lanphier
On Tue, Jul 12, 2016 at 8:02 AM, Brad Jorsch (Anomie)
 wrote:
> One simple method: assign the numeric IDs by making the numeric ID column
> auto-increment, and insert the model strings into the table as needed.
> PageAssessments uses this model for tracking its project tags.[1]
>
> The disadvantage is that there wouldn't be any cross-wiki mapping between
> model names and ids, which can be mitigated somewhat by never exposing the
> ids externally.

Could you explain this idea in a way that doesn't require diving into
the codebase to figure out what you mean?  Cloaking the mapping of
local ids (e.g. auto incremented in the DB) to global ids ("model
names") seems to suggest a new way of making our system behave in an
inscrutable way.

On Tue, Jul 12, 2016 at 9:00 AM, Brad Jorsch (Anomie)
 wrote:
>  [Does this namespace registry idea work?]
>
> https://www.mediawiki.org/wiki/Extension_default_namespaces?

That doesn't seem like a good model to emulate.  We're not iana.org,
and we don't have anywhere near the rigor defined in IETF RFC 5226.  I
may put further thoughts on this topic in the Interwiki map RFC
task (T113034).

Rob


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
On 12.07.2016 at 18:00, Rob Lanphier wrote:
> On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler wrote:
>> The original design of ContentHandler used integer IDs for content models
>> and formats in the DB. A mapping to human readable names is only needed
>> for logging and error messages anyway.
> 
> This oversimplifies things greatly.  Integer IDs need to be mapped to some
> well-specified, non-local (global?) identifier for many many purposes
> (reading exports, writing exports, reading site content, displaying site
> content for many contexts, etc)

Yea, sorry. That we only need this for logging is what I assumed back then. Not
exposing the numeric ID at all, and using the canonical name in dumps, the API,
etc, avoids a lot of trouble (but doesn't come free).

> We need to put a lot of thought into content model management generally.
> This statement implies managing content models outside of the database is
> easy.

Well, it's the same as namespaces: they are easy to set up, but also too easy to
change, so it's easy to create a mess...

As explained in my earlier response, I now realized that content models differ
from namespaces in that they are not really configured by people, but rather
registered by extensions. That makes it a lot less awkward to have them in the
database. We still have to agree on a good trigger for the registration, but it
doesn't seem to be a tricky issue.

What we still need to figure out is how to solve the chicken-and-egg situation
with Multi-Content-Rev. At the moment, I'm thinking this might work:

* introduce content model (and format) registry in the DB, and populate it.
* leave page and revision table as they are for now.
* introduce slots table, use the new content_model (and content_format) table.
* stop using the content model (and format) from the page and revision tables
* drop the content model (and format) from the page and revision tables

Does that sound like a good plan? Let's for a moment assume we can get slots
fully rolled out by the end of the year.
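
To make the steps above easier to picture, here is a minimal schema sketch using SQLite from Python. It is an illustration under assumptions, not the proposed MediaWiki schema: the table and column names (`content_models`, `slots`, `slot_role`, `slot_model_id`) and the `slot_model` helper are hypothetical, and the revision table is reduced to two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: registry of content models, keyed by a compact local ID.
CREATE TABLE content_models (
    model_id   INTEGER PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE
);
-- Step 2: the revision table keeps its legacy column for now.
CREATE TABLE revision (
    rev_id            INTEGER PRIMARY KEY,
    rev_content_model TEXT      -- legacy, may be NULL, dropped in the last step
);
-- Step 3: slots reference the registry instead of repeating the name.
CREATE TABLE slots (
    slot_rev_id   INTEGER NOT NULL REFERENCES revision(rev_id),
    slot_role     TEXT    NOT NULL,
    slot_model_id INTEGER NOT NULL REFERENCES content_models(model_id),
    PRIMARY KEY (slot_rev_id, slot_role)
);
INSERT INTO content_models VALUES (1, 'wikitext');
INSERT INTO revision VALUES (100, NULL);
INSERT INTO slots VALUES (100, 'main', 1);
""")


def slot_model(conn, rev_id, role="main"):
    """Step 4: read the model from the slot and ignore revision.rev_content_model."""
    row = conn.execute(
        "SELECT m.model_name FROM slots s"
        " JOIN content_models m ON m.model_id = s.slot_model_id"
        " WHERE s.slot_rev_id = ? AND s.slot_role = ?",
        (rev_id, role),
    ).fetchone()
    return row[0] if row else None
```

Once every reader goes through something like `slot_model`, the legacy column is no longer consulted and can be dropped, which is the last step of the plan.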

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Brad Jorsch (Anomie)
On Tue, Jul 12, 2016 at 11:47 AM, Daniel Kinzler <
daniel.kinz...@wikimedia.de> wrote:

> On 12.07.2016 at 17:02, Brad Jorsch (Anomie) wrote:
> > On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote:
> >
> > One simple method: assign the numeric IDs by making the numeric ID column
> > auto-increment, and insert the model strings into the table as needed.
>
> When exactly? When update.php runs? Should work fine, but I'd like a nice
> interface that extensions can use for this. Or should we check and
> auto-insert on every page edit?
>

The linked example is inserting (if necessary) on every page edit. The
check part needs to happen on every edit anyway because it needs to fetch
the ID for the name.

update.php would work too as long as things blow up clearly when someone
didn't run update.php recently enough. That could also allow us to let the
extension suggest an ID, so the registrar would only have to assign a
"random" ID in case of a conflict.
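
A sketch of the update.php variant follows, assuming registration happens in one batch and that lookups fail loudly for unregistered models. The function names and the exception are hypothetical, and the conflict rule used here (take the next free ID rather than a random one) is a simplification chosen for the example.

```python
class UnregisteredModelError(LookupError):
    """Raised when a model is used before the updater has registered it."""


def register_models(registry, requests):
    """Register (name, suggested_id) pairs into registry, a dict of name -> ID.

    A suggested ID is honoured when it is free; on a conflict, or when no ID
    is suggested, the next unused ID is assigned instead.
    """
    used = set(registry.values())
    for name, suggested in requests:
        if name in registry:
            continue  # already registered, keep the existing ID
        if suggested is not None and suggested not in used:
            new_id = suggested
        else:
            new_id = max(used, default=0) + 1
        registry[name] = new_id
        used.add(new_id)
    return registry


def lookup(registry, name):
    """Fail clearly instead of guessing when the updater was not run."""
    try:
        return registry[name]
    except KeyError:
        raise UnregisteredModelError(
            f"content model {name!r} is not registered; run the updater"
        ) from None


registry = register_models(
    {"wikitext": 1},
    [("json", 2), ("wikibase-item", 2), ("css", None)],
)
```

Here "json" gets its suggested ID 2, "wikibase-item" also asks for 2 and is moved to 3, and "css" takes the next free ID.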


> > Does the registry idea work all that smoothly for namespaces, though?
>
> I don't think it was ever really tried for namespaces. But it's not a
> perfect solution. Just a possibility.
>

https://www.mediawiki.org/wiki/Extension_default_namespaces?


-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Rob Lanphier
On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler  wrote:

> Do we really want to manage something that is essentially configuration,
> namely the set of available content models and formats, in a database
> table? How is it maintained?
>
> For context:
> * As per T113034, we are moving away from managing interwiki prefixes in
>   the database, in favor of configuration files.
> * Namespace IDs are defined in LocalSettings.php.
>
> The original design of ContentHandler used integer IDs for content models
> and formats in the DB. A mapping to human readable names is only needed
> for logging and error messages anyway.


This oversimplifies things greatly.  Integer IDs need to be mapped to some
well-specified, non-local (global?) identifier for many many purposes
(reading exports, writing exports, reading site content, displaying site
content for many contexts, etc)

As Jaime points out, we don't want or need 6 billion copies of the same
identifier in our database.  However, relegating that information to
LocalSettings.php means that we'll have to manually sync that critical
configuration data for use by non-PHP implementations interacting with the
information.

On Tue, Jul 12, 2016 at 3:40 AM, Daniel Kinzler  wrote:
>
> I'm fine with the DB based solution, if we have decent tooling for
> extensions to register their content models, etc.


We need to put a lot of thought into content model management generally.
This statement implies managing content models outside of the database is
easy.

Rob




Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
On 12.07.2016 at 17:02, Brad Jorsch (Anomie) wrote:
> On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler 
> One simple method: assign the numeric IDs by making the numeric ID column
> auto-increment, and insert the model strings into the table as needed.

When exactly? When update.php runs? Should work fine, but I'd like a nice
interface that extensions can use for this. Or should we check and auto-insert
on every page edit?

To answer my own question about config in the database: unlike interwiki/sites
and namespaces, this isn't really configuration, it's a registry used by
extensions. Users may freely define namespaces for their wiki, but they can't
freely define content models.

> The disadvantage is that there wouldn't be any cross-wiki mapping between
> model names and ids, which can be mitigated somewhat by never exposing the
> ids externally.

Yes, we should definitely not expose those!

> Does the registry idea work all that smoothly for namespaces, though?

I don't think it was ever really tried for namespaces. But it's not a perfect
solution. Just a possibility.


-- daniel


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Brad Jorsch (Anomie)
On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler  wrote:

> Do we really want to manage something that is essentially configuration,
> namely the set of available content models and formats, in a database
> table? How is it maintained?
>

One simple method: assign the numeric IDs by making the numeric ID column
auto-increment, and insert the model strings into the table as needed.
PageAssessments uses this model for tracking its project tags.[1]

The disadvantage is that there wouldn't be any cross-wiki mapping between
model names and ids, which can be mitigated somewhat by never exposing the
ids externally.
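
The general shape of that approach can be sketched as follows. This is not the PageAssessments code referenced in [1]; it is a minimal illustration using SQLite from Python, and the table name `content_models`, its columns, and the `model_id` function are assumptions made for the example.

```python
import sqlite3


def model_id(conn, name):
    """Return the local numeric ID for a model name, inserting it if missing."""
    select = "SELECT model_id FROM content_models WHERE model_name = ?"
    row = conn.execute(select, (name,)).fetchone()
    if row is not None:
        return row[0]  # common case: the name is already registered
    # INSERT OR IGNORE tolerates another writer registering the name first.
    conn.execute(
        "INSERT OR IGNORE INTO content_models (model_name) VALUES (?)", (name,)
    )
    conn.commit()
    return conn.execute(select, (name,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_models ("
    " model_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # the database picks the ID
    " model_name TEXT NOT NULL UNIQUE)"
)
wikitext_id = model_id(conn, "wikitext")
json_id = model_id(conn, "json")
```

Because each wiki's database hands out IDs in the order names first appear, the same name can map to different IDs on different wikis, which is the disadvantage described above.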

 [1]:
https://phabricator.wikimedia.org/diffusion/EPAS/browse/master/PageAssessmentsBody.php;c7b21e97f650face3a257ab70763a5abad420992$41-44

>
> Such a mapping could be maintained in LocalSettings.php, just like we do for
> namespaces. This would also serve to avoid ID clashes. My idea back then
> was to have a sort of registry on mediawiki.org where extensions could
> reserve an ID for themselves, so that the same ID would stand for the same
> model everywhere.
>

Does the registry idea work all that smoothly for namespaces, though?


-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
On 12.07.2016 at 13:23, Jaime Crespo wrote:
> On Tue, Jul 12, 2016 at 12:40 PM, Daniel Kinzler
>  wrote:
>> Yea, still something we need to figure out :)
> 
>> That was, if I remember correctly, one of the arguments for using readable
>> strings there, instead of int values and a config variable, as I originally
>> proposed. This was discussed at the last Berlin hackathon, must have been
>> 2012. Tim may remember more details. We should probably re-consider the
>> pros and cons we discussed back then when planning to change the schema now.
> 
> But that was already re-reviewed, discussed, and approved by Tim
> himself (among others) in 2015:
> .


Yes, I saw that. And I'm happy about it! But the aspect of maintenance and
tooling seems to be completely absent from the discussion and proposal. From a
DB perspective, it looks fine. I just feel it is missing a few crucial bits.
Like, how does anything ever get into these tables?

-- daniel


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Jaime Crespo
On Tue, Jul 12, 2016 at 12:40 PM, Daniel Kinzler
 wrote:
> Yea, still something we need to figure out :)

> That was, if I remember correctly, one of the arguments for using readable
> strings there, instead of int values and a config variable, as I originally
> proposed. This was discussed at the last Berlin hackathon, must have been
> 2012. Tim may remember more details. We should probably re-consider the
> pros and cons we discussed back then when planning to change the schema now.

But that was already re-reviewed, discussed, and approved by Tim
himself (among others) in 2015:
.

-- 
Jaime Crespo



Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
On 12.07.2016 at 12:25, Jaime Crespo wrote:
> Your last question is a non issue for me- I do not care if things are
> on the database or on configuration- that is not the issue I have been
> complaining about.

Yea, still something we need to figure out :)

I'm fine with the DB based solution, if we have decent tooling for extensions to
register their content models, etc.

> What I blocked is having 6000 million rows (x40 due to redundancy)
> with the same column value "gzip; version 3 (1-2-3-testing-testing. It
> seems to work)" when it can be summarized as a 1-byte or less id (and
> that id be explained somewhere else). 

Yea, that's not what I would recommend either. What I meant is that we can now,
as a stepping stone and without blocking on a schema change, fill in the null
values in the revision table for the revisions of a page that is being converted
to a new model, to avoid confusion. Converting pages to a different model is
relatively rare, so I think it would not have much of an impact on the big
picture.
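
A minimal sketch of that stepping stone, using SQLite from Python: before the page's model is changed, every revision of that page that still has NULL gets the old model written out explicitly. The column names follow the ones used in this thread, but the tables are reduced to the bare minimum, `convert_page_model` is a hypothetical helper, and the sketch assumes page_content_model is itself never NULL.

```python
import sqlite3


def convert_page_model(conn, page_id, new_model):
    """Pin NULL revisions to the old model, then switch the page to new_model."""
    old_model = conn.execute(
        "SELECT page_content_model FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()[0]
    with conn:  # both updates commit together or not at all
        conn.execute(
            "UPDATE revision SET rev_content_model = ?"
            " WHERE rev_page = ? AND rev_content_model IS NULL",
            (old_model, page_id),
        )
        conn.execute(
            "UPDATE page SET page_content_model = ? WHERE page_id = ?",
            (new_model, page_id),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_content_model TEXT NOT NULL
);
CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    rev_page INTEGER NOT NULL,
    rev_content_model TEXT
);
INSERT INTO page VALUES (1, 'wikitext'), (2, 'wikitext');
INSERT INTO revision VALUES (10, 1, NULL), (11, 1, NULL), (20, 2, NULL);
""")
convert_page_model(conn, 1, "wikibase-item")
```

Only the revisions of the converted page are touched; revisions of other pages keep their NULL, which is why the cost stays small as long as conversions are rare.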

> Of course there are a lot of history and legacy and maintenance
> issues, but when the guy that actually would spend days of his life
> running schema changes so they do not affect production is the one
> begging for them to happen you know there is an issue. And this is not
> a "MediaWiki is bad" complaint. I think MediaWiki is a very good piece
> of software. I only want to make it better with very, very small
> maintenance-like changes.

I'm all for it!

> 
>> The disadvantage is of course that the model and format are not obvious when
>> eyeballing the result of an SQL query.
> 
> Are you serious? Because this is super-clear already :-P:

That was, if I remember correctly, one of the arguments for using readable
strings there, instead of int values and a config variable, as I originally
proposed. This was discussed at the last Berlin hackathon, must have been 2012.
Tim may remember more details. We should probably re-consider the pros and cons
we discussed back then when planning to change the schema now.

-- daniel


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Jaime Crespo
Your last question is a non-issue for me. I do not care whether things are
in the database or in configuration; that is not the issue I have been
complaining about.

What I blocked is having 6000 million rows (x40 due to redundancy)
with the same column value "gzip; version 3 (1-2-3-testing-testing. It
seems to work)" when it can be summarized as a 1-byte or less id (and
that id be explained somewhere else). The difference between the two
options is extremely cheap to code, and not only would it save
thousands of dollars in server cost, it would also minimize
maintenance cost and dramatically increase performance (or not
decrease it) on one of the largest bottlenecks for large wikis, as it
could fit fully into memory (yes, we have 515 GB servers now).
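
For a sense of scale, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch only: it uses the deliberately exaggerated example string from the paragraph above, counts raw column bytes, and ignores row overhead, indexes, and the x40 redundancy factor.

```python
ROWS = 6_000_000_000  # "6000 million rows" from the message above
LABEL = "gzip; version 3 (1-2-3-testing-testing. It seems to work)"


def column_bytes(rows, bytes_per_value):
    """Raw size of one column when every row stores bytes_per_value bytes."""
    return rows * bytes_per_value


string_total = column_bytes(ROWS, len(LABEL.encode("utf-8")))
id_total = column_bytes(ROWS, 1)  # a 1-byte ID per row
saved_gib = (string_total - id_total) / 2**30

print(f"text column: {string_total / 2**30:.0f} GiB")
print(f"ID column:   {id_total / 2**30:.0f} GiB")
print(f"difference:  {saved_gib:.0f} GiB per copy of the table")
```

Even with a much shorter real value than the joke string, the same multiplication by billions of rows applies, which is the point being made.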

To give you an idea of how bad things are currently: WMF's
architecture technically does not store any data on the main database
servers (a lot of asterisks here, allow me to be inexact for the sake of
simplicity), only metadata, as the wiki content is stored on the
"external storage" subsystem. I gave InnoDB compression [0] a try
(it has a very low compression ratio and a very small block size,
as it is for real-time purposes only), yet I was able to reduce the
disk usage to less than half by only compressing the top 10 tables:
[1]. If this is not an objective measurement of how inefficient
the MediaWiki schema is, I do not know how I can convince you otherwise.

Of course there are a lot of history and legacy and maintenance
issues, but when the guy that actually would spend days of his life
running schema changes so they do not affect production is the one
begging for them to happen you know there is an issue. And this is not
a "MediaWiki is bad" complaint. I think MediaWiki is a very good piece
of software. I only want to make it better with very, very small
maintenance-like changes.

> The disadvantage is of course that the model and format are not obvious when
> eyeballing the result of an SQL query.

Are you serious? Because this is super-clear already :-P:

MariaDB  db1057 enwiki > SELECT * FROM revision LIMIT 1000,1\G
*************************** 1. row ***************************
       rev_text_id: 1161 -- what?
[...]
 rev_content_model: NULL -- what?
rev_content_format: NULL
1 row in set (0.00 sec)

MariaDB  db1057 enwiki > SELECT * FROM text WHERE old_id=1161; -- WTF, old_id?
+--------+---------------------+----------------+
| old_id | old_text            | old_flags      |
+--------+---------------------+----------------+
|   1161 | DB://rc1/15474102/0 | external,utf-8 |  -- WTF is this?
+--------+---------------------+----------------+
1 row in set (0.03 sec)

I am joking at this point, but emulating what someone that looks at
the db would say. My point is that mediawiki is no longer simple.

More recommended reading (not for you, for many developers that still
are afraid of them- and I really found many cases in the wild for
otherwise good contributors):



[0] 
[1] 


On Tue, Jul 12, 2016 at 10:40 AM, Daniel Kinzler
 wrote:
> Addendum, after sleeping over this:
>
> Do we really want to manage something that is essentially configuration,
> namely the set of available content models and formats, in a database
> table? How is it maintained?

-- 
Jaime Crespo



Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-12 Thread Daniel Kinzler
Addendum, after sleeping on this:

Do we really want to manage something that is essentially configuration, namely
the set of available content models and formats, in a database table? How is it
maintained?

For context:
* As per T113034, we are moving away from managing interwiki prefixes in the
database, in favor of configuration files.
* Namespace IDs are defined in LocalSettings.php.

The original design of ContentHandler used integer IDs for content models and
formats in the DB. A mapping to human readable names is only needed for logging
and error messages anyway. Such a mapping could be maintained in
LocalSettings.php, just like we do for namespaces. This would also serve to
avoid ID clashes. My idea back then was to have a sort of registry on
mediawiki.org where extensions could reserve an ID for themselves, so that the
same ID would stand for the same model everywhere.

The disadvantage is of course that the model and format are not obvious when
eyeballing the result of an SQL query. It also makes database dumps more
brittle, since they cannot be interpreted without knowledge of the format and
model identifiers. That's an argument for having these in the DB.

Still... configuration in the database is nasty to maintain by hand, and also
annoying for extensions that define content models. Do we introduce a simple
hook that makes sure the content model and format get registered in the
database?


On 11.07.2016 at 21:26, Daniel Kinzler wrote:
> Hi Jaime, thanks for the pointer! I had completely forgotten about that.
> 
> A few thoughts about that RFC:
> 
> * I have long thought that content_format is pretty pointless and redundant. I
> haven't seen any content model that uses different serialization formats (I
> wrote a few that support two, but only ever used one). If the serialization
> does need to change for some reason, it's usually easy to detect from the
> first few bytes.
> 
> * What we need instead is versioning on the content model. It happens quite
> often that the data structure you store changes slightly. Knowing what version
> you are dealing with is quite helpful when deserializing and processing. These
> differences are much harder to auto-detect than the serialization format.
> 
> * Per-page and per-revision content model will become redundant with
> Multi-Content-Revisions. We will instead have this info in the revision_slot
> table (multiple per revision). The same design still applies, but changing the
> page and revision table would be pointless. We would just ignore the content
> model (and format) in the page and revision table, and rely on the info from
> the slot table instead. At some point, we can then drop this info from page
> and revision.
> 
> I propose to introduce the content_model (and maybe also content_format)
> tables, but not touch the page and revision table for now. Instead, we
> introduce revision_slots for Multi-Content-Revisions first, using the
> content_model table, and introduce model versioning; maybe drop the format
> in the process.
> 
> What do you think?
> 
> On 11.07.2016 at 14:27, Jaime Crespo wrote:
>> On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler
>>  wrote:
>>> It seems there is disagreement about what the correct interpretation of
>>> NULL in the rev_content_model column is. Should NULL there mean
>>
>>> What should we write into rev_content_model in the future
>>
>> Content model handling is pending a refactoring:
>> 
>> Once that happens, they should never be NULL.
>>

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Stas Malyshev
Hi!

> It seems there is disagreement about what the correct interpretation of NULL
> in the rev_content_model column is. Should NULL there mean
> 
> (a) "the current page content model, as recorded in page_content_model"
> 
> or should it mean
> 
> (b) "the default for this title, no matter what page_content_model says"?

As I understand, NULL is there as a space-saving measure. So I guess we
want to ask ourselves if we want to go to so much trouble to save space...

Abstractly, (a) looks better than (b) to me, since it avoids the scenario
where the default changes and every page that relied on the default is now
broken. OTOH, if the pages are updated together with the default, that
must have caused page_content_model to update too, so in this case (a)
should work too.

> There is also an in-between option, let's call it a/b: fall back to
> page_content_model for the latest revision (that should *always* be right),
> but to ignore page_content_model for older revisions. That would cater to use case

This may be even better, since the page record is supposed to match the latest
revision, but not prior revisions. That still leaves prior revisions broken
if the default changes, but at least the current one isn't.
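
The three readings of NULL discussed here can be put side by side in a small sketch. It is illustrative only: `resolve_model` is a hypothetical function, revisions and pages are plain dictionaries, and the example assumes a page whose model was switched to "wikibase-item" on a wiki where the default for the title is "wikitext".

```python
def resolve_model(rev, page, title_default, policy):
    """Resolve the content model of a revision whose stored model may be NULL."""
    if rev["rev_content_model"] is not None:
        return rev["rev_content_model"]  # explicit value always wins
    if policy == "a":
        # (a): NULL means "whatever page_content_model currently says".
        return page["page_content_model"] or title_default
    if policy == "b":
        # (b): NULL means "the default for this title", ignoring the page row.
        return title_default
    if policy == "a/b":
        # a/b: trust the page row only for the latest revision.
        if rev["rev_id"] == page["page_latest"]:
            return page["page_content_model"] or title_default
        return title_default
    raise ValueError(f"unknown policy: {policy!r}")


page = {"page_latest": 2, "page_content_model": "wikibase-item"}
old_rev = {"rev_id": 1, "rev_content_model": None}
latest_rev = {"rev_id": 2, "rev_content_model": None}
TITLE_DEFAULT = "wikitext"
```

With this data, (a) reports the old revision as "wikibase-item", (b) reports the latest revision as "wikitext", and a/b reports the latest as "wikibase-item" and the old one as "wikitext". Which of these answers is wrong depends on what the revisions actually contain, which is exactly the disagreement in this thread.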


-- 
Stas Malyshev
smalys...@wikimedia.org


Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Brian Wolff
On Monday, July 11, 2016, Daniel Kinzler wrote:
> On 11.07.2016 at 17:43, Brian Wolff wrote:
>> To me, (b) makes more sense, as all the other fields in page represent the
>> info for the current revision. Additionally all the fields in revision
>> (except rev_deleted) are immutable and never change, and definitely don't
>> change interpretation based on other db fields. Having old revisions have a
>> dependency on the page table (especially a dependency going in the
>> direction revision->page) seems wrong to me.
>
> The question is whether you want the interpretation of that field to depend
> on another database field related to the same page, or on global
> configuration. Both seem wrong, but depending on config seems worse: in the
> cases where it happens, there is no way to fix it. A database field can at
> least be updated.

I guess this ultimately comes down to fairly arbitrary opinions, but I do
actually think making the interpretation dependency graph of the database
that convoluted is a bigger evil than depending on global config.

--
bawolff

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Brian Wolff
On Monday, July 11, 2016, Daniel Kinzler wrote:
> Hi Jaime, thanks for the pointer! I had completely forgotten about that.
>
> A few thoughts about that RFC:
>
> * I have long thought that content_format is pretty pointless and
> redundant. I haven't seen any content model that uses different
> serialization formats (I wrote a few that support two, but only ever used
> one). If the serialization does need to change for some reason, it's
> usually easy to detect from the first few bytes.

As an aside, I've recently (as in literally last week) been doing some
stuff using multiple serialization formats (specifically, I wanted the user
to be able to choose what format to edit as, but always save in the canonical
format). It's working pretty well for my use case. Two issues I encountered
were that the show diff button on the edit page is totally broken (T139249)
and that there is no way to separate out the default format for editing from
the default format for the db.

(Sorry if this is off topic, I just wanted to mention I'm actually using
content format, albeit not the db part of it).

--
bawolff

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Daniel Kinzler
On 11.07.2016 at 17:43, Brian Wolff wrote:
> To me, (b) makes more sense, as all the other fields in page represent the
> info for the current revision. Additionally all the fields in revision
> (except rev_deleted) are immutable and never change, and definitely don't
> change interpretation based on other db fields. Having old revisions have a
> dependency on the page table (especially a dependency going in the
> direction revision->page) seems wrong to me.

The question is whether you want the interpretation of that field to depend on
another database field related to the same page, or on global configuration.
Both seem wrong, but depending on config seems worse: in the cases where it
happens, there is no way to fix it. A database field can at least be updated.

On 11.07.2016 at 16:10, Brad Jorsch (Anomie) wrote:
> Both your (a) and (b) are wrong in some cases. Until we really fix it, we
> should probably just stick with the current (b) instead of dealing with the
> hassle of switching between one bad option and another.

Yea, I agree that it's generally better to stick with the evil you know. But
then, if one kind of wrongness has a lot more impact than the other, that may
tip the scale the other way...


But in any case, it seems we have to just fix the data in the database to get
around the issue. My problem is now, when installing Wikibase, how do we detect
which revisions need rewriting?
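
(For illustration only: a minimal sketch of the data fix itself, i.e. replacing
NULL in rev_content_model with an explicit model for all revisions of one page.
It uses an in-memory SQLite database with heavily simplified page and revision
tables. The helper name backfill_page and the sample rows are hypothetical. It
assumes page_content_model is the right value for every NULL revision of that
page, which only holds if the page was never converted. It does not answer the
detection question above.)

```python
import sqlite3

# Simplified stand-ins for the page and revision tables (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id            INTEGER PRIMARY KEY,
    page_content_model TEXT NOT NULL
);
CREATE TABLE revision (
    rev_id            INTEGER PRIMARY KEY,
    rev_page          INTEGER NOT NULL,
    rev_content_model TEXT              -- NULL means "use some default"
);
INSERT INTO page VALUES (1, 'wikitext');
INSERT INTO revision VALUES (10, 1, NULL), (11, 1, NULL), (12, 1, 'json');
""")


def backfill_page(conn, page_id):
    """Write the page's current model into every NULL revision of that page.

    Hypothetical helper. Returns the number of revisions that were updated.
    """
    cur = conn.execute(
        """
        UPDATE revision
           SET rev_content_model = (
                   SELECT page_content_model FROM page WHERE page_id = rev_page
               )
         WHERE rev_page = ? AND rev_content_model IS NULL
        """,
        (page_id,),
    )
    conn.commit()
    return cur.rowcount


updated = backfill_page(conn, 1)
rows = conn.execute(
    "SELECT rev_id, rev_content_model FROM revision ORDER BY rev_id"
).fetchall()
```

Revisions that already carry an explicit model are left alone; only the NULL
placeholders are filled in.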

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Daniel Kinzler
Hi Jaime, thanks for the pointer! I had completely forgotten about that.

A few thoughts about that RFC:

* I have long thought that content_format is pretty pointless and redundant. I
haven't seen any content model that uses different serialization formats (I
wrote a few that support two, but only ever used one). If the serialization does
need to change for some reason, it's usually easy to detect from the first few
bytes.

* What we need instead is versioning on the content model. It happens quite
often that the data structure you store changes slightly. Knowing what version
you are dealing with is quite helpful when deserializing and processing. These
differences are much harder to auto-detect than the serialization format.
* Per-page and per-revision content model will become redundant with
Multi-Content-Revisions. We will instead have this info in the revision_slots
table (multiple per revision). The same design still applies, but changing the
page and revision table would be pointless. We would just ignore the content
model (and format) in the page and revision table, and rely on the info in the
slot table instead. At some point, we can then drop this info from page and
revision.

I propose to introduce the content_model (and maybe also content_format) tables,
but not touch the page and revision table for now. Instead, we introduce
revision_slots for Multi-Content-Revisions first, using the content_model table,
and introduce model versioning; maybe drop the format in the process.
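
(For illustration only: a rough sketch of what such a layout could look like,
using an in-memory SQLite database. Only the table names content_model and
revision_slots come from the proposal above. Every column name, the "role"
column, the version column, and the sample rows are assumptions, not an agreed
schema.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Registry of content models: one small integer id per model name.
CREATE TABLE content_model (
    model_id   INTEGER PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE
);
-- One row per slot of a revision; the model is never NULL here.
CREATE TABLE revision_slots (
    slot_rev_id        INTEGER NOT NULL,
    slot_role          TEXT    NOT NULL,            -- assumed column
    slot_model_id      INTEGER NOT NULL REFERENCES content_model(model_id),
    slot_model_version INTEGER NOT NULL DEFAULT 1,  -- assumed column
    PRIMARY KEY (slot_rev_id, slot_role)
);
""")

conn.executemany(
    "INSERT INTO content_model (model_id, model_name) VALUES (?, ?)",
    [(1, "wikitext"), (2, "json")],
)
conn.executemany(
    "INSERT INTO revision_slots"
    " (slot_rev_id, slot_role, slot_model_id, slot_model_version)"
    " VALUES (?, ?, ?, ?)",
    [(100, "main", 1, 1), (100, "metadata", 2, 3)],
)


def slot_models(rev_id):
    """Return (role, model name, model version) for every slot of a revision."""
    return conn.execute(
        """
        SELECT s.slot_role, m.model_name, s.slot_model_version
          FROM revision_slots AS s
          JOIN content_model AS m ON m.model_id = s.slot_model_id
         WHERE s.slot_rev_id = ?
         ORDER BY s.slot_role
        """,
        (rev_id,),
    ).fetchall()
```

With this shape the model is looked up per slot through the registry, so the
page and revision tables would not need to be consulted at all.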

What do you think?

On 11.07.2016 at 14:27, Jaime Crespo wrote:
> On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler wrote:
>> It seems there is disagreement about what the correct interpretation of
>> NULL in the rev_content_model column is. Should NULL there mean
> 
>> What should we write into rev_content_model in the future
> 
> Content model handling is pending a refactoring:
> 
> Once that happens, they should never be NULL.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Brian Wolff
On Monday, July 11, 2016, Daniel Kinzler wrote:
> It seems there is disagreement about what the correct interpretation of NULL
> in the rev_content_model column is. Should NULL there mean
>
> (a) "the current page content model, as recorded in page_content_model"
>
> or should it mean
>
> (b) "the default for this title, no matter what page_content_model says"?
>
>
> Kunal and I have had an unintentional edit war about this question in
> Revision.php:
>
> Kunal changed it from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/222043/
> I later changed it from (b) to (a) in https://gerrit.wikimedia.org/r/#/c/297787/
> Kunal reverted me from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/298239/
>
>
> So, which way do we want it?
>
>
> The conflict seems to arise from (at least) three competing use cases:
>
> I) re-interpreting page content. For instance, a user may move a misnamed
> User:Foo.jss to User:Foo.js. In this case, the content should be
> re-interpreted as JavaScript, including all old revisions. This would be in
> favor of behavior (a), though it still works with (b), because the default
> model changes based on the suffix ".js". I think it would however be better
> to only rely on title parsing magic once, when creating the page, not later,
> when rendering old revisions.
>
> II) converting page content. For instance, if a talk page gets converted to
> using Flow, new revisions (and page_content_model) will have the Flow model,
> while old revisions need to keep their original wikitext model (even though
> their rev_content_model is null). That would need behavior (b).
>
> III) changing a namespace's default content model. E.g. when installing an
> extension that changes the default content model of a namespace (such as
> Wikibase with Items in the main namespace, or Flow-per-default for Talk
> pages), existing pages that were already in that namespace should still be
> readable. With (b), this would fail: even though page_content_model has the
> correct model for reading the page, rev_content_model is null, so the new
> namespace default is used, which will fail. With (a), this would simply work:
> the page will be rendered according to page_content_model.
>
>
> In all cases it's possible to resolve the issue by replacing the NULL entries
> for all revisions of a page with the current model id. The question is just
> when and how we do that, and when and how we can even detect that this needs
> doing.
>
> There is also an in-between option, let's call it a/b: fall back to
> page_content_model for the latest revision (that should *always* be right),
> but to ignore page_content_model for older revisions. That would cater to use
> case III at least in so far as it would be possible to view the "misplaced"
> pages. But viewing old revisions or diffs would still fail with a nasty
> error. This option may look better on the surface, but I fear it will just
> add to the confusion.
>
> There's another fix: never write null into rev_content_model. Always put the
> actual model ID there. That's pretty wasteful, but it's robust and reliable.
> When we decided to use null as a placeholder for the default, we assumed the
> default would never change. But as we now see, it sometimes does...
>
>
> So, what should it be, option (a) or (b)? And how do we address the use case
> that is then broken? What should we write into rev_content_model in the
> future?
>
> I personally think that option (a) makes more sense, because the resolutions
> of defaults is then local to the database. It could even be done within the
> SQL query. It's easier to maintain consistency that way. For use case II,
> that would require us to "fill in" all the rev_content_model fields in old
> revisions when converting a page. I think it would be a good thing to do
> that. If we have the content model change between revisions, it seems prudent
> to record it explicitly.
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.

To me, (b) makes more sense, as all the other fields in page represent the
info for the current revision. Additionally all the fields in revision
(except rev_deleted) are immutable and never change, and definitely don't
change interpretation based on other db fields. Having old revisions have a
dependency on the page table (especially a dependency going in the
direction revision->page) seems wrong to me.
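
(For illustration only: a small sketch of the two interpretations from the
quoted message, showing where they diverge for use cases II and III. All
function names are hypothetical and the model names are just example strings.
This is not the logic in Revision.php.)

```python
from typing import Optional


def default_model_for_title(title: str, namespace_default: str = "wikitext") -> str:
    """Hypothetical stand-in for the title-based default (config-dependent)."""
    if title.endswith(".js"):
        return "javascript"
    return namespace_default


def resolve_a(rev_model: Optional[str], page_model: str) -> str:
    """Option (a): NULL means "whatever page_content_model currently says"."""
    return rev_model if rev_model is not None else page_model


def resolve_b(
    rev_model: Optional[str], title: str, namespace_default: str = "wikitext"
) -> str:
    """Option (b): NULL means "the default for this title", ignoring the page row."""
    if rev_model is not None:
        return rev_model
    return default_model_for_title(title, namespace_default)


# Use case II: a talk page was converted to Flow. The old revision is really
# wikitext, but its rev_content_model is NULL.
case2_a = resolve_a(None, page_model="flow-board")  # misreads the old revision
case2_b = resolve_b(None, "Talk:Foo")               # still wikitext

# Use case III: the namespace default was changed to an item model after the
# page was created. The page row still says wikitext, the revision is NULL.
case3_a = resolve_a(None, page_model="wikitext")                       # still wikitext
case3_b = resolve_b(None, "Foo", namespace_default="wikibase-item")    # misreads it
```

Each option gets one of the two cases wrong, and an explicit (non-NULL)
rev_content_model makes both functions agree.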

--
bawolff

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Brad Jorsch (Anomie)
On Mon, Jul 11, 2016 at 8:07 AM, Daniel Kinzler wrote:

> There's another fix: never write null into rev_content_model. Always put the
> actual model ID there. That's pretty wasteful, but it's robust and reliable.
>

This. We probably would have done this a long time ago, except it's blocked
on T105652 so it won't significantly expand the size of the revision table,
and that in turn you blocked on T107595.

Both your (a) and (b) are wrong in some cases. Until we really fix it, we
should probably just stick with the current (b) instead of dealing with the
hassle of switching between one bad option and another.


-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] What's the "correct" content model when rev_content_model is NULL?

2016-07-11 Thread Jaime Crespo
On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler wrote:
> It seems there is disagreement about what the correct interpretation of NULL
> in the rev_content_model column is. Should NULL there mean

> What should we write into rev_content_model in the future

Content model handling is pending a refactoring:

Once that happens, they should never be NULL.
