Your last question is a non-issue for me- I do not care if things are
in the database or in configuration- that is not the issue I have been
complaining about.

What I blocked is having 6000 million rows (x40 due to redundancy)
with the same column value "gzip; version 3 (1-2-3-testing-testing. It
seems to work)" when it can be summarized as a 1-byte (or smaller) id,
with that id explained somewhere else. The difference between the two
options is extremely cheap to code, and not only would it save
thousands of dollars in server cost, it would also minimize
maintenance cost and dramatically increase performance (or at least
not decrease it) on one of the largest bottlenecks for large wikis, as
the data could then fit fully into memory (yes, we have 515 GB servers
now).
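The kind of normalization I am asking for can be sketched like this (a
toy sqlite sketch; the table and column names here are made up for
illustration, not the actual MediaWiki schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized layout: the long string lives once in a lookup table;
# each revision row stores only a tiny integer id.
cur.execute(
    "CREATE TABLE content_format (cf_id INTEGER PRIMARY KEY, cf_name TEXT)"
)
cur.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, "
    "rev_format_id INTEGER REFERENCES content_format(cf_id))"
)

long_value = "gzip; version 3 (1-2-3-testing-testing. It seems to work)"
cur.execute("INSERT INTO content_format VALUES (1, ?)", (long_value,))
cur.executemany(
    "INSERT INTO revision (rev_id, rev_format_id) VALUES (?, 1)",
    [(i,) for i in range(1, 1001)],
)

# Per-row cost drops from ~57 bytes of repeated text to a 1-byte id;
# the original value is recovered with a join only when needed.
row = cur.execute(
    "SELECT r.rev_id, f.cf_name FROM revision r "
    "JOIN content_format f ON r.rev_format_id = f.cf_id "
    "WHERE r.rev_id = 42"
).fetchone()
print(row)
```

Multiply that per-row saving by thousands of millions of rows (and
x40 redundancy) and the server-cost argument makes itself.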

To give you an idea of how bad things currently are: WMF's
architecture technically does not store any data on the main database
servers (a lot of asterisks here; allow me to be inexact for the sake
of simplicity), only metadata, as the wiki content is stored on the
"external storage" subsystem. I gave InnoDB compression a try [0]
(which has a very low compression ratio and a very small block size,
as it is meant for real-time purposes only), yet I was able to reduce
the disk usage to less than half by compressing only the top 10
tables: [1]. If this is not an objective measurement of how
inefficient the MediaWiki schema is, I do not know how else I can
convince you.

Of course there is a lot of history and legacy, and there are
maintenance issues, but when the guy who would actually spend days of
his life running schema changes so they do not affect production is
the one begging for them to happen, you know there is an issue. And
this is not a "mediawiki is bad" complaint- I think MediaWiki is a
very good piece of software- I only want to make it better with very,
very small maintenance-like changes.

> The disadvantage is of course that the model and format are not obvious when
> eyeballing the result of an SQL query.

Are you serious? Because this is super-clear already :-P:

MariaDB  db1057 enwiki > SELECT * FROM revision LIMIT 1000,1\G
*************************** 1. row ***************************
       rev_text_id: 1161 -- what?
[...]
 rev_content_model: NULL -- what?
rev_content_format: NULL
1 row in set (0.00 sec)

MariaDB  db1057 enwiki > SELECT * FROM text WHERE old_id=1161; -- WTF, old_id?
+--------+---------------------+----------------+
| old_id | old_text            | old_flags      |
+--------+---------------------+----------------+
|   1161 | DB://rc1/15474102/0 | external,utf-8 |      -- WTF is this?
+--------+---------------------+----------------+
1 row in set (0.03 sec)

I am joking at this point, but emulating what someone who looks at
the db would say. My point is that mediawiki is no longer simple.
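Those two manual queries above are exactly what a single join does in
one step. A toy sketch (sqlite, with made-up miniatures of the
revision/text tables, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Miniature stand-ins for MediaWiki's revision and text tables.
cur.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, "
            "rev_text_id INTEGER)")
cur.execute("CREATE TABLE text (old_id INTEGER PRIMARY KEY, "
            "old_text TEXT, old_flags TEXT)")
cur.execute("INSERT INTO revision VALUES (1001, 1161)")
cur.execute("INSERT INTO text VALUES "
            "(1161, 'DB://rc1/15474102/0', 'external,utf-8')")

# One join replaces the two manual lookups from the session above.
row = cur.execute("""
    SELECT r.rev_id, t.old_text, t.old_flags
    FROM revision r
    JOIN text t ON t.old_id = r.rev_text_id
    WHERE r.rev_id = 1001
""").fetchone()
print(row)
```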

More recommended reading (not for you, but for the many developers
who are still afraid of joins- I really have found many cases in the
wild among otherwise good contributors):
<https://en.wikipedia.org/wiki/Join_(SQL)>


[0] <https://phabricator.wikimedia.org/T139055>
[1] 
<https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1467294350779&to=1467687175941&var-server=db1073&var-network=eth0>

On Tue, Jul 12, 2016 at 10:40 AM, Daniel Kinzler
<[email protected]> wrote:
> Addendum, after sleeping over this:
>
> Do we really want to manage something that is essentially
> configuration, namely the set of available content models and
> formats, in a database table? How is it maintained?

-- 
Jaime Crespo
<http://wikimedia.org>

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
