Bug ID: 61802
Summary: Use a different format for l10n_cache (or document why
the current one is the best one)
CC: alolita.sha...@gmail.com, asha...@wikimedia.org,
Web browser: ---
Mobile Platform: ---
Our current l10n_cache model seems to use serialised PHP arrays as the storage
mechanism for localisation strings. This makes perfect sense if we assume that
all use cases for retrieving the data are centred around PHP, which, for
production, they are. Unfortunately it's tremendously frustrating from a
research perspective. As an example, let's use namespace names and aliases,
which are stored in l10n_cache and accessible via the MediaWiki API.
Namespace names and aliases are a relatively common thing to need to retrieve,
at least for me, for things like introducing granularity into our request logs
or UA data.
Fortunately for our machines and unfortunately for our researchers, the
research and analytics machines are, very deliberately, not connected to the
internet directly (with the exception of stat1, which is being decommissioned).
Accordingly, the API option is not available to us; if we want to retrieve
namespace names, we need to use the l10n_cache table.
Doing this requires us to use a language that has a PHP unserialiser available
(Python has one; R does not), to roll our own if one isn't available, or to
write something incredibly hacky where we read the data in, de-serialise it
and save it in a more usable format /through/, say, PHP or Python. This is an
unattractive proposition because it makes for less readable code, which is a
concern not only for transparency but also in the situation where the code is
'productionised' by the analytics engineers, for which it needs to be workable
in Java.
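To illustrate the "roll our own" option: below is a minimal sketch of a
hand-rolled unserialiser in Python covering only the subset of PHP's
serialize() format relevant here (integers, strings, and arrays of those).
The sample data is a hypothetical namespace-id-to-name mapping, not an actual
l10n_cache row; a real implementation would also need to handle booleans,
nulls, nested objects, and byte-length (rather than character-length) strings.

```python
import json

def php_unserialize(s, pos=0):
    """Parse a small subset of PHP's serialize() output.

    Handles i: (integer), s: (string), and a: (array) only.
    Returns (value, position_after_value). Assumes ASCII data,
    since PHP string lengths are byte counts, not character counts.
    """
    t = s[pos]
    if t == 'i':                          # i:123;
        end = s.index(';', pos)
        return int(s[pos + 2:end]), end + 1
    if t == 's':                          # s:5:"Media";
        colon = s.index(':', pos + 2)
        length = int(s[pos + 2:colon])
        start = colon + 2                 # skip the :" prefix
        return s[start:start + length], start + length + 2  # skip the "; suffix
    if t == 'a':                          # a:2:{ key; value; ... }
        colon = s.index(':', pos + 2)
        count = int(s[pos + 2:colon])
        pos = colon + 2                   # skip the :{ prefix
        result = {}
        for _ in range(count):
            key, pos = php_unserialize(s, pos)
            val, pos = php_unserialize(s, pos)
            result[key] = val
        return result, pos + 1            # skip the closing }
    raise ValueError("unsupported type %r at offset %d" % (t, pos))

# Hypothetical serialised-PHP namespace mapping, converted to JSON:
blob = 'a:2:{i:0;s:5:"Media";i:1;s:7:"Special";}'
namespaces, _ = php_unserialize(blob)
print(json.dumps(namespaces))
```

Even this toy version is ~30 lines of format-specific parsing code that every
non-PHP consumer has to reinvent or vendor in.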
Can we switch away from serialised PHP to, say, JSON objects? If not, why not?
Is there documentation of the rationale for using serialised PHP anywhere?
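For comparison, if the same hypothetical namespace mapping were stored as
JSON, consuming it would be a one-liner in essentially any language (Python's
stdlib json module, R's jsonlite, Java's Jackson, etc.), with no custom
parsing code to maintain:

```python
import json

# The same hypothetical namespace mapping, stored as JSON.
blob = '{"0": "Media", "1": "Special"}'
namespaces = json.loads(blob)
print(namespaces["1"])  # Special
```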
Wikibugs-l mailing list