https://bugzilla.wikimedia.org/show_bug.cgi?id=61802

            Bug ID: 61802
           Summary: Use a different format for l10n_cache (or document why
                    the current one is the best one)
           Product: MediaWiki
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Internationalization
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: oke...@wikimedia.org
                CC: alolita.sha...@gmail.com, asha...@wikimedia.org,
                    da...@sheetmusic.org.uk, kartik.mis...@gmail.com,
                    niklas.laxst...@gmail.com, run...@gmail.com,
                    siebr...@wikimedia.org, sucheta.ghos...@gmail.com
       Web browser: ---
   Mobile Platform: ---

Our current l10n_cache model seems to use serialised PHP arrays as the storage
mechanism for localisation strings. This makes perfect sense if we assume that
all use cases for retrieving the data are centred around PHP, which, for
production, they are. Unfortunately it's tremendously frustrating from a
research perspective. As an example, let's use namespace names and aliases,
which are stored in l10n_cache and accessible via the MediaWiki API.

Namespace names and aliases are a relatively common thing to need to retrieve,
at least for me, for things like introducing granularity into our request logs
or UA data.

Fortunately for our machines and unfortunately for our researchers, the
research and analytics machines are, very deliberately, not connected to the
internet directly (with the exception of stat1, which is being decommissioned).
Accordingly, the API option is not available to us: if we want to retrieve
namespace names, we need to use the l10n_cache table.

Doing this requires us to be using a language with a PHP parser in it (Python
has one, R does not), roll our own if one isn't available, or write something
incredibly hacky where we read the data in, de-serialise it and save it in a
more usable format /through/, say, PHP or Python. This is an unattractive
proposition because it makes for less readable code, which is a concern not
only for transparency but also when the code is 'productionised' by the
analytics engineers, at which point it needs to be workable in Java.
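
To illustrate the "roll our own" route, here is a minimal Python sketch of the
de-serialise-and-re-save step. It is an illustration under stated assumptions,
not how MediaWiki reads the cache: the function names are made up, the sample
value is hand-written rather than copied from l10n_cache, and only the null,
boolean, integer, float, string and array types of PHP's serialize() format
are handled (no objects or references).

```python
import json


def php_unserialize(data: bytes):
    """Decode a subset of PHP's serialize() output (N, b, i, d, s, a)."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after serialised value")
    return value


def _parse(data: bytes, pos: int):
    """Parse one value starting at pos; return (value, next position)."""
    tag = data[pos:pos + 1]
    if tag == b"N":                          # N;
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):            # b:1;  i:42;  d:0.5;
        end = data.index(b";", pos)
        raw = data[pos + 2:end].decode("ascii")
        if tag == b"b":
            return raw == "1", end + 1
        if tag == b"i":
            return int(raw), end + 1
        return float(raw), end + 1
    if tag == b"s":                          # s:<byte length>:"<bytes>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon].decode("ascii"))
        start = colon + 2                    # skip the colon and opening quote
        text = data[start:start + length].decode("utf-8")
        return text, start + length + 2      # skip the closing quote and ';'
    if tag == b"a":                          # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon].decode("ascii"))
        pos = colon + 2                      # skip the colon and '{'
        result = {}
        for _ in range(count):
            key, pos = _parse(data, pos)
            value, pos = _parse(data, pos)
            result[key] = value
        return result, pos + 1               # skip the closing '}'
    raise ValueError(f"unsupported type tag {tag!r} at offset {pos}")


# Hypothetical sample value, NOT taken from a real l10n_cache row.
sample = 'a:2:{i:1;s:4:"Talk";i:14;s:10:"Catégorie";}'.encode("utf-8")
namespaces = php_unserialize(sample)
as_json = json.dumps(namespaces, ensure_ascii=False)
```

String lengths in the serialised form count bytes rather than characters, so
the sketch works on bytes and only decodes to text once a string has been
sliced out. JSON object keys are always strings, so the integer namespace IDs
come out as "1" and "14" after json.dumps.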

Can we switch away from serialised PHP to, say, JSON objects? If not, why not?
Is there documentation of the rationale for using serialised PHP anywhere?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
