https://bugzilla.wikimedia.org/show_bug.cgi?id=60629

       Web browser: ---
            Bug ID: 60629
           Summary: HTTP 500 - {{#language:code1|code2}} if code2
                    contains single/double quotes or ampersand
           Product: MediaWiki extensions
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: ParserFunctions
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: verd...@wanadoo.fr
    Classification: Unclassified
   Mobile Platform: ---

Summary: we get a Server HTTP error 500 instantly with
  {{#language:code1|code2}}
if code2 contains a single or double quote, or an ampersand.

So,
  {{#language:en|'}} or
  {{#language:en|"}} or
  {{#language:en|&}}
DO crash. As these three characters are not valid in BCP47 language/locale codes
(or the few legacy non-standard codes used in Wikimedia sites and remaining in
various historic pages), the "codes" in a parameter are returned verbatim without
mapping them to a native language name.

But,
  {{#language:'}} or {{#language:'|en}} or
  {{#language:"}} or {{#language:"|en}} or
  {{#language:&}} or {{#language:&|en}}
DO NOT crash: they are returned verbatim (in fact only as decimal numeric
character entities).

Details follow.

-------

No language code should ever contain these three characters. Some local
extensions may want to use other characters such as spaces/underscores, colons,
slashes, at signs, dots..., but these don't crash the #language function, not
even if we attempt to feed it non-ASCII characters. Any occurrence of such
characters in parameter 1 will make #language return the input string verbatim
without translating it, so:

"{{#language:français}}" returns "français"
"{{#language:Slovopedia}}" returns "Slovopedia"

Now let's use a valid language code in the first parameter and also supply the
second parameter (to indicate that we want the language name translated into
another target language, if possible):

"{{#language:fr|en}}" returns "French"
"{{#language:fr|fr}}" returns "français"
"{{#language:fr|de}}" returns "Französisch"

OK now with missing translations (and no fallback):

"{{#language:pdc|ckb}}" returns "Pennsylvany German":
  both codes are valid, there's no other fallback than English

"{{#language:pdc|ckb-brai}}" returns "Pennsylvany German":
  both codes are valid BCP47 codes, but the Braille script variant of language
code "ckb" is still undefined (this would require implementing the
transliteration scheme to Braille for this language); the server may retry
using BCP47 rules looking for a translation in "diq" only, it does not find it,
and after looking for defined fallbacks of "ckb", will finally select the
default to give a name of "pdc" in English.

Now with invalid codes:
"{{#language:pdc|ckb+brai(1)}}" returns "Plattdütsch":
  the second code is invalid under all rules, so it is ignored. No fallback
chain can be determined, so the server will try to find the native name (all
supported languages in MediaWiki have a native name or "autonym").

Now with invalid codes including the apostrophe-quote:
"{{#language:pdc|ckb it's failing}}": the server crashes with HTTP 500.

This is a serious issue which could enable a DoS attack on the server if the
following very simple code:
"{{#language:en|'}}"
is inserted in a widely used template, so that it will block navigation
over lots of pages (and many server errors 500 may drain a lot of resources, if
this error 500 comes from a PHP instance crash that must be restarted).

This code could be generated by feeding the second parameter with a subpagename
(coming from {{SUBPAGENAME}} where it is HTML-encoded, or from {{SUBPAGENAMEE}}
where it is URL-encoded with the legacy "WIKI" style).

To correct this:

The 2nd parameter of #language must be checked like the 1st one: if the string
is longer than allowed for language codes (you could accept up to the max length
of a page name), or if it contains any of the characters ['"&], treat this
parameter as an invalid language code and ignore it (but you can still use the
1st code to return the autonym mapped to it).
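
The check proposed above could look roughly like the following Python sketch.
It is an illustration only, not the ParserFunctions code: the names
usable_target_code, language_name and lookup are hypothetical, and the 255-byte
limit is an assumption standing in for "the max length of a page name".

```python
import re

# Assumption: the report only says "up to the max length of a page name";
# 255 bytes is used here as a stand-in for that limit.
MAX_CODE_BYTES = 255

# The three characters the report identifies as triggering the HTTP 500.
FORBIDDEN = re.compile(r"['\"&]")


def usable_target_code(code: str) -> bool:
    """Return True if `code` may be passed on as the target language code."""
    if not code or len(code.encode("utf-8")) > MAX_CODE_BYTES:
        return False
    return FORBIDDEN.search(code) is None


def language_name(code1, code2, lookup):
    """Hypothetical wrapper around a name lookup.

    `lookup(code, target)` stands in for whatever resolves a language name;
    an unusable second parameter is dropped (target=None) instead of being
    passed through, so the first code can still yield its autonym.
    """
    target = code2 if code2 is not None and usable_target_code(code2) else None
    return lookup(code1, target)
```

With this sketch, language_name("en", "'", lookup) calls lookup("en", None), so
the bad second parameter is simply ignored.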

For now, on MediaWiki-Wiki I completed the following article about the issues
and tricky details (and other related bugs/inconsistencies I discovered):

[[mw:Manual:PAGENAMEE encoding]]

Look at the table in this page showing the effects of the various encodings
used in pagenames or for the three styles of urlencodings and anchorencode.


But the real issue in this bug report is in #language.


To avoid this bug in pages that attempt to detect whether a page is a translation
or the source page of translations by checking the content of their last
subpagename, I also performed many tests to make sure that
[[m:Template:Pagelang]] on Meta-Wiki and on MediaWiki-Wiki will now NEVER
return any subpage name that:

* matches the full page (this is not a subpage of another base page, so it is
not a translation produced by the Translate extension).

* is not idempotent through {{lc:{{PAGENAME|...}}}}
  (this excludes subpagenames containing capital letters and any characters
forbidden or transformed in pagenames)

* contains any character that remains HTML-encoded after calling {{titleparts}}
  (these are the three characters ['"&])

* contains any other characters than [a-z0-9-.], i.e.
  the only characters that are idempotent in all encodings, including
URL-encoding in its most restrictive style ("QUERY" style since MediaWiki
1.17).

* does not start with a letter (this can be tested by comparing "lc:" to
"ucfirst:lc:", as they MUST be different, given that only ASCII letters are
allowed).
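
The five exclusion rules above can be summarized in a short Python sketch. This
is only an approximation of what the template does in wikitext: the function
name looks_like_language_subpage is hypothetical, and lowercasing with
str.lower() is assumed to stand in for {{lc:...}}.

```python
import re

# Only the characters the list above describes as stable in all encodings.
ALLOWED = re.compile(r"[a-z0-9.-]+")


def looks_like_language_subpage(full_title: str) -> bool:
    """Return True if the last subpage name survives all five exclusion rules."""
    if "/" not in full_title:
        return False              # rule 1: not a subpage of another base page
    sub = full_title.rsplit("/", 1)[1]
    if sub != sub.lower():
        return False              # rule 2: changed by lowercasing
    if any(ch in "'\"&" for ch in sub):
        return False              # rule 3: characters that stay HTML-encoded
    if ALLOWED.fullmatch(sub) is None:
        return False              # rule 4: anything outside [a-z0-9-.]
    return sub[0].isalpha()       # rule 5: must start with a letter
```

For example, "Template:Page/fr" passes, while "User:Kennedy/Bob" is rejected by
rule 2 and a subpage named "it's" is rejected by rule 3.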

We could add other filters against some subpagename codes passing this test,
such as "doc" or "layout", "testcases", "sandbox", used in templates (they are
not valid BCP47 language codes, except "doc"; unfortunately documentation
subpages of templates on English or multilingual wikis use "/doc", but for now
we have never encountered the need to translate to the language with this code).

We could also apply stricter rules (to make sure that they are also valid
domain name labels, i.e. at most 63 ASCII characters, no double hyphens, no
trailing hyphens, if we exclude IDNA labels from interlanguage prefixes).
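
As a rough sketch of these stricter rules in Python (the function name
is_strict_label is hypothetical): the length cap used below is 63, which is the
DNS limit for a single label, and dots are rejected on the assumption that a
code must fit in one label.

```python
LABEL_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789-")


def is_strict_label(code: str) -> bool:
    """Return True if `code` could serve as a plain (non-IDNA) DNS label."""
    if not 1 <= len(code) <= 63:
        return False              # DNS labels hold at most 63 octets
    if "--" in code:
        return False              # also rules out IDNA "xn--" labels
    if code.startswith("-") or code.endswith("-"):
        return False              # no leading or trailing hyphen
    return all(ch in LABEL_CHARS for ch in code)
```

Rejecting any double hyphen is what excludes IDNA labels here, since those
start with "xn--".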

This means that all codes will be lowercase only (even if BCP47 codes are
case-insensitive, this gives fewer false positives with accidental subpages that
could be created starting with a capital ASCII letter), such as:
  "User:Kennedy/Bob"

But the following page name will accidentally match Indonesian when "id" is a
subtemplate returning a numeric id, and not a translation of
"Template:Page":
  "Template:Page/id"

We can hope that users trying to use common templates on their user subpages
will avoid naming them using sequences that could match valid language codes.
These few pages could be moved/redirected if needed: here it could be renamed:
  "Template:Page/Id"
so that it will no longer match a language code detected by the rules above.

Also, independently of the language codes supported in MediaWiki and in the new
Translate extension, there are still lots of legacy codes used in subpages that
mean specific variants of languages (they don't always match the BCP47 rules,
but at least they should only use ASCII lowercase letters, hyphens, and digits,
and no spaces/underscores or quotation marks; the few existing pages depending
on these codes could be reworked to change their codes to private codes
conforming to BCP47 rules).
