https://bugzilla.wikimedia.org/show_bug.cgi?id=40267

Marcin Cieślak <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |parser
                URL|                            |https://de.wikipedia.org/w/
                   |                            |index.php?title=Bahnhof_Aac
                   |                            |hen_Schanz
                 CC|                            |[email protected]
          Component|Database                    |Parser

--- Comment #1 from Marcin Cieślak <[email protected]> 2012-09-20 20:35:09 UTC ---
This is intentional. There is a function in the parser, called
replaceUnusualEscapes, that normalizes the URL by decoding all URL escapes
that are not prescribed by RFC 1738. Only characters outside of the ASCII
range (32,127) and those having some meaning in the URL stay escaped:
<>"#{}|\^~[]`;/?

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php;h=59d379a06ea8bc16fe24dfa28754688d1b8d1247;hb=HEAD#l1624
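
As a rough illustration of the rule described above, here is a minimal Python
sketch. It is not a copy of the PHP code in Parser.php. The names
replace_unusual_escapes and KEEP_ESCAPED are made up for this example, and
reading (32,127) as "strictly between 32 and 127" is an assumption.

```python
import re

# Characters that stay percent-encoded, taken from the list quoted above.
KEEP_ESCAPED = '<>"#{}|\\^~[]`;/?'

# Matches a single percent-escape such as %2D or %c3.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def replace_unusual_escapes(url: str) -> str:
    """Decode percent-escapes that do not need to be escaped (sketch only)."""

    def decode(match: re.Match) -> str:
        code = int(match.group(0)[1:], 16)
        char = chr(code)
        # Assumption: decode only printable ASCII strictly between 32 and 127
        # that is not in the list of characters with a meaning in the URL.
        if 32 < code < 127 and char not in KEEP_ESCAPED:
            return char
        # Everything else is left exactly as it was written.
        return match.group(0)

    return _ESCAPE.sub(decode, url)
```

With this sketch, %2D in a URL becomes "-", while %2F, %20 and %C3%A4 are left
as they are.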

The rationale is this:

    * Convert unnecessary URL escape codes in external links to their
      equivalent character before doing anything with them. This prevents
      certain kinds of spam filter evasion. (Parser.php only)

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git&h=eb53cc08560721208e195c0f073809e7b3eee485

RFC 3986 defines :/?#[]@ as generic delimiters (gen-delims) and !$&'()*+,;=
as sub-delimiters (sub-delims) that may be used as delimiters by particular
schemes or implementations. It also defines "unreserved characters":

2.3.  Unreserved Characters

   Characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include uppercase and lowercase
   letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

   URIs that differ in the replacement of an unreserved character with
   its corresponding percent-encoded US-ASCII octet are equivalent: they
   identify the same resource.  However, URI comparison implementations
   do not always perform normalization prior to comparison (see Section
   6).  For consistency, percent-encoded octets in the ranges of ALPHA
   (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
   underscore (%5F), or tilde (%7E) should not be created by URI
   producers and, when found in a URI, should be decoded to their
   corresponding unreserved characters by URI normalizers.
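
The decoding that section 2.3 recommends can be sketched in Python as follows.
The function name decode_unreserved is hypothetical, and the sketch covers only
the decoding of unreserved characters, not the other normalization steps that
RFC 3986 describes in section 6.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches a single percent-escape such as %7E or %2d.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def decode_unreserved(uri: str) -> str:
    """Decode percent-escapes of unreserved characters only (sketch)."""

    def decode(match: re.Match) -> str:
        char = chr(int(match.group(0)[1:], 16))
        # Escapes of reserved or non-ASCII octets are kept untouched.
        return char if char in UNRESERVED else match.group(0)

    return _ESCAPE.sub(decode, uri)
```

Note that this decodes %7E to "~", whereas "~" is in the list of characters
that the parser function keeps escaped.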


But here only the externallinks table is affected, which is useful for
tracking links. The URLs in the wikitext are displayed and linked as they were.
Therefore I am not sure this is a bug. What's the problem?
