https://bugzilla.wikimedia.org/show_bug.cgi?id=40267
Marcin Cieślak <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |parser
                URL|                            |https://de.wikipedia.org/w/index.php?title=Bahnhof_Aachen_Schanz
                 CC|                            |[email protected]
          Component|Database                    |Parser

--- Comment #1 from Marcin Cieślak <[email protected]> 2012-09-20 20:35:09 UTC ---
This is intentional.

There is a function in the parser, replaceUnusualEscapes, that normalizes the URL: all URL escapes that are not prescribed by RFC 1738 are decoded, so only characters outside the ASCII range (32,127) and those that have some meaning in a URL remain escaped:

  <>"#{}|\^~[]`;/?

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes/parser/Parser.php;h=59d379a06ea8bc16fe24dfa28754688d1b8d1247;hb=HEAD#l1624

The rationale is this:

  * Convert unnecessary URL escape codes in external links to their
    equivalent character before doing anything with them. This prevents
    certain kinds of spam filter evasion. (Parser.php only)

https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git&h=eb53cc08560721208e195c0f073809e7b3eee485

RFC 3986 defines :/?#[]@ as generic delimiters (gen-delims) and !$&'()*+,;= as sub-delimiters (sub-delims) that particular schemes may use. It also defines "unreserved characters":

   2.3. Unreserved Characters

   Characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved. These include uppercase and lowercase
   letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

   URIs that differ in the replacement of an unreserved character with
   its corresponding percent-encoded US-ASCII octet are equivalent: they
   identify the same resource. However, URI comparison implementations
   do not always perform normalization prior to comparison (see Section
   6).
   For consistency, percent-encoded octets in the ranges of ALPHA
   (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
   underscore (%5F), or tilde (%7E) should not be created by URI
   producers and, when found in a URI, should be decoded to their
   corresponding unreserved characters by URI normalizers.

But here only the externallinks table is affected, which is useful for tracking links. The URLs in the wikitext are displayed and linked as they were.

Therefore I am not sure this is a bug. What's the problem?
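
To make the two rule sets discussed above concrete, here is a minimal Python sketch. It is not the MediaWiki code (that is PHP, in Parser.php); the function names, the KEEP_ESCAPED constant, and the decision to upper-case the hex digits of escapes that are left in place are assumptions made for this example only. The first function follows the rule as described in the comment (decode everything in the open range (32,127) except the listed characters); the second follows RFC 3986 section 2.3 (decode only unreserved characters).

```python
import re
import string

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

# Characters the comment lists as staying escaped (copied from the comment;
# treating this as the complete list is an assumption).
KEEP_ESCAPED = set('<>"#{}|\\^~[]`;/?')

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _rewrite_escapes(url, should_decode):
    """Decode each %XX escape for which should_decode(char) is true."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if should_decode(char):
            return char
        # Escape is kept; upper-casing the hex digits is this sketch's choice.
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(fix, url)


def replace_unusual_escapes(url):
    """Hypothetical re-creation of the rule described in the comment."""
    return _rewrite_escapes(
        url, lambda c: 32 < ord(c) < 127 and c not in KEEP_ESCAPED
    )


def decode_unreserved(url):
    """Normalization per RFC 3986 section 2.3: decode unreserved only."""
    return _rewrite_escapes(url, lambda c: c in UNRESERVED)


url = "http://example.com/%41%2Fb%7E%c3%a4"
print(replace_unusual_escapes(url))  # http://example.com/A%2Fb%7E%C3%A4
print(decode_unreserved(url))        # http://example.com/A%2Fb~%C3%A4
```

Note that the two rules disagree on the tilde: it is in the list of characters that stay escaped, while RFC 3986 counts it as unreserved and asks normalizers to decode %7E.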
