https://bugzilla.wikimedia.org/show_bug.cgi?id=22555
--- Comment #11 from Philippe Verdy <[email protected]> 2011-07-17 22:13:49 UTC ---
(In reply to comment #5)
> Hi Phillippe,
>
> This problem is nothing to do with Unicode...

I did not write that. I only said that MediaWiki's parser functions can already pass over UTF-8 byte sequences that encode a single code point, counting each such sequence as a single character in {{padleft:}} and {{padright:}}. They should then also be able to pass over the "\x7FUNIQ-*" placeholders generated to protect spans of text that must not be reparsed as wiki code.

Note: if you use unstripBoth() when evaluating the parameters of padleft/padright, you risk re-exposing wiki code present in the returned string, because it will no longer be protected by the placeholders, which will have disappeared from the source string before it is padded.

That is exactly why the code of padleft: and padright: has to pass over the placeholders when counting characters, if it has to truncate a source string that contains too many characters (no such counting is necessary if truncation does not occur and only padding characters are added).

When the maximum number of characters is reached, the code should also know whether the last character counted was part of a nowiki section (protected by placeholders) or not (note that characters present in nowiki sections ARE to be counted, between these placeholders). If the last character counted, the one that reaches the maximum length, was not within a nowiki section, truncation can occur at the position of this character. Otherwise, a new placeholder has to be generated to correctly maintain the nowiki restriction on the remaining part of the protected section. This requires special code to handle the byte \x7F (and pass over the whole placeholder), and then to count the characters present in the referenced string that the placeholder represents.
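To make the idea concrete, here is a minimal sketch in Python (not the PHP used by MediaWiki, and not its actual code) of a truncation loop that passes over placeholders. It rests on assumptions that are mine: a placeholder is taken to be everything between one \x7F byte and the next, the protected text is looked up in a plain dict called `table`, and the names `utf8_len`, `utf8_cut`, `truncate` and the "UNIQ-cut-N-QINU" marker format are hypothetical.

```python
DEL = 0x7F  # assumption: this byte opens and closes every placeholder


def utf8_len(data: bytes) -> int:
    """Count code points: every byte except UTF-8 trailing bytes (0x80-0xBF)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_cut(data: bytes, n: int) -> bytes:
    """Return the prefix of data that holds its first n code points."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # an ASCII or lead byte starts a new character
            if seen == n:
                return data[:i]
            seen += 1
    return data


def truncate(text: bytes, max_chars: int, table: dict) -> bytes:
    """Keep at most max_chars characters, counting protected text as well.

    table is a hypothetical stand-in for the parser's strip table; it maps
    each placeholder (bytes) to the protected text (bytes) it represents.
    """
    out = bytearray()
    count = 0
    i = 0
    while i < len(text) and count < max_chars:
        if text[i] == DEL:
            end = text.index(DEL, i + 1)  # raises ValueError if unterminated
            marker = text[i:end + 1]
            inner = table[marker]
            n = utf8_len(inner)
            if count + n <= max_chars:
                out += marker  # the whole protected span fits: keep it as is
                count += n
            else:
                # The cut falls inside a protected span: register a new
                # placeholder (made-up format) for the part that is kept.
                new_marker = b"\x7fUNIQ-cut-%d-QINU\x7f" % len(table)
                table[new_marker] = utf8_cut(inner, max_chars - count)
                out += new_marker
                count = max_chars
            i = end + 1
        else:
            j = i + 1  # copy one whole character, trailing bytes included
            while j < len(text) and (text[j] & 0xC0) == 0x80:
                j += 1
            out += text[i:j]
            count += 1
            i = j
    return bytes(out)
```

With a table that maps one placeholder to the protected text "[[x]]", truncating "éa" + placeholder + "bc" to 7 characters keeps the original placeholder untouched, while truncating it to 4 characters returns "éa" followed by a new placeholder whose table entry is "[[". The source string is never unstripped, so the protected wiki code is never exposed.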
The padding code must know, for each position, whether it falls within a nowiki section or not. For now the existing code effectively ignores the presence of placeholders, counting them as if they were normal characters. It only knows how to parse UTF-8 sequences when scanning the parameter, using a very basic method: it checks the two highest bits (7 and 6) of each byte to see whether it is a UTF-8 trailing byte (bit 7 set and bit 6 clear), because every trailing byte between \x80 and \xBF is simply not counted. It makes no special case when the parsed byte is \x7F.

Note that using unstripBoth() will be less memory-efficient than the string parsing loop, because it forces the parameter string to be reallocated and copied even before the truncation or padding is actually performed. This is not necessary: padleft/padright can work directly on the source string, either truncating it (at the correct position), padding it (into a new string), or returning the parameter itself (when the source string contains exactly the requested number of characters, not counting the placeholders and trailing bytes).

Writing this parsing loop is quite simple, but it requires managing one additional state: whether the current character position is in a nowiki section or not, as signaled by the presence of placeholders, whose UTF-8 encoded syntax uses only a fixed number of ASCII-only bytes.

---

Anyway, I do think that the "UNIQ" encoding of nowiki placeholders is overkill (in fact this syntax is really overlong) when these nowiki sections are empty: it should be enough to represent these empty nowikis (used, for example, to protect leading or trailing spaces from whitespace stripping, or from interpretation as special wiki syntax at the beginning of a paragraph) simply as "\x7F\x7F" (you don't need anything else for an empty nowiki).
The double DEL is still needed only to disambiguate normal text from the inner unique id of a placeholder representing a non-empty nowiki section (because these placeholders are also enclosed in a pair of DEL characters). You could even represent an empty nowiki as a single control byte "\x01" (which would simply be stripped out when generating the final HTML after the wiki code parsing).

The "UNIQ" id represented internally by the full placeholder is ONLY needed to reference a protected non-empty substring stored elsewhere in some string table (this encoding is designed so that the value of the id is preserved even if the text is converted to another case).

When performing successive substitutions of templates, we should not even use text appending: the whole final text can be reconstructed, recursively, by using these ids, which represent stable substrings that need no further parsing in calling templates or parser functions. This could greatly speed up the generation of pages, without having to copy large amounts of text each time across template substitutions.

---

Note that for parser functions that evaluate the actual text of a parameter in order to infer a value (for example the first parameter of #expr or #ifexpr, or the first parameter and the case parameters of #switch before the "="), only these fully processed parameters can effectively use unstripBoth().

unstripBoth() should not be applied to parameters that will be returned unchanged (if they are returned), such as the second or third parameter of #if*, or the returned values in parameters 2 (or more) of #switch after the equal sign. For #switch, unstripBoth() should only be done after detecting the presence of the equal sign and splitting on it.
