https://bugzilla.wikimedia.org/show_bug.cgi?id=22555

--- Comment #11 from Philippe Verdy <[email protected]> 2011-07-17 22:13:49 UTC ---
(In reply to comment #5)
> Hi Phillippe,
> 
> This problem is nothing to do with Unicode...

I've not written that. I just said that MediaWiki's ParserFunctions can already
pass over UTF-8 byte sequences encoding a single code point, counting each such
sequence as a single character in {{padleft:}} and {{padright:}}. They should
then also be able to pass over the "\x7FUNIQ-*" placeholders generated to
protect spans of text that must not be reparsed as wiki code.

Note: if you use unstripBoth() when evaluating the parameters of
padleft/padright, you risk re-exposing wiki code present in the returned
string, because it will no longer be protected: the placeholders will have
disappeared from the source string before it is padded.

That's exactly why the code of padleft: and padright: has to pass over the
placeholders when counting characters, if it has to truncate a source string
that contains too many characters (no such counting is necessary if no
truncation occurs and only padding characters are added).

When the maximum number of characters is reached, the code should also know
whether the last character counted was part of a nowiki section (protected by
placeholders) or not. Note that characters present in nowiki sections, between
these placeholders, ARE to be counted.

If the character that reaches the maximum length is not within a nowiki
section, truncation can occur at its position. Otherwise, a new placeholder has
to be generated so that the remaining part of the protected section keeps its
nowiki protection.

This requires special code to handle the byte \x7F (passing over the whole
placeholder) and then to count the characters present in the referenced string
that the placeholder represents. This code must know, for each position,
whether it falls within a nowiki section or not.

For now the existing code effectively ignores the presence of placeholders,
counting their bytes as if they were normal characters. It only knows how to
pass over UTF-8 sequences when scanning the parameter, using a very basic
method: it checks whether each byte's two highest bits (7 and 6) form the
pattern "10", i.e. whether the byte is a UTF-8 trailing byte in the range
\x80-\xBF; such trailing bytes are simply not counted. It makes no special
case for the byte \x7F.
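The counting method described above can be sketched as follows (a minimal illustration in Python rather than MediaWiki's actual PHP; the marker id is made up): every UTF-8 trailing byte is skipped, but the \x7F marker bytes are counted as if they were ordinary characters, which is exactly the problem.

```python
def count_chars_basic(s: bytes) -> int:
    """Count characters the way the current loop does: skip UTF-8
    trailing bytes (0b10xxxxxx), count every other byte -- including
    the \x7F bytes of a strip marker -- as one character."""
    count = 0
    for b in s:
        # Trailing bytes of a multi-byte UTF-8 sequence are 0x80-0xBF:
        # their top two bits are "10".  They are not counted.
        if b & 0xC0 != 0x80:
            count += 1
    return count

# A 2-byte character ("é") correctly counts as 1 ...
print(count_chars_basic("é".encode("utf-8")))        # 1
# ... but every byte of a "\x7FUNIQ...\x7F" marker inflates the count.
print(count_chars_basic(b"ab\x7fUNIQ123\x7fcd"))     # 13
```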

Note that using unstripBoth() will also be less memory-efficient than the
string parsing loop, because it forces the parameter string to be reallocated
and copied even before the truncation or padding is actually performed. This is
not necessary: padleft/padright can work directly on the source string to
perform either the truncation (at the correct position), or the padding (into a
new string), or return the parameter itself (when the source string contains
exactly the requested number of characters, not counting the placeholders and
trailing bytes).

Writing this parsing loop is quite simple, but it requires managing one
additional piece of state: whether the current character position is in a
nowiki section or not, as signaled by the presence of placeholders (whose
UTF-8 encoded syntax only uses a fixed number of ASCII-only bytes).
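A sketch of such a loop, assuming a simplified marker syntax ("\x7F<id>\x7F") and a plain dict as the strip table — both are assumptions for illustration; MediaWiki's real markers and strip state differ. Python's str already iterates code points, so the UTF-8 trailing-byte handling is implicit here:

```python
def truncate_with_markers(s: str, limit: int, strip_table: dict) -> str:
    """Truncate s to `limit` characters.  Characters inside protected
    (nowiki) sections are counted via the strip table, but the marker
    bytes themselves are not.  If the cut falls inside a protected
    section, a fresh marker is emitted for the kept part so the
    remainder stays protected."""
    out = []
    count = 0
    i = 0
    while i < len(s) and count < limit:
        if s[i] == "\x7f":
            end = s.index("\x7f", i + 1)       # closing DEL of marker
            key = s[i + 1:end]
            inner = strip_table[key]
            room = limit - count
            if len(inner) <= room:
                out.append(s[i:end + 1])       # keep the whole marker
                count += len(inner)
            else:
                # Cut falls inside the protected section: register a
                # new marker for the truncated prefix ("-cut" id scheme
                # is hypothetical).
                new_key = key + "-cut"
                strip_table[new_key] = inner[:room]
                out.append("\x7f" + new_key + "\x7f")
                count = limit
            i = end + 1
        else:
            out.append(s[i])
            count += 1
            i += 1
    return "".join(out)
```

The marker's own characters contribute nothing to the count; only the referenced text does, matching the counting rule described above.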

---

Anyway, I do think that the "UNIQ" encoding of nowiki placeholders is overkill
(this syntax is really overlong) when the nowiki sections are empty: it should
be enough to represent these empty nowikis (used for example to protect leading
or trailing spaces from whitespace stripping, or from interpretation as special
wiki syntax at the beginning of a paragraph) simply as "\x7F\x7F" (you don't
need anything else for an empty nowiki).

The double DEL is only needed to disambiguate normal text from the inner unique
id of a represented non-empty nowiki section, because those placeholders are
also enclosed in a pair of DEL characters. You could even represent an empty
nowiki as a single control byte "\x01" (which would simply be stripped out when
generating the final HTML after the wiki code parsing).

The "UNIQ" id carried by the full placeholder is ONLY needed to reference a
protected non-empty substring stored elsewhere in a strings table (this
encoding is designed so that the value of the id is preserved even if the
surrounding text is case-converted).

When performing successive substitutions of templates, we should not even need
to append text: the whole final text can be reconstructed, recursively, from
these ids, which reference stable substrings that need no further parsing in
calling templates or parser functions. This could speed up page generation a
lot, by not copying large amounts of text across template substitutions each
time.
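The reconstruction idea can be illustrated with a toy recursive unstrip pass (a Python sketch; the marker syntax, id names, and table are assumptions for illustration, not MediaWiki code):

```python
import re

# Hypothetical marker: a DEL-delimited id referencing the table.
MARKER = re.compile("\x7f([0-9A-Za-z-]+)\x7f")

def unstrip_all(s: str, table: dict) -> str:
    """Recursively replace \x7F<id>\x7F markers with their stored
    text, expanding markers nested inside stored text as well."""
    return MARKER.sub(lambda m: unstrip_all(table[m.group(1)], table), s)

table = {
    "inner": "world",
    "outer": "hello \x7finner\x7f",   # stored text may itself hold markers
}
print(unstrip_all("<< \x7fouter\x7f >>", table))   # << hello world >>
```

Each intermediate expansion only ever passes short marker strings around; the full text is assembled once, at the end.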

---

Note that for parser functions that evaluate the actual text of a parameter in
order to infer a value (for example the first parameter of #expr or #ifexpr, or
the first parameter and the case parameters of #switch before "="), only these
fully processed parameters can effectively use unstripBoth().

This unstripBoth() should not be applied to parameters that will be returned
unchanged (if they are returned), such as the second or third parameter of
#if*, or the returned values in parameters 2 (or more) of #switch after the
equal sign. For #switch, unstripBoth() should only be done after detecting the
presence of the equal sign and splitting on it.
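That ordering could look like the following sketch (hypothetical helper names; `unstrip` stands in for unstripBoth()): split each #switch arm on the first "=" first, then unstrip only the case key that gets compared, leaving any markers in the returned value intact.

```python
def split_switch_arm(param: str, unstrip):
    """Split a #switch arm "case=result" on the first '='.  Only the
    case key (which must be compared literally) is unstripped; the
    result keeps its markers so it stays protected when returned."""
    if "=" in param:
        case, result = param.split("=", 1)
        return unstrip(case.strip()), result
    # No '=': the whole arm is a bare case key (fall-through form).
    return unstrip(param.strip()), None
```

With a dummy `unstrip` that expands one marker, the case key is expanded for comparison while the result string keeps its marker untouched.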

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.
