[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Wed, 04 Aug 2010 10:10:27 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #205 from Philippe Verdy <verd...@wanadoo.fr> 2010-08-04 17:09:44 UTC ---
(In reply to comment #204)
> If people want to put crazy stuff in sortkeys that changes based on who's
> viewing it, we can't stop them.  Curly braces are evaluated at an earlier stage
> than category links, so we can't make them behave differently based on whether
> they're being used in a sortkey, I don't think.  You could also put
> {{CURRENTTIME}} in sortkeys, or any other silly thing like that.

That's because the MediaWiki parser still operates at the wrong level: it
always performs text substitutions while ignoring the context of use. Not only
is this bad, it is also very inefficient, because each processing level
converts a full page into another full page that must be reparsed at the next
level.

A much better scheme, based on a true grammatical analyser using a finite-state
automaton, would really help to define the state in which the parser (or its
ParserFunctions extensions) operates, without ever having to create huge
complete page buffers between levels (which requires costly append operations
with many memory reallocations).

In other words, when you start parsing "[[" you enter a "start-of-link" state,
which scans "category:" until it finds the colon, at which point the whole
prefix is case-folded and the parser goes to the "in-category" state; it then
scans up to the next "{" or "|" or "]". It can then correctly process all the
text using such rules.
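
To make the state transitions above concrete, here is a minimal sketch in
Python. It is not MediaWiki code: the state names, the function
scan_category_links and the decision to treat "{" as a plain character (no
template expansion) are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()         # ordinary page text
    LINK_PREFIX = auto()  # after "[[", reading up to the colon
    IN_CATEGORY = auto()  # reading the category name
    IN_SORTKEY = auto()   # after "|", reading the sort key


def scan_category_links(text):
    """Return (category, sortkey) pairs found in text, in a single pass."""
    state = State.TEXT
    buf = []      # characters collected for the current item
    name = None   # category name, once "|" has been seen
    found = []
    i = 0
    while i < len(text):
        ch = text[i]
        if state is State.TEXT:
            if text.startswith("[[", i):
                state, buf = State.LINK_PREFIX, []
                i += 2
                continue
        elif state is State.LINK_PREFIX:
            if ch == ":":
                # The whole prefix is case-folded before it is compared.
                prefix = "".join(buf).strip().casefold()
                state = State.IN_CATEGORY if prefix == "category" else State.TEXT
                buf = []
            elif ch in "|]":
                state = State.TEXT  # not a namespaced link, ignore it
            else:
                buf.append(ch)
        elif text.startswith("]]", i):
            # Closing token, reached from IN_CATEGORY or IN_SORTKEY.
            value = "".join(buf).strip()
            if state is State.IN_CATEGORY:
                found.append((value, None))
            else:
                found.append((name, value))
            state = State.TEXT
            i += 2
            continue
        elif state is State.IN_CATEGORY and ch == "|":
            name = "".join(buf).strip()
            state, buf = State.IN_SORTKEY, []
        else:
            buf.append(ch)
        i += 1
    return found
```

Because the scanner always knows whether it is inside a sort key, a real
implementation could decide at that point how to treat a "{{...}}" it
encounters there, instead of having it expanded blindly at an earlier stage.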

I have long suggested that the MediaWiki syntax should be reformulated as a
formal LALR(1) grammar, from which a finite-state automaton can be built
automatically to cover all the state information, and then ported to PHP
(Yacc, Bison or even better PCCTS could do that without difficulty). Then,
instead of calling parsing functions that process all the text and return the
converted text for processing at the next level, the parser would run a much
simpler (and faster) processing loop, calling much simpler generation methods
and altering its state in a much cleaner and faster way: no more need to append
small lexical items to various buffers, since the atoms would be passed from
level to level using a chained implementation.
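
One possible reading of the "chained implementation" is a pipeline in which
each level pulls atoms from the previous one on demand. The sketch below uses
Python generators for that; the token pattern and the function names
(tokenize, fold_category_prefix, render) are hypothetical and far simpler than
the real wikitext grammar.

```python
import re

# Hypothetical atoms: the paired delimiters, the pipe, and runs of other text.
TOKEN_RE = re.compile(r"\[\[|\]\]|\{\{|\}\}|\||[^\[\]{}|]+|.", re.S)
PREFIX = "category:"


def tokenize(text):
    """First level: yield atoms one at a time, without building a list."""
    for match in TOKEN_RE.finditer(text):
        yield match.group(0)


def fold_category_prefix(tokens):
    """Second level: normalise a 'category:' prefix that directly follows '[['."""
    prev = None
    for tok in tokens:
        if prev == "[[" and tok[:len(PREFIX)].casefold() == PREFIX:
            tok = "Category:" + tok[len(PREFIX):]
        prev = tok
        yield tok


def render(tokens):
    """Final level: the only place where the output string is assembled."""
    return "".join(tokens)


page = "a [[CATEGORY:Foo|k]] b"
output = render(fold_category_prefix(tokenize(page)))
```

Each level holds only the current atom and a little state, so no intermediate
full-page buffer exists between the levels; only render() builds a string.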

This would also significantly speed up the expansion of templates, and would
allow the parser to make distinctions when {{int:....}} is encountered in the
context of a [[category:...]] (which would temporarily force the UI language to
the CONTENTLANGUAGE until the final "]]" token is encountered, which would
restore the UI language within the parser's internal state). This would also
have significant performance benefits: for example, no more need to convert
and expand all the parameters of {{#if:...}} or {{#switch:...}}; they would
only be converted lazily, when they are really needed to generate the actual
output.
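
The temporary switch of the UI language inside a category link could be kept
in the parser's state roughly as follows. This is only a sketch under stated
assumptions: ParserState, in_category_link, expand_int and the MESSAGES table
are invented names, not MediaWiki APIs.

```python
from contextlib import contextmanager

# Hypothetical message table keyed by (language, message key).
MESSAGES = {
    ("en", "mainpage"): "Main Page",
    ("fr", "mainpage"): "Accueil",
}


class ParserState:
    def __init__(self, content_lang, user_lang):
        self.content_lang = content_lang
        self.ui_lang = user_lang

    @contextmanager
    def in_category_link(self):
        """Force the UI language to the content language until ']]'."""
        saved = self.ui_lang
        self.ui_lang = self.content_lang
        try:
            yield
        finally:
            self.ui_lang = saved  # restored when the link is closed

    def expand_int(self, key):
        """Stand-in for {{int:key}}: look the message up in the current UI language."""
        return MESSAGES.get((self.ui_lang, key), key)


state = ParserState(content_lang="en", user_lang="fr")
outside = state.expand_int("mainpage")
with state.in_category_link():
    inside = state.expand_int("mainpage")
```

Here `outside` follows the viewer's language while `inside` follows the
content language, so a sort key built from it would no longer depend on who is
viewing the page.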

Yes, this comment goes too far off the topic of this bug; it is architectural.
PHP has all the tools needed to support the construction of tree-like data:
instead of passing only strings to parser generation functions, you would pass
them an ordered array whose elements are the parsed parameters, themselves
being either strings or subparsed arrays containing strings and other arrays.
The builtin functions would then ask the MediaWiki parser to evaluate/expand
ONLY the parameters they need, and MediaWiki would still be able to call the
expansions of ONLY these parameters, recursively. Most of the items in the
wikicode would then be atomic and processed lazily, only when they are
actually needed.
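
As an illustration of such a tree with lazy expansion, here is a small Python
sketch in which nested lists stand in for the PHP arrays. The node layout
([name, arg1, arg2, ...]), the expand function and the templates dictionary
are assumptions made for the example, not the real parser interface.

```python
def expand(node, templates):
    """Expand a parsed node: either a str, or a list [name, arg1, arg2, ...]."""
    if isinstance(node, str):
        return node
    name, *args = node
    if name == "#if":
        # Only the condition is expanded up front; the branch that is not
        # selected is never expanded at all.
        cond = expand(args[0], templates).strip()
        if cond:
            branch = args[1] if len(args) > 1 else ""
        else:
            branch = args[2] if len(args) > 2 else ""
        return expand(branch, templates)
    # Ordinary (hypothetical) template: expand its arguments, then call it.
    return templates[name](*[expand(arg, templates) for arg in args])


expanded = []  # records which templates were actually called


def cheap():
    expanded.append("cheap")
    return "ok"


def expensive():
    expanded.append("expensive")
    return "slow result"


tree = ["#if", " yes ", ["cheap"], ["expensive"]]
result = expand(tree, {"cheap": cheap, "expensive": expensive})
```

After this runs, `expanded` contains only "cheap": the "expensive" subtree was
parsed into the tree but never evaluated, which is the saving described above.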

The "crazy" things like {{CURRENTTIME}} or {{time:...}} or {{int:...}} found
within [[category:...]], whose result depends on time or on user preferences,
would be easily avoided. This would also greatly simplify the management of
whitespace (whether it needs to be trimmed and/or compressed depends on the
builtin expansion function called by the parser).

