TL;DR: You get to a spec by paying down the technical debt that ties wikitext parsing to the internals of mediawiki's implementation and state.

In discussions, there is far too much focus on the fact that you cannot write a BNF grammar (or use yacc / lex / bison / whatever), or that quote parsing is context-sensitive. I don't think that is as big a deal. For example, you could switch to Markdown and it wouldn't change much of the picture outlined below. All of that is a lesser issue compared to the following:

Right now, mediawiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy
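To make the shape of the problem concrete, here is a rough sketch in Python. None of these function names or signatures are real mediawiki APIs; they just illustrate the dependency surface of today's renderer versus the pure transformation a spec would want.

```python
import html

# Illustrative only: these are not real mediawiki APIs.

def render_today(wikitext, wiki_config, templates, media, parser_hooks,
                 site_messages, db_state, user_state):
    """Today's reality: output HTML is a function of all of this state,
    so an independent parser agrees only if it replicates all of it."""
    raise NotImplementedError  # stand-in for the real, entangled pipeline

def parse(wikitext, wiki_config):
    """Spec-friendly goal: a pure transformation of wikitext (plus config).
    Same inputs always produce the same output."""
    return "<p>" + html.escape(wikitext) + "</p>"  # trivially pure stand-in
```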

So, one reason for the complexity of implementing a wikitext parser is that the output HTML is not simply a straightforward transformation of input wikitext (and some config). Far too much other state gets in the way.

The second source of complexity is that markup errors aren't confined to narrow contexts; they can leak out and affect the output of the entire page. Some user pages even seem to exploit this as a feature (unclosed div tags).
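Here is a toy illustration of how scoping could contain such errors. This is my sketch, not what any RFC actually specifies, and the tag handling is deliberately simplistic: tags opened but not closed inside a scope get force-closed at the scope boundary instead of leaking into the rest of the page.

```python
import re

def close_scope(fragment):
    """Force-close tags left open inside a scope so errors stay local.
    Simplistic on purpose: assumes attribute-free, non-void tags."""
    opens = re.findall(r'<(\w+)>', fragment)
    closes = re.findall(r'</(\w+)>', fragment)
    unclosed = opens.copy()
    for tag in closes:
        if tag in unclosed:
            unclosed.remove(tag)
    # Close in reverse order of opening, at the scope boundary.
    return fragment + "".join(f"</{t}>" for t in reversed(unclosed))
```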

The third source of complexity is that some parser hooks expose internals of the implementation (Before/After Strip/Tidy and other such hooks). An implementation without Tidy, or one that handles wikitext differently, might not have the same pipeline.

However, we can still get to a spec that is much more replicable if we start cleaning this up incrementally and paying down the technical debt. Here are some efforts already underway towards that:

* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and making the output of templates (and extensions) a DOM (vs. a string), with some caveats that I will ignore here. If we can implement these, we immediately isolate the parsing of a top-level page from the details of how extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images, user state, and site messages not be inputs into the core wikitext parse. From a spec point of view, they should be viewed as post-processing transformations. For efficiency reasons, an implementation might choose to integrate them into the parse, but that is not a requirement.
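As one example of the post-processing view, red links could be modeled as a pass over the parser's output rather than an input to the core parse. This is a hypothetical sketch; the attribute names and HTML shape are illustrative, not Parsoid's actual output format.

```python
import re

def mark_red_links(html_out, existing_titles):
    """Annotate links to nonexistent pages after the core parse.
    The core parse never needed to consult the database."""
    def annotate(match):
        title = match.group(1)
        cls = "" if title in existing_titles else ' class="new"'
        return f'<a href="/wiki/{title}"{cls}>{title}</a>'
    # Matches only the simplified link shape used in this sketch.
    return re.sub(r'<a href="/wiki/([^"]+)">\1</a>', annotate, html_out)
```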

Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.

When all of these are done, it becomes far more feasible to define a spec for wikitext parsing that is not tied to the internals of mediawiki or its extensions. At that point, you could implement templating via Lua or JS or Ruby ... the specifics are immaterial. What matters is that those templating implementations and extensions produce output with certain properties. You can then specify mediawiki-HTML as a series of transformations applied to the output of the wikitext parser ... where there can be multiple spec-compliant implementations of that parser.

I think it is feasible to get there. But whether we want a spec for wikitext and should work towards one is a different question.

Subbu.

On 08/01/2016 08:34 PM, Gergo Tisza wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier <ro...@wikimedia.org> wrote:

Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?

Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.

There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


