>> Even an annotated HTML DOM (using the data-* attributes for example)
>> could be used. We might actually be able to off-load most
>> context-sensitive parts of the parsing process to the browser's HTML
>> parser by feeding it pre-tokenized HTML tag soup, for example via
>> .innerHTML.
>
> I'm not sure what you are proposing -- are you suggesting that we let
> some anomalies persist and let the browser take care of it?

Yes and no ;) I was speculating on the possibility of using the built-in
HTML5 parser of modern browsers to implement part of our in-browser
parsing pipeline, especially for the visual editor. When we feed tag soup
produced by a CFG-based tokenizer to a modern browser with an HTML5
parser (e.g., FF4+) via .innerHTML, the browser will sanitize the input
according to the HTML5 parser spec. If we then read the .innerHTML back,
we'll get a sanitized serialization (see example at end). But of course
we could just use the cleaned-up DOM fragment directly, walk it, and turn
it into WikiDom.

This is just an idea at this stage, and there might be more issues that
sink it. The preservation of overlaps in annotations in particular might
be tricky. HTML5 parsers break overlapping ranges up into non-overlapping
ones, so they would need to be merged back together when building the
WikiDom. Alternatively, there are JavaScript libraries implementing the
HTML5 parser spec, which can be modified if plain HTML5 behavior is not
ideal.

> IMO we should be shooting for server APIs that give users very clean
> data structures, so they can transform them however they like. HTML
> should be just one of the output formats.

I completely agree. On the server side, higher-level parsing into a
suitable tree (DOM or otherwise) or corresponding SAX events would be
performed by a (possibly modified) HTML5 parser. The output of this
parser is in no way limited to HTML.

Gabriel

Example in FF 4+:

>>> document.body.innerHTML = "<b data-x='y'>bb<i>bbii</b>ii</i>"
"<b data-x='y'>bb<i>bbii</b>ii</i>"
>>> document.body.innerHTML
"<b data-x="y">bb<i>bbii</i></b><i>ii</i>"
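
To make the "merge the split ranges back together" step a bit more
concrete, here is a rough sketch of how it might look. It is only an
illustration under several assumptions: the node shape (a string, or an
object with tag, attrs and children) is a hypothetical stand-in for real
DOM nodes so the snippet runs without a browser, the flat text-plus-ranges
output is not the actual WikiDom format, and flatten() and mergeAdjacent()
are names made up for this example.

```javascript
// Hypothetical node shape: a string (text) or { tag, attrs, children }.
// A real implementation would walk browser DOM nodes instead.

// Flatten a tree into plain text plus one annotation range per element.
function flatten(nodes) {
  let text = '';
  const ranges = [];
  function walk(node) {
    if (typeof node === 'string') {
      text += node;
      return;
    }
    const start = text.length;
    (node.children || []).forEach(walk);
    ranges.push({ tag: node.tag, attrs: node.attrs || {}, start, end: text.length });
  }
  nodes.forEach(walk);
  return { text, ranges };
}

// Merge ranges with the same tag and attributes that touch end-to-start.
// This undoes the splitting shown in the FF example, on the assumption
// that touching identical annotations were one range in the source.
function mergeAdjacent(ranges) {
  const key = r => r.tag + ' ' + JSON.stringify(Object.entries(r.attrs).sort());
  const sorted = ranges.slice().sort((a, b) => a.start - b.start || a.end - b.end);
  const last = new Map(); // key -> most recent merged range with that key
  const merged = [];
  for (const r of sorted) {
    const k = key(r);
    const prev = last.get(k);
    if (prev && prev.end === r.start) {
      prev.end = r.end; // extend the earlier range instead of adding a new one
    } else {
      const copy = { ...r };
      last.set(k, copy);
      merged.push(copy);
    }
  }
  return merged;
}

// The sanitized tree from the example: <b data-x="y">bb<i>bbii</i></b><i>ii</i>
const sanitized = [
  { tag: 'b', attrs: { 'data-x': 'y' }, children: ['bb', { tag: 'i', children: ['bbii'] }] },
  { tag: 'i', children: ['ii'] }
];

const { text, ranges } = flatten(sanitized);
const merged = mergeAdjacent(ranges);
console.log(text, JSON.stringify(merged));
```

On the sanitized tree above, the two <i> fragments collapse back into a
single range that overlaps the <b> range, which is the overlap the
original tag soup expressed. The obvious weakness is that this would also
merge two identical elements that really were separate and adjacent in
the source, so a real implementation would probably need some extra
marker (perhaps a data-* attribute set by the tokenizer) to tell the two
cases apart.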
