[Bug 46516] New: Generalise the Parsoid structure and internal representations.

bugzilla-daemon Sun, 24 Mar 2013 15:13:29 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=46516


       Web browser: ---
            Bug ID: 46516
           Summary: Generalise the Parsoid structure and internal
                    representations.
           Product: Parsoid
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: JS/General
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected],
                    [email protected], [email protected],
                    [email protected]
            Blocks: 37933, 37934
    Classification: Unclassified
   Mobile Platform: ---

So Molly and I are at LibrePlanet talking about bug 37933.

We decided that while Wikitext --> HTML --> LaTeX is possible, it's not
terribly useful and doesn't really take advantage of the structure of Parsoid.
We see it as adding an extra stage to the parse, which will potentially add to
the time the parse takes. The alternative is a generalized token stream, with
conversion to a specific format starting only once the token stream is ready.

Obviously this means a few big things, potentially:

1. The DOM post processor needs to either run before the HTML5 tree builder, on
tokens or some other structure, or it needs to be emulated for each format. I'm
leaning towards the former, because if we're going to export to multiple
formats it would make more sense to have one file for each format that builds
the export from a token structure, rather than two files each, one to build
the export and one to do the postprocessing.

2. Because we won't necessarily be dealing with HTML in the end, we shouldn't
be talking about tokens with HTML-specific tag names. Probably we could just
use canonical Parsoid-specific names - something like
http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary - or maybe something
similar to the *_NODE constants on DOM nodes (ELEMENT_NODE, TEXT_NODE, ...),
with a mapper to some canonical integer values that are defined in the base
Token class.
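
To make points 1 and 2 a bit more concrete, here is a minimal sketch in Python
(not Parsoid's actual JavaScript, and not its real token classes). Everything
in it is hypothetical: the Kind values, the Token shape, and the two lookup
tables are invented for illustration, and it does no escaping. It only shows
tokens that carry canonical integer kinds instead of HTML tag names, with one
small table per output format consuming the same finished stream.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable


class Kind(IntEnum):
    # Hypothetical canonical kinds, in the spirit of DOM's *_NODE constants.
    TEXT = 1
    BOLD_START = 2
    BOLD_END = 3
    HEADING_START = 4
    HEADING_END = 5


@dataclass(frozen=True)
class Token:
    kind: Kind
    text: str = ""


# One table per output format; the tokens themselves never mention HTML tags.
HTML: Dict[Kind, str] = {
    Kind.BOLD_START: "<b>", Kind.BOLD_END: "</b>",
    Kind.HEADING_START: "<h2>", Kind.HEADING_END: "</h2>",
}
LATEX: Dict[Kind, str] = {
    Kind.BOLD_START: "\\textbf{", Kind.BOLD_END: "}",
    Kind.HEADING_START: "\\section{", Kind.HEADING_END: "}",
}


def serialize(tokens: Iterable[Token], table: Dict[Kind, str]) -> str:
    """Emit a finished token stream using one format's lookup table."""
    return "".join(
        t.text if t.kind is Kind.TEXT else table[t.kind] for t in tokens
    )


stream = [
    Token(Kind.HEADING_START), Token(Kind.TEXT, "Intro"),
    Token(Kind.HEADING_END),
    Token(Kind.BOLD_START), Token(Kind.TEXT, "hi"), Token(Kind.BOLD_END),
]
html_out = serialize(stream, HTML)
latex_out = serialize(stream, LATEX)
```

Any post-processing would then run once on the token stream before either
table is applied, which is the "one file per format" arrangement from point 1.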

Footnote: As I was thinking about this and trying to come up with how I wanted
it to look, I realized the problem was that I was treating it as converting
between wikitext and either LaTeX or HTML. If we follow our long term plan,
though, "LaTeX export" would also require HTML-to-LaTeX, because HTML would be
our storage mechanism. So I think it might be better to rewrite each bit of our
system to convert each format to and from a canonical internal representation,
rather than to and from any one other format.
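
A rough sketch of what the footnote describes, again in Python and with
made-up names (read_wikitext, write_html, write_latex, convert, and the list
of (kind, text) pairs used as the canonical form are all assumptions, not
anything Parsoid has). The point is only the shape: with n formats, direct
conversion needs up to n*(n-1) one-way converters, while going through a
canonical form needs at most one reader and one writer per format.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical canonical form: a flat list of (kind, text) pairs.
Canonical = List[Tuple[str, str]]


def read_wikitext(src: str) -> Canonical:
    # Toy reader: understands only '''bold''' and plain text.
    parts = src.split("'''")
    return [("bold" if i % 2 else "text", p) for i, p in enumerate(parts) if p]


def write_html(doc: Canonical) -> str:
    return "".join("<b>" + t + "</b>" if k == "bold" else t for k, t in doc)


def write_latex(doc: Canonical) -> str:
    return "".join("\\textbf{" + t + "}" if k == "bold" else t for k, t in doc)


# Each format registers at most one reader and one writer.
READERS: Dict[str, Callable[[str], Canonical]] = {"wikitext": read_wikitext}
WRITERS: Dict[str, Callable[[Canonical], str]] = {
    "html": write_html,
    "latex": write_latex,
}


def convert(src: str, source_format: str, target_format: str) -> str:
    """Convert by going through the canonical form, never format-to-format."""
    return WRITERS[target_format](READERS[source_format](src))
```

With this layout, adding an "html" entry to READERS would be enough to get
HTML-to-LaTeX, without writing a dedicated converter for that pair.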

I'm posting here because I want thoughts and feedback. It should be noted that
bug 37934 would also benefit from any of the work we did on the generalisation
problem - and we could probably open a tracking bug to figure out all of these
things more generally.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are watching all bug changes.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
