Re: [Wikitech-l] GSoC project advice: port texvc to Python?

Aryeh Gregor Tue, 23 Mar 2010 09:44:34 -0700

On Tue, Mar 23, 2010 at 4:06 AM, Damon Wang <[email protected]> wrote:
> I'm interested in porting texvc to Python, and I was hoping this list
> here might help me hash out the plan. Please let me know if I should
> take my questions elsewhere.

Python is much better than OCaml, and I prefer Python to PHP, but a
PHP implementation would be preferable for core IMO.  Not all
MediaWiki developers know Python, but all obviously know PHP.  If you
did a Python implementation, though, then at least someone could
translate it to PHP pretty easily.

> 1. Collect test cases and write a testing script
> Thanks to avar from #wikimedia, I already have the <math>...</math> bits
> from enwiki and dewiki. I would also construct some simpler ones by hand
> to test each of the acceptable LaTeX commands.
>
> Would there be any possibility of logging the input seen by texvc on a
> production instance of Mediawiki, so I could get some invalid input
> submitted by actual users?
>
> This could also be useful to future maintainers for regression testing.

If you have a Unix box handy, it's pretty easy to install MediaWiki
with math support so you can test yourself.  sudo apt-get install
mediawiki mediawiki-math should do it on anything Debian-based, for
example.
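
For the test-case collection itself, here is a rough sketch in Python of
pulling the <math> bodies out of a blob of wikitext.  The names MATH_TAG
and extract_math are made up for illustration, and I'm assuming a plain
regex is good enough for harvesting samples (it ignores <nowiki> and
comments):

```python
import re

# Matches <math>...</math>, allowing attributes on the opening tag.
MATH_TAG = re.compile(r"<math(?:\s[^>]*)?>(.*?)</math>", re.DOTALL | re.IGNORECASE)


def extract_math(wikitext):
    """Return the stripped, non-empty bodies of every <math> element."""
    bodies = (m.group(1).strip() for m in MATH_TAG.finditer(wikitext))
    return [b for b in bodies if b]


if __name__ == "__main__":
    sample = "Euler: <math>e^{i\\pi} + 1 = 0</math> and <math> x </math>."
    for formula in extract_math(sample):
        print(formula)
```

Feeding each extracted body to both the old and the new implementation
and diffing the results would give a crude regression test.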

> 2. Implement an AMS-TeX validator
> I'll probably use PLY because it's rumored to have helpful debugging
> features (designed for a first-year compilers class, apparently). ANTLR
> is another popular option, but this guy
>    http://www.bearcave.com/software/antlr/antlr_expr.html
> thinks it's complicated and hard to debug. I've never used either, so if
> anyone on this list knows of a good Python parsing package I'd welcome
> suggestions.

If it's in PHP, you'd probably have to write a parser yourself, but
LaTeX is pretty easy to parse, I'd think.

> 4. Add HTML rendering to texvc and test script
> I don't even understand how the existing texvc decides whether HTML is
> good enough. It looks like the original programmer just decreed that
> certain LaTeX commands could be rendered to HTML, and defaults to PNG if
> it sees anything not on that list. How important is this feature?

Fairly important, IMO, if the goal is to replace texvc, although not
critical.  <math>x</math> shouldn't render x as a PNG -- that's silly.
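
To illustrate the kind of decision I mean, here is a deliberately tiny
sketch in Python.  It is not how texvc decides; the SIMPLE pattern and
render_html are invented for this example, and the rule "letters,
digits and a few operators are HTML-safe" is just an assumption:

```python
import html
import re

# Assumption: only ASCII letters, digits, spaces and a few operators are HTML-safe.
SIMPLE = re.compile(r"[A-Za-z0-9 +\-=(),.]+")


def render_html(tex):
    """Return HTML for trivially simple input, or None to signal PNG fallback."""
    if not SIMPLE.fullmatch(tex):
        return None
    # Italicise runs of letters (variables); leave everything else upright.
    parts = re.split(r"([A-Za-z]+)", tex)
    return "".join(
        "<i>%s</i>" % p if p.isalpha() else html.escape(p)
        for p in parts
    )


if __name__ == "__main__":
    print(render_html("x"))
    print(render_html(r"\frac{1}{2}"))
```

Anything containing a backslash or a brace falls through to None, so
the caller would render it as a PNG as before.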

> Python doesn't have parsing just locked right down the way C does with
> flex/bison, but there are some good options, I have the most experience
> with it, and I think I'd be able to complete the port faster in Python
> than in either of the other languages. I was tempted at first to port to
> PHP, to conform with the rest of Mediawiki, but there don't seem to be
> any good parsing packages for PHP. (Please tell me if that's wrong.)

Would it really be very hard to write a LaTeX parser in PHP?  I'd
think it could be done easily, if you permit only a carefully-selected
subset.  I don't think you'd need any parser theory, just use
preg_split() and loop through all the tokens.
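
Roughly what I have in mind, sketched in Python rather than PHP since
that's what you know (re.findall standing in for preg_split).  The
ALLOWED_COMMANDS set is a hypothetical stand-in, not the real texvc
whitelist, and the only structure checked is brace balance:

```python
import re

# Hypothetical whitelist; the real texvc list is much longer.
ALLOWED_COMMANDS = {"\\frac", "\\sqrt", "\\alpha", "\\beta", "\\sum", "\\int"}

# A control word (\name), a control symbol (\{ etc.), whitespace, or one other char.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\s+|.", re.DOTALL)


def tokenize(tex):
    """Split tex into tokens, dropping whitespace."""
    return [t for t in TOKEN.findall(tex) if not t.isspace()]


def validate(tex):
    """Return (ok, reason); reason is empty when the input is accepted."""
    depth = 0
    for tok in tokenize(tex):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                return False, "unmatched }"
        elif tok == "\\":
            return False, "stray backslash"
        elif len(tok) > 1 and tok[0] == "\\" and tok[1].isalpha():
            if tok not in ALLOWED_COMMANDS:
                return False, "command not allowed: " + tok
    if depth != 0:
        return False, "unclosed {"
    return True, ""


if __name__ == "__main__":
    print(validate(r"\frac{1}{2}"))
    print(validate(r"\input{/etc/passwd}"))
```

This doesn't check how many arguments each command takes, which a real
validator would probably need to do.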

> I'd appreciate any advice or criticism. Since my only previous
> experience has been using Wikipedia and setting up a test Mediawiki
> instance for my ACM chapter, I'm only just now learning my way around
> the code base and it's not always evident why things were done as they
> are. Does this look like a reasonable and worthwhile project?

Rewriting texvc in PHP would be a nice project to have, which is small
enough in scope that I'm optimistic that it could be done in a summer.
I'd say it's a good choice.

On Tue, Mar 23, 2010 at 6:23 AM, Conrad Irwin
<[email protected]> wrote:
> I am not too fussed about the HTML output, though I can't speak for
> everyone, at the moment it seems that many more of the Unicode
> characters should be let through (at least at some level of HTML),
> though I don't know enough about worldwide unicode support.

I suspect we need to be about as conservative as we currently are for
platforms like IE6 on XP.  We should be able to expand the range of
HTML characters in the future, though.

> A good PHP parser library would be exceptionally useful for MediaWiki
> (and many extensions), at the moment we have loads of methods that do
> regex "parsing", so if you felt like writing one... :D.

Wouldn't a real generic parser implementation written in PHP be too
slow to be useful?  preg_replace() has the advantage of being
implemented in C.

> I am
> less convinced of the utility of a Python port, OCaml is a great
> language for implementing this, and I fear a lot of your time would be
> wasted trying to make the Python similarly nice. As you note, MediaWiki
> is not written in Python, doing this in PHP would be a larger step in
> the right direction, though without such nice frameworks, maybe less
> nice to do.

OCaml might be a great language for implementing this, but very few of
us understand it.  texvc has been totally unmaintained for years,
other than new things being added to the whitelist sometimes by means
of cargo-culting what previous commits do.  Rewriting texvc in
*anything* that more people understand would be a step forward.

On Tue, Mar 23, 2010 at 8:31 AM, Roan Kattouw <[email protected]> wrote:
> As
> mentioned on the same bug, shelling out to Lilypond has certain issues
> with unbounded time/CPU/memory usage.

The same is true for LaTeX.  Lilypond would just need a parser and
filter to whitelist safe constructs, like texvc provides for LaTeX.
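
Separately from whitelisting, whatever shells out can at least bound
the child process.  A minimal Unix-only sketch in Python, assuming a
wall-clock timeout plus a CPU-time rlimit is the kind of bound wanted;
run_limited and its parameters are hypothetical names, and a memory cap
could be added the same way with RLIMIT_AS:

```python
import resource
import subprocess
import sys


def run_limited(cmd, input_text="", wall_seconds=10, cpu_seconds=5):
    """Run cmd with a wall-clock timeout and a CPU-time cap (Unix only).

    Returns the child's stdout, or None if it timed out or exited non-zero.
    """
    def set_limits():
        # Runs in the child just before exec; keep the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))

    try:
        result = subprocess.run(
            cmd, input=input_text, capture_output=True, text=True,
            timeout=wall_seconds, preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    print(run_limited([sys.executable, "-c", "print('ok')"]))
```

That bounds the damage from a runaway process but does nothing about
unsafe constructs, so the whitelist is still needed.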

On Tue, Mar 23, 2010 at 12:25 PM, Damon Wang <[email protected]> wrote:
> I've never used PHP for real programming, but how difficult would it be
> to write a really simple, stupid first pass at a DFA parser? I suspect
> I'd need much more than three months to make it useful, but would it be
> possible to implement some coherent subset of the features? E.g.,
> building the LR0 automaton, at least?

I don't think you'd need a "real" parser here.  Mostly we just use
preg_split() for this sort of thing.  I'm not familiar with formal
grammars and such, so I can't say what the concrete disadvantages of
that approach are.

> I suggested a Python port because
>    http://www.mediawiki.org/wiki/Summer_of_Code_2010#MediaWiki_core
> lists it as a potential project idea. I was under the impression that
> people around here did not want to leave texvc in OCaml. Is this wrong?

No, it's right.  Conrad is crazy.  :P

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
