Thanks for the responses. I do want to convert HTML that cannot be assumed to be clean, so it sounds like Parsoid will not solve the problem for now.
--James

On Fri, Nov 6, 2015 at 11:06 AM, Gabriel Wicke <gwi...@wikimedia.org> wrote:

> To add to what Eric & Subbu have said, here is a link to the API
> documentation for this end point:
>
> https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/post_transform_html_to_wikitext_title_revision
>
> On Fri, Nov 6, 2015 at 8:47 AM, Subramanya Sastry <ssas...@wikimedia.org>
> wrote:
>
> > On 11/06/2015 10:18 AM, James Montalvo wrote:
> >
> >> Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear
> >> to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if
> >> I could take snippets of HTML from non-MediaWiki webpages and convert them
> >> into wikitext.
> >
> > The right answer is: "It depends" :-)
> >
> > As Eric responded in his reply, Parsoid does convert some kinds of
> > arbitrary HTML to clean wikitext. See some additional examples at the end
> > of this email.
> >
> > However, if you really threw arbitrary HTML at it (ex: <em>..</em> or
> > <strong>..</strong>) Parsoid wouldn't know that it could potentially use ''
> > or ''' for those tags. Or, if you gave it input with all kinds of css and
> > other inlined attributes, you won't necessarily get the best wikitext from
> > it.
> >
> > But, if you tried to convert HTML that you got from say Google docs, Open
> > Office, Word, or other HTML-generation tools, the wikitext you get may not
> > be very pretty.
> >
> > We do want to keep improving Parsoid's abilities to get there, but it has
> > not been a high priority for us, but it would be a great GSoC or volunteer
> > project if someone wants to play with this and improve this feature given
> > that we are always playing catch up with all the other things we need to
> > get done.
> >
> > But, if you didn't have really arbitrary HTML, you can get some reasonable
> > looking wikitext out of it even without the markers. But, things like
> > images, templates, extensions .. obviously require the additional
> > attributes for Parsoid to generate canonical wikitext for that.
> >
> > Hope this helps.
> >
> > Subbu.
> >
> > -------------------------------------------------------------------------------------------
> >
> > Some html -> wt examples:
> >
> > [subbu@earth bin] echo "<h2>foo</h2><p>a</p><p>b</p>" | node parse --html2wt
> > == foo ==
> > a
> >
> > b
> >
> > [subbu@earth bin] echo "<a href='http://en.wikipedia.org/wiki/Hampi'>Hampi</a>" | node parse --html2wt
> > [[Hampi]]
> >
> > [subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt
> > [[:it:Luna|Luna]]
> >
> > [subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt --prefix itwiki
> > [[Luna]]
> >
> > [subbu@earth bin] echo "<ul><li>a</li><li>b</li><li>c</li></ul>" | node parse --html2wt
> > * a
> > * b
> > * c
> >
> > [subbu@earth bin] echo "<em>foo</em>" | node parse --html2wt
> > <em>foo</em>
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> --
> Gabriel Wicke
> Principal Engineer, Wikimedia Foundation
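
A note on the <em>/<strong> point in Subbu's message: he says Parsoid would not know it could use '' or ''' for those tags. The mapping itself is small, and a rough Python sketch of it might look like the block below. This is a standalone illustration, not Parsoid code and not anything the Parsoid team proposed. The names QUOTE_MARKUP, EmphasisToWikitext and emphasis_to_wikitext are made up for this example. It assumes attributes on the emphasis tags can be dropped, and it does not escape literal apostrophes already in the text, so it is not safe for general input.

```python
from html.parser import HTMLParser

# Hypothetical mapping (not part of Parsoid): HTML emphasis tags to the
# wikitext quote markup mentioned in the thread.
QUOTE_MARKUP = {"em": "''", "i": "''", "strong": "'''", "b": "'''"}


class EmphasisToWikitext(HTMLParser):
    """Rewrite emphasis tags as wikitext quotes; pass other markup through."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in QUOTE_MARKUP:
            # Assumption: attributes on emphasis tags are discarded.
            self.out.append(QUOTE_MARKUP[tag])
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(QUOTE_MARKUP.get(tag, f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def emphasis_to_wikitext(html: str) -> str:
    """Return html with <em>/<i> and <strong>/<b> replaced by ''/'''."""
    parser = EmphasisToWikitext()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(emphasis_to_wikitext("<p>a <strong>b</strong> and <em>c</em></p>"))
```

Everything other than the four emphasis tags is copied through unchanged, so the remaining HTML would still need a real converter.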