I searched the ML and found a thread that mentions a similar issue with TAG() and Unicode:
http://groups.google.com/group/web2py/browse_thread/thread/a716d6d77bf6e933?pli=1

I cannot reproduce the issues described there with TAG[tagname](input), but I still have the problem when passing a Unicode string to TAG(input):

>>> TAG['h1'](u'öäß').xml()
'<h1>\xc3\xb6\xc3\xa4\xc3\x9f</h1>'
>>> print TAG['h1'](u'öäß').xml()
<h1>öäß</h1>
>>> print TAG[u'hö'](u'<öäß').xml()
<hö><öäß</hö>
>>> print TAG(u'<h1>öäß</h1>').xml()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/jan/hg/web2py/gluon/html.py", line 1054, in __call__
    return web2pyHTMLParser(decoder.decoder(html)).tree
  File "/Users/jan/hg/web2py/gluon/decoder.py", line 74, in decoder
    return buffer.decode(encoding).encode('utf8')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)

Really strange. Hints are still welcome. ;)

Jan

On Mon, Sep 12, 2011 at 10:17 AM, jot.be <[email protected]> wrote:

> Hi Massimo,
>
> thanks for your answer!
>
> On Mon, Sep 12, 2011 at 2:19 AM, Massimo Di Pierro <
> [email protected]> wrote:
>
>> Are you sure your input is UTF8? The web2py markmin_serializer is in
>> gluon/html.py and it is relatively straightforward. Nothing can really
>> go bad there. I suspect your input has not been parsed at all into the
>> web2py object representation.
>>
>>
> Not sure - I am quite new to Python and not very skilled in dealing with
> Unicode issues.
>
> I just appended my latest attempts:
> https://gist.github.com/caec7bd5b41624d50b01#gistcomment-50227
>
> As I understand it, this is not the input that TAG(input) expects. When
> using classic HTML entities ("&auml;") it works.
>
> Any hint what I could try next? :)
>
>
>> the parsing is done by TAG(input) (not by XML(input)) and it is based
>> on the python built-in XML parser which chokes on non-utf8 chars. It
>> may not be parsing the XML at all and returning the XML as a single
>> string.
>>
>
> OK, this is clear.
>
> Jan
>
>
>>
>> Massimo
>>
>> On Sep 11, 2:01 pm, jotbe <[email protected]> wrote:
>> > Hi List,
>> >
>> > I just started my first Web2Py sample project (the Wiki from the book)
>> > and even managed to integrate the HTML5 editor Aloha:
>> http://aloha-editor.org/
>> >
>> > My pages should use Markmin instead of HTML and therefore I am
>> > converting the HTML to Markmin using TAG().flatten() and
>> > markmin_serializer. In general it is working and the content is stored
>> > as Markmin code, but when using e.g. German umlauts like 'öä', TAG()
>> > seems to get confused and doesn't handle the encoding properly.
>> >
>> > On the other hand, when trying to use
>> > XML().flatten(render=markmin_serializer) instead of
>> > TAG().flatten(render=markmin_serializer), nothing changes at all.
>> > XML().flatten(render=markmin_serializer) will return the input HTML
>> > string as is, instead of converting it to Markmin.
>> >
>> > I have been trying to solve this issue for two days now and have read
>> > lots of posts regarding handling of UTF-8 in Python, tried lots of third
>> > party modules to work around this issue, but had no luck so far. I would
>> > really appreciate your help/tips. :)
>> >
>> > Various sample code using the Web2Py Shell:
>> https://gist.github.com/caec7bd5b41624d50b01
>> >
>> > Thanks in advance!
>> >
>
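
A possible reading of the traceback above: in Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which would explain why a decode call ends in a UnicodeEncodeError. Under that assumption, one thing to try is handing TAG() a UTF-8 byte string instead of a unicode object. The helper below is only a sketch: to_utf8_bytes is a hypothetical name, not part of web2py, and whether TAG() then parses the input correctly is untested here.

```python
# -*- coding: utf-8 -*-
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8_bytes(value):
    """Hypothetical helper: encode text to UTF-8 bytes, pass byte strings through."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value


html = u'<h1>\xf6\xe4\xdf</h1>'  # u'<h1>öäß</h1>'
encoded = to_utf8_bytes(html)

# Assumption, not verified against web2py: on the Python 2 setup from the
# traceback, the next step would be to call
#     TAG(encoded).xml()
# so that buffer.decode(encoding) receives bytes rather than unicode.
```

The helper leaves byte strings untouched, so it can be applied to input whether or not it has already been encoded.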

