I searched the mailing list and found a thread that mentions a similar issue
with TAG() and Unicode:

http://groups.google.com/group/web2py/browse_thread/thread/a716d6d77bf6e933?pli=1

I cannot reproduce the described issues with TAG[tagname](input), but I
still have the problem with passing a Unicode string to TAG(input):

>>> TAG['h1'](u'öäß').xml()
'<h1>\xc3\xb6\xc3\xa4\xc3\x9f</h1>'
>>> print TAG['h1'](u'öäß').xml()
<h1>öäß</h1>
>>> print TAG[u'hö'](u'<öäß').xml()
<hö>&lt;öäß</hö>
>>> print TAG(u'<h1>öäß</h1>').xml()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/jan/hg/web2py/gluon/html.py", line 1054, in __call__
    return web2pyHTMLParser(decoder.decoder(html)).tree
  File "/Users/jan/hg/web2py/gluon/decoder.py", line 74, in decoder
    return buffer.decode(encoding).encode('utf8')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)
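
A possible explanation, as a sketch only: in Python 2, calling .decode() on a
unicode object first encodes it implicitly with the ASCII codec, which would
produce exactly this UnicodeEncodeError. The snippet below reproduces that
encode step explicitly and shows one way to hand the parser UTF-8 bytes
instead. Note that to_utf8_bytes is a hypothetical helper, not part of web2py.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(value, encoding="utf-8"):
    """Return value as UTF-8 bytes (hypothetical helper, not a web2py API)."""
    if isinstance(value, bytes):
        # Already a byte string: decode from the assumed source encoding
        # and re-encode as UTF-8.
        return value.decode(encoding).encode("utf-8")
    # Text (unicode) object: encode explicitly instead of relying on the
    # implicit ASCII conversion.
    return value.encode("utf-8")

text = u"<h1>\xf6\xe4\xdf</h1>"  # same as u'<h1>öäß</h1>'

# The implicit step made explicit: ASCII cannot represent the umlauts.
try:
    text.encode("ascii")
    error_span = None
except UnicodeEncodeError as exc:
    # exc.end is exclusive, so (4, 7) corresponds to "position 4-6".
    error_span = (exc.start, exc.end)

html_bytes = to_utf8_bytes(text)
```

With such a helper the call would become TAG(to_utf8_bytes(u'<h1>öäß</h1>')).xml().
I have not verified that this avoids the error in this web2py revision.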

Really strange. Hints are still welcome. ;)

Jan

On Mon, Sep 12, 2011 at 10:17 AM, jot.be <[email protected]> wrote:

> Hi Massimo,
>
> thanks for your answer!
>
> On Mon, Sep 12, 2011 at 2:19 AM, Massimo Di Pierro <
> [email protected]> wrote:
>
>> Are you sure your input is UTF8? The web2py markmin_serializer is in
>> gluon/html.py and it is relatively straightforward. Nothing can really
>> go bad there. I suspect your input has not been parsed at all into the
>> web2py object representation.
>>
>>
> Not sure - I am quite new to Python and am not so skilled in dealing with
> Unicode issues.
>
> I just appended my latest attempts:
> https://gist.github.com/caec7bd5b41624d50b01#gistcomment-50227
>
> As far as I understand, this is not the input that TAG(input) expects. When
> using classic HTML entities ("&auml;") it works.
>
> Any hint what I could try next? :)
>
>
>> the parsing is done by TAG(input) (not by XML(input)) and it is based
>> on the python built-in XML parser which chokes on non-utf8 chars. It
>> may not be parsing the XML at all and returning the XML as a single
>> string.
>>
>
> OK, this is clear.
>
> Jan
>
>
>>
>> Massimo
>>
>> On Sep 11, 2:01 pm, jotbe <[email protected]> wrote:
>> > Hi List,
>> >
>> > I just started my first Web2Py sample project (the Wiki from the book)
>> > and even managed to integrate the HTML5 editor Aloha:
>> http://aloha-editor.org/
>> >
>> > My pages should use Markmin instead of HTML and therefore I am
>> > converting the HTML to Markmin using TAG().flatten() and
>> > markmin_serializer. In general it is working and the content is stored
>> > as Markmin code, but when using e.g. German umlauts like 'öä', TAG()
>> > seems to get confused and doesn't handle the encoding properly.
>> >
>> > On the other hand, when trying to use
>> > XML().flatten(render=markmin_serializer) instead of
>> > TAG().flatten(render=markmin_serializer), nothing changes at all.
>> > XML().flatten(render=markmin_serializer) will return the input HTML
>> > string as is, instead of converting it to Markmin.
>> >
>> > I have been trying to solve this issue for two days now. I have read lots
>> > of posts regarding handling of UTF-8 in Python and tried lots of
>> > third-party modules to work around this issue, but have had no luck so
>> > far. I would really appreciate your help/tips. :)
>> >
>> > Various sample code using the Web2Py Shell:
>> https://gist.github.com/caec7bd5b41624d50b01
>> >
>> > Thanks in advance!
>>
>
>
