[web2py] Re: Issues with TAG() encoding and XML().flatten()

jotbe Tue, 13 Sep 2011 12:25:25 -0700

Thanks again, but my issue is about interpreting an HTML string that
should be flattened to Markmin, so I have to pass the string like
'<h1>öäß</h1>'. The encoding is messed up afterwards:


>>> print TAG(u'<h1>öäß</h1>'.encode('utf8'))
<h1>Ã¶Ã¤Ã</h1>
>>> print TAG['h1'](u'öäß'.encode('utf8')).xml()
<h1>öäß</h1>
>>> print TAG['h1'](u'öäß'.encode('utf8')).flatten(render=markmin_serializer)
# öäß


>>> print TAG(u'<h1>öäß</h1>'.encode('utf8')).flatten(render=markmin_serializer)
# Ã¶Ã¤Ã
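
The garbled output above looks like what you get when UTF-8 bytes are read back as Latin-1 somewhere along the way. That is only an assumption about what happens inside TAG(input), not something confirmed in this thread. A minimal sketch that reproduces the symptom with plain strings (written for Python 3, no web2py involved):

```python
# Assumption: the garbling comes from UTF-8 bytes being read as Latin-1.
# Plain Python 3, no web2py calls.
text = 'öäß'
utf8_bytes = text.encode('utf-8')   # two bytes per character here

# Decoding with the wrong codec turns every two-byte character
# into two separate characters.
garbled = utf8_bytes.decode('latin-1')
print(repr(garbled))

# The damage is reversible as long as no byte was dropped on the way.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == text)
```

If this is what happens, the string is being encoded or decoded once too often before it reaches the parser, which would fit the difference between the TAG['h1'](...) and TAG('<h1>...</h1>') calls.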

Maybe I could work around this issue if I knew how to separate the
semantics (the tag) from the text in my string '<h1>öäß</h1>', so I
could use it like TAG['h1'](u'öäß'...)
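
One way to split such a string into tag name and text might be the standard library HTML parser. The sketch below is written for Python 3 (on Python 2.7 the module is called HTMLParser); the names TagTextCollector and split_tags are made up for illustration, and it only handles flat, non-nested markup:

```python
# Sketch for Python 3; on Python 2.7 the module is named HTMLParser.
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects (tag, text) pairs from flat HTML."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (tag, text) pairs
        self._tag = None   # tag currently open, if any
        self._text = []    # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        self._tag = tag
        self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open tag.
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.pairs.append((tag, ''.join(self._text)))
            self._tag = None


def split_tags(html):
    """Return a list of (tag, text) pairs; nested tags are not supported."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


print(split_tags('<h1>öäß</h1>'))
```

Each pair could then be fed to the call that already works, along the lines of TAG[tag](text.encode('utf8')), but I have not tried that against web2py.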

Jan

On 13 Sep., 20:47, Massimo Di Pierro <[email protected]>
wrote:
> TAG['h1'](u'öäß'.encode('utf8')).xml()
>
> On Sep 13, 11:13 am, "jot.be" <[email protected]> wrote:
>
> > I searched the ML and found a thread that mentions a similar issue with
> > TAG() and Unicode:
>
> > http://groups.google.com/group/web2py/browse_thread/thread/a716d6d77b...
>
> > I cannot reproduce the described issues with TAG[tagname](input), but I
> > still have the problem with passing a Unicode string to TAG(input):
>
> > >>> TAG['h1'](u'öäß').xml()
> > '<h1>\xc3\xb6\xc3\xa4\xc3\x9f</h1>'
> > >>> print TAG['h1'](u'öäß').xml()
> > <h1>öäß</h1>
> > >>> print TAG[u'hö'](u'<öäß').xml()
> > <hö>&lt;öäß</hö>
> > >>> print TAG(u'<h1>öäß</h1>').xml()
>
> > Traceback (most recent call last):
> >   File "<console>", line 1, in <module>
> >   File "/Users/jan/hg/web2py/gluon/html.py", line 1054, in __call__
> >     return web2pyHTMLParser(decoder.decoder(html)).tree
> >   File "/Users/jan/hg/web2py/gluon/decoder.py", line 74, in decoder
> >     return buffer.decode(encoding).encode('utf8')
> >   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
> >     return codecs.utf_8_decode(input, errors, True)
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6:
> > ordinal not in range(128)
>
> > Really strange. Hints are still welcome. ;)
>
> > Jan
>
> > On Mon, Sep 12, 2011 at 10:17 AM, jot.be <[email protected]> wrote:
> > > Hi Massimo,
>
> > > thanks for your answer!
>
> > > On Mon, Sep 12, 2011 at 2:19 AM, Massimo Di Pierro <
> > > [email protected]> wrote:
>
> > >> Are you sure your input is UTF8? The web2py markmin_serializer is in
> > >> gluon/html.py and it is relatively straightforward. Nothing can really
> > >> go bad there. I suspect your input has not been parsed at all into the
> > >> web2py object representation.
>
> > > Not sure - I am quite new to Python and am not so skilled in dealing with
> > > Unicode issues.
>
> > > I just appended my last approaches:
> > > https://gist.github.com/caec7bd5b41624d50b01#gistcomment-50227
>
> > > As far as I understand, it is not the input that TAG(input) expects. When
> > > using classic HTML entities ("&auml;") it works.
>
> > > Any hint what I could try next? :)
>
> > >> the parsing is done by TAG(input) (not by XML(input)) and it is based
> > >> on the python built-in XML parser which chokes on non-utf8 chars. It
> > >> may not be parsing the XML at all and returning the XML as a single
> > >> string.
>
> > > OK, this is clear.
>
> > > Jan
>
> > >> Massimo
>
> > >> On Sep 11, 2:01 pm, jotbe <[email protected]> wrote:
> > >> > Hi List,
>
> > >> > I just started my first Web2Py sample project (the Wiki from the book)
> > >> > and even managed to integrate the HTML5 editor Aloha:
> > >> http://aloha-editor.org/
>
> > >> > My pages should use Markmin instead of HTML and therefore I am
> > >> > converting the HTML to Markmin using TAG().flatten() and
> > >> > markmin_serializer. In general it is working and the content is stored
> > >> > as Markmin code, but when using e.g. German umlauts like 'öä', TAG()
> > >> > seems to get confused and doesn't handle the encoding properly.
>
> > >> > On the other hand, when trying to use
> > >> > XML().flatten(render=markmin_serializer) instead of
> > >> > TAG().flatten(render=markmin_serializer), nothing changes at all.
> > >> > XML().flatten(render=markmin_serializer) will return the input HTML
> > >> > string as is, instead of converting it to Markmin.
>
> > >> > I have been trying to solve this issue for two days now and have read
> > >> > lots of posts regarding handling of UTF-8 in Python, tried lots of third
> > >> > party modules to work around this issue, but had no luck so far. I would
> > >> > really appreciate your help/tips. :)
>
> > >> > Various sample code using the Web2Py Shell:
> > >> https://gist.github.com/caec7bd5b41624d50b01
>
> > >> > Thanks in advance!
