[TurboGears] Re: util.to_unicode() and non-strings

Jorge Vargas Fri, 19 Dec 2008 13:24:50 -0800

On Fri, Dec 19, 2008 at 7:40 AM, Andi Albrecht
<[email protected]> wrote:
>
> On Fri, Dec 19, 2008 at 5:21 AM, Jorge Vargas <[email protected]> wrote:
>>
>> I'm not entirely sure how genshi handles this as I make sure all my
>> string-like types are unicode all the way in from the validator to the
>> DB, but if your use case is valid (read: could benefit the others) we
>> could add it as a 'helper function in tg2'
>
> Genshi's Markup class for example is a subclass of __builtin__.unicode
> and therefore it's very strict with strings (in terms of str)
> containing non-ASCII characters. E.g. "genshi.Markup('schön')" fails
> with a UnicodeDecodeError.
>
but this fails even outside genshi: in python2.x a str is a plain
byte string, and any implicit conversion to unicode assumes ASCII, so
non-ASCII bytes raise an error. It is indeed one of the reasons why
python3.0 is so important. Here is a little shell session showing the
issue.


>>> s='schön'
>>> s
'sch\xc3\xb6n'
>>> 'schön'
'sch\xc3\xb6n'
>>> u'schön'
u'sch\xf6n'
>>> 'schön'.encode('UTF8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> u'schön'.encode('UTF8')
'sch\xc3\xb6n'
>>> u'schön'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)


The problem with python2.x is that even though ö isn't valid ASCII, the
str object accepts it as raw bytes. Implicit conversions to unicode
assume ASCII encoding, so when you try to use that str as a unicode
string things go bad, because those bytes are not valid ASCII.
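
To make the distinction concrete, here is a minimal sketch (my own
illustration, not code from TurboGears or genshi). The variable names
are assumptions, and the byte literal stands in for what a 2.x str
holds for 'schön' on a UTF-8 terminal. It should run unchanged on
python2.x and python3.x.

```python
# UTF-8 bytes for 'schön', i.e. what a python2.x str literal stores.
raw = b"sch\xc3\xb6n"

# Explicit decode with the right codec: bytes -> unicode.
decoded = raw.decode("utf-8")

# Decoding as ASCII is what python2.x attempts implicitly; it fails here.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(raw), len(decoded), ascii_failed)
```

The byte string is six bytes long while the decoded value is five
characters, which is why the two cannot be mixed without naming an
encoding.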

So essentially you are trying to fix the consequence, not the problem.
Again, the solution is to make the value unicode before it reaches the
template: ideally it is already unicode inside the db, and if it is
not, convert it when you insert it into the db in the first place.

That said, I think a more generic function that makes sure the
value (str, float, int, etc.) is always valid unicode would be nice to
have around to get rid of those pesky encoding bugs until we can all go
to python3.0 and forget about it once and for all. Patches welcome
:)
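
As a starting point for such a patch, a rough sketch could look like
the following. This is not the existing util.to_unicode(); the name
to_unicode_any, the text_type/bytes_type aliases and the UTF-8 default
are all assumptions made for illustration. It is written to run on
both python2.x and python3.x.

```python
import sys

# Pick the text and byte-string types for the running interpreter.
if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # python2.x names


def to_unicode_any(value, encoding="utf-8"):
    """Hypothetical helper: return value as a unicode string."""
    if isinstance(value, text_type):
        return value                   # already unicode, leave untouched
    if isinstance(value, bytes_type):
        return value.decode(encoding)  # raises UnicodeDecodeError on bad bytes
    return text_type(value)            # ints, floats and other objects


print(repr(to_unicode_any(1)))
print(repr(to_unicode_any(b"sch\xc3\xb6n")))
```

Note that this converts everything, including None (which becomes
u'None'), so it would not be a drop-in replacement where the widgets
rely on ints and floats passing through unchanged.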

> For me the to_unicode() function does a pretty good job of making sure
> that all string-like types are rendered correctly (even when taking
> care that all DB values are unicode, I've found that it's good to have
> such a function at hand ;-). I found it very useful
> to have the same function both on the Python side and during template
> rendering. But IMO such a function is only useful if you don't need to
> specify input or output encodings each time this function is called.
> And so I'm not sure if there's a generic helper function for *all* use
> cases that does a proper unicode conversion (but maybe a common
> default implementation...).
>
>
>
>>
>> On Thu, Dec 18, 2008 at 4:28 PM, Andi Albrecht
>> <[email protected]> wrote:
>>>
>>> I agree, it'd be a cosmetic change and the docstring is clear. I
>>> came across this function some time ago while looking for a handy
>>> "convert to unicode" function to use it in templates or at least to
>>> prepare some values for the template.
>>>
>>> My problem is solved by doing a string conversion before calling
>>> to_unicode (as you already suggested), but breaking validator code
>>> would be a bad side-effect ;-)
>>>
>>> Thanks for your reply,
>>>
>>> Andi
>>>
>>> On Thu, Dec 18, 2008 at 8:30 PM, Jorge Vargas <[email protected]> 
>>> wrote:
>>>>
>>>> On Thu, Dec 18, 2008 at 3:36 AM, Andi Albrecht
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> the docstring of util.to_unicode() says clearly that it converts an
>>>>> "encoded string to unicode string". Unfortunately, when the function is
>>>>> called with anything other than an instance of basestring (str,
>>>>> unicode) the input value is returned without any type conversion. That's
>>>>> not what I'd expected. What I'd expected is either that the
>>>>> function raises an exception if the input value isn't an instance of
>>>>> basestring or that the function tries to convert the input value into
>>>>> a unicode. For example, when calling "to_unicode(1L)" I would expect
>>>>> the return value to be u"1", not 1L.
>>>>>
>>>>> Are there any concerns about adding some simple type conversion? IMO
>>>>> something like
>>>>>
>>>>> if not isinstance(value, basestring):
>>>>>  value = unicode(value)
>>>>> if isinstance(value, str):
>>>>>  [...conversion as usual...]
>>>>> return value
>>>>>
>>>>> is better than returning value without even trying to convert it to a
>>>>> unicode as the function's name suggests.
>>>>>
>>>>> Andi
>>>>>
>>>> Hello, I assume you are talking about the 1.x branch.
>>>>
>>>> I don't consider this a bug; as you pointed out, the docstring clearly
>>>> states it's a str -> unicode converter.
>>>>
>>>> Now I don't think changing the behavior is good for two reasons
>>>> 1- util is really a package of util functions for TG not for client code
>>>> 2- this function is used here:
>>>> widgets/base.py:        params["value"] = to_unicode(self.adjust_value(value, **params))
>>>>
>>>> therefore changing the function will mean that the widget will not
>>>> have ints and floats, which I'm certain are used by the validators.
>>>>
>>>> if you want this behavior you can do:
>>>>
>>>>>>> util.to_unicode(str(1L))
>>>> u'1'
>>>>
>>>> We could change the function name; after all, it's used in only one
>>>> spot in the tg1 codebase (at least that's what grep says), but why
>>>> break things for a cosmetic change?
>>>>
