Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-21 Thread Adam Funk
On 2016-10-17, eryk sun wrote:

> On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk  wrote:
>> I'm using IDLE 3 (with python 3.5.2) to work interactively with
>> Twitter data, which of course contains emojis.  Whenever the running
>> program tries to print the text of a tweet with an emoji, it barfs
>> this & stops running:
>>
>>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>>   position 102-102: Non-BMP character not supported in Tk
>>
>> Is there any way to set IDLE to ignore these characters (either drop
>> them or replace them with something else) instead of throwing the
>> exception?
>>
>> If not, what's the best way to strip them out of the string before
>> printing?
>
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
> import builtins
>
> def print_ucs2(*args, print=builtins.print, **kwds):
>     args2 = []
>     for a in args:
>         a = str(a)
>         # Only transcode when the string has a non-BMP character;
>         # the "a and" guard keeps max() from failing on ''.
>         if a and max(a) > '\uffff':
>             b = a.encode('utf-16le', 'surrogatepass')
>             chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
>                      for i in range(0, len(b), 2)]
>             a = ''.join(chars)
>         args2.append(a)
>     print(*args2, **kwds)
>
> builtins._print = builtins.print
> builtins.print = print_ucs2
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
> pair of empty boxes. If you're not using Windows you can modify this
> to print something else for non-BMP characters, such as a replacement
> character or \U literals.

Clever, thanks.  (I'm actually using Linux.)

-- 
Consistently separating words by spaces became a general custom about
the tenth century A. D., and lasted until about 1957, when FORTRAN
abandoned the practice.  --- Sun FORTRAN Reference Manual
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread eryk sun
On Tue, Oct 18, 2016 at 2:09 AM, Chris Angelico  wrote:
> That's not a UTF-16 encoded byte string, though. It's a Unicode string
> that contains two surrogates. So maybe the solution is to convert from
> true Unicode strings into strings like the above - but if so, it
> absolutely must not be done in any user-facing way. It should be an
> implementation detail of Tkinter.

Yes, it's an invalid Unicode string, since it contains surrogate
codes. At the C level this gets passed as a UTF-16 string, even in
Unix, i.e. in most cases a Tcl_UniChar is defined as a C unsigned
short since the macro TCL_UTF_MAX defaults to 3 (UTF-8 bytes).

As I said, I'm not experienced with TCL/Tk enough to know whether
UTF-16 strings with surrogate pairs cause other problems. On Linux it
prints the surrogate codes as empty box characters, which is certainly
ugly, and printing two characters in place of one is also incorrect. It
seems that TCL's UTF-8 conversion doesn't work with UTF-16. Thus
supporting non-BMP characters would be limited to Windows until the
default TCL_UTF_MAX is greater than 3 on Unix platforms. Supposedly
this has actually worked in the core TCL implementation for some time,
but extensions are holding it back.


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread Chris Angelico
On Tue, Oct 18, 2016 at 10:23 AM, eryk sun  wrote:
> I don't know whether it causes problems elsewhere in Tk, but it has no
> problem passing along a UTF-16 string to Windows. For example, see the
> following with a breakpoint set on TextOut [1]:
>
> >>> root = tkinter.Tk()
> >>> w = tkinter.Label(root, text='test: \ud83d\udc4c')
> >>> w.pack()

That's not a UTF-16 encoded byte string, though. It's a Unicode string
that contains two surrogates. So maybe the solution is to convert from
true Unicode strings into strings like the above - but if so, it
absolutely must not be done in any user-facing way. It should be an
implementation detail of Tkinter.
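
To make the distinction concrete, here is a minimal sketch of the three
things being discussed: a real one-character string, a str holding two
surrogate code points, and the UTF-16 bytes they share. The helper name
merge_surrogates is made up for illustration; it is not part of tkinter
or the standard library.

```python
def merge_surrogates(text):
    """Rejoin surrogate pairs in a str into real non-BMP characters."""
    # 'surrogatepass' lets the lone surrogates through the encoder;
    # the strict decode then reads each pair as a single code point.
    return text.encode('utf-16le', 'surrogatepass').decode('utf-16le')


real = '\U0001F44C'      # one code point, a valid Unicode string
fake = '\ud83d\udc4c'    # two surrogate code points, not valid text

print(len(real), len(fake))   # 1 2

# Both forms produce the same UTF-16 bytes, which is why Tk can pass
# the surrogate form straight through to a UTF-16 API.
print(real.encode('utf-16le') == fake.encode('utf-16le', 'surrogatepass'))

try:
    fake.encode('utf-8')      # a strict codec rejects surrogates
except UnicodeEncodeError as exc:
    print('rejected:', exc.reason)

print(merge_surrogates(fake) == real)
```

So the surrogate form is only safe as an internal hand-off; anything
that later encodes it strictly will raise.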

ChrisA


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread eryk sun
On Mon, Oct 17, 2016 at 8:35 PM, Random832  wrote:
> On Mon, Oct 17, 2016, at 14:20, eryk sun wrote:
>> You can patch print() to transcode non-BMP characters as surrogate
>> pairs. For example:
>>
>> On Windows this should allow printing non-BMP characters such as
>> emojis (e.g. U+0001F44C).
>
> I thought there was some reason this wouldn't work with tk, or else
> tkinter would do it already?

I don't know whether it causes problems elsewhere in Tk, but it has no
problem passing along a UTF-16 string to Windows. For example, see the
following with a breakpoint set on TextOut [1]:

>>> root = tkinter.Tk()
>>> w = tkinter.Label(root, text='test: \ud83d\udc4c')
>>> w.pack()

Breakpoint 0 hit
GDI32!TextOutW:
7fff`6d6c61d0 ff2532a10200    jmp     qword ptr
    [GDI32!_imp_TextOutW (7fff`6d6f0308)]
    ds:7fff`6d6f0308={gdi32full!TextOutW (7fff`6a3143c0)}

0:000> du @r9
00d6`dfdeea50  "test: .."

0:000> dw @r9 l8
00d6`dfdeea50  0074 0065 0073 0074 003a 0020 d83d dc4c

The lpString parameter (x64 register r9) is the label's text,
including the surrogate pair "\ud83d\udc4c" (i.e. U+0001F44C).
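
For reference, the d83d dc4c words in the dump follow from the standard
UTF-16 surrogate arithmetic. A small sketch (the function name
to_surrogate_pair is hypothetical, chosen just for this example):

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a non-BMP code point: %#x' % code_point)
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F44C)
print('%04x %04x' % (high, low))  # d83d dc4c
```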

[1]: https://msdn.microsoft.com/en-us/library/dd145133


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread Random832
On Mon, Oct 17, 2016, at 14:20, eryk sun wrote:
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
> 
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C).

I thought there was some reason this wouldn't work with tk, or else
tkinter would do it already?


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread eryk sun
On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk  wrote:
> I'm using IDLE 3 (with python 3.5.2) to work interactively with
> Twitter data, which of course contains emojis.  Whenever the running
> program tries to print the text of a tweet with an emoji, it barfs
> this & stops running:
>
>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>   position 102-102: Non-BMP character not supported in Tk
>
> Is there any way to set IDLE to ignore these characters (either drop
> them or replace them with something else) instead of throwing the
> exception?
>
> If not, what's the best way to strip them out of the string before
> printing?

You can patch print() to transcode non-BMP characters as surrogate
pairs. For example:

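import builtins

def print_ucs2(*args, print=builtins.print, **kwds):
    args2 = []
    for a in args:
        a = str(a)
        # Only transcode when the string has a non-BMP character;
        # the "a and" guard keeps max() from failing on ''.
        if a and max(a) > '\uffff':
            # Encode to UTF-16, then decode each 2-byte code unit on
            # its own so every non-BMP character becomes two surrogates.
            b = a.encode('utf-16le', 'surrogatepass')
            chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
                     for i in range(0, len(b), 2)]
            a = ''.join(chars)
        args2.append(a)
    print(*args2, **kwds)

# Keep the original reachable as builtins._print, then install the patch.
builtins._print = builtins.print
builtins.print = print_ucs2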

On Windows this should allow printing non-BMP characters such as
emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
pair of empty boxes. If you're not using Windows you can modify this
to print something else for non-BMP characters, such as a replacement
character or \U literals.
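
For the non-Windows case, a rough sketch of that modification might
look like the following. The names escape_non_bmp and print_bmp are
made up for this example, and the choice between \U escapes and a
replacement character is an assumption about what reads best in IDLE.

```python
import builtins


def escape_non_bmp(text, replacement=None):
    # With replacement=None each non-BMP character becomes a
    # \UXXXXXXXX escape; otherwise it becomes the given string,
    # e.g. the U+FFFD replacement character.
    out = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            if replacement is None:
                out.append('\\U%08x' % ord(ch))
            else:
                out.append(replacement)
        else:
            out.append(ch)
    return ''.join(out)


def print_bmp(*args, print=builtins.print, **kwds):
    # Convert each argument to str, as print() would, then escape it.
    print(*(escape_non_bmp(str(a)) for a in args), **kwds)


# To install it in the same way as print_ucs2:
# builtins._print = builtins.print
# builtins.print = print_bmp
```

With this, a string containing U+0001F44C prints with the literal text
\U0001f44c in its place, so nothing is silently lost.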


Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread Adam Funk
On 2016-10-17, Adam Funk wrote:

> I'm using IDLE 3 (with python 3.5.2) to work interactively with
> Twitter data, which of course contains emojis.  Whenever the running
> program tries to print the text of a tweet with an emoji, it barfs
> this & stops running:
>
>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>   position 102-102: Non-BMP character not supported in Tk
>
> Is there any way to set IDLE to ignore these characters (either drop
> them or replace them with something else) instead of throwing the
> exception?
>
> If not, what's the best way to strip them out of the string before
> printing?

Well, to answer part of my own question, this works for stripping them
out:

 s = ''.join([c for c in s if ord(c) <= 0xFFFF])
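
An equivalent approach, sketched here with the re module, is to match
the whole non-BMP range in one character class. The function name
strip_non_bmp is just for illustration.

```python
import re

# Matches any code point above U+FFFF, i.e. anything outside the BMP.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def strip_non_bmp(s):
    """Return s with every non-BMP character removed."""
    return _NON_BMP.sub('', s)


print(strip_non_bmp('hi \U0001F44C!'))  # hi !
```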



-- 
Master Foo said: "A man who mistakes secrets for knowledge is like
a man who, seeking light, hugs a candle so closely that he smothers
it and burns his hand."--- Eric Raymond


Making IDLE3 ignore non-BMP characters instead of throwing an exception?

2016-10-17 Thread Adam Funk
I'm using IDLE 3 (with python 3.5.2) to work interactively with
Twitter data, which of course contains emojis.  Whenever the running
program tries to print the text of a tweet with an emoji, it barfs
this & stops running:

  UnicodeEncodeError: 'UCS-2' codec can't encode characters in
  position 102-102: Non-BMP character not supported in Tk

Is there any way to set IDLE to ignore these characters (either drop
them or replace them with something else) instead of throwing the
exception?

If not, what's the best way to strip them out of the string before
printing?

Thanks,
Adam