Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On 2016-10-17, eryk sun wrote:

> On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk wrote:
>> I'm using IDLE 3 (with python 3.5.2) to work interactively with
>> Twitter data, which of course contains emojis.  Whenever the running
>> program tries to print the text of a tweet with an emoji, it barfs
>> this & stops running:
>>
>>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>>   position 102-102: Non-BMP character not supported in Tk
>>
>> Is there any way to set IDLE to ignore these characters (either drop
>> them or replace them with something else) instead of throwing the
>> exception?
>>
>> If not, what's the best way to strip them out of the string before
>> printing?
>
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
>     import builtins
>
>     def print_ucs2(*args, print=builtins.print, **kwds):
>         args2 = []
>         for a in args:
>             a = str(a)
>             if a and max(a) > '\uffff':
>                 b = a.encode('utf-16le', 'surrogatepass')
>                 chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
>                          for i in range(0, len(b), 2)]
>                 a = ''.join(chars)
>             args2.append(a)
>         print(*args2, **kwds)
>
>     builtins._print = builtins.print
>     builtins.print = print_ucs2
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
> pair of empty boxes. If you're not using Windows you can modify this
> to print something else for non-BMP characters, such as a replacement
> character or \U literals.

Clever, thanks.  (I'm actually using Linux.)

-- 
Consistently separating words by spaces became a general custom about
the tenth century A. D., and lasted until about 1957, when FORTRAN
abandoned the practice.             --- Sun FORTRAN Reference Manual
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On Tue, Oct 18, 2016 at 2:09 AM, Chris Angelico wrote:
> That's not a UTF-16 encoded byte string, though. It's a Unicode string
> that contains two surrogates. So maybe the solution is to convert from
> true Unicode strings into strings like the above - but if so, it
> absolutely must not be done in any user-facing way. It should be an
> implementation detail of Tkinter.

Yes, it's an invalid Unicode string, since it contains surrogate
codes. At the C level this gets passed as a UTF-16 string, even on
Unix, i.e. in most cases a Tcl_UniChar is defined as a C unsigned short
since the macro TCL_UTF_MAX defaults to 3 (UTF-8 bytes).

As I said, I'm not experienced enough with Tcl/Tk to know whether
UTF-16 strings with surrogate pairs cause other problems. On Linux it
prints the surrogate codes as empty box characters, which is certainly
ugly, and printing two characters in place of one is also incorrect.
It seems that Tcl's UTF-8 conversion doesn't work with UTF-16. Thus
supporting non-BMP characters would be limited to Windows until the
default TCL_UTF_MAX is greater than 3 on Unix platforms. Supposedly
this has actually worked in the core Tcl implementation for some time,
but extensions are holding it back.
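To make the TCL_UTF_MAX point concrete, here is a small sketch in plain Python (not Tcl; the variable names are only illustrative). It shows that a non-BMP character such as U+0001F44C needs four UTF-8 bytes and two 16-bit UTF-16 code units, which is why it cannot fit in a single 16-bit Tcl_UniChar:

```python
ch = '\U0001F44C'  # OK HAND SIGN, a character outside the BMP

utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16le')

print(len(utf8))        # number of UTF-8 bytes needed
print(len(utf16) // 2)  # number of 16-bit UTF-16 code units needed
print(utf16.hex())      # the surrogate pair d83d dc4c in little-endian byte order
```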
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On Tue, Oct 18, 2016 at 10:23 AM, eryk sun wrote:
> I don't know whether it causes problems elsewhere in Tk, but it has no
> problem passing along a UTF-16 string to Windows. For example, see the
> following with a breakpoint set on TextOut [1]:
>
>     >>> root = tkinter.Tk()
>     >>> w = tkinter.Label(root, text='test: \ud83d\udc4c')
>     >>> w.pack()

That's not a UTF-16 encoded byte string, though. It's a Unicode string
that contains two surrogates. So maybe the solution is to convert from
true Unicode strings into strings like the above - but if so, it
absolutely must not be done in any user-facing way. It should be an
implementation detail of Tkinter.

ChrisA
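The distinction drawn above can be illustrated with a short sketch, assuming only the standard codecs (the variable names are made up for the example). It builds both a real UTF-16 byte string and a str containing two lone surrogates, and shows that the latter is not encodable as strict UTF-8:

```python
ch = '\U0001F44C'

# A true UTF-16 encoded byte string (4 bytes, no BOM).
encoded = ch.encode('utf-16le')

# A str holding two surrogate code points, built by decoding each
# 16-bit unit on its own with the 'surrogatepass' error handler.
surrogates = ''.join(
    encoded[i:i + 2].decode('utf-16le', 'surrogatepass')
    for i in range(0, len(encoded), 2)
)

print(type(encoded).__name__, len(encoded))
print(type(surrogates).__name__, len(surrogates))
print(surrogates == '\ud83d\udc4c')

# Lone surrogates are not valid in strict UTF-8, so this str is
# "invalid Unicode" in the sense discussed in the thread.
try:
    surrogates.encode('utf-8')
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)

# The transformation is reversible: re-encode with 'surrogatepass'
# and decode as ordinary UTF-16 to get the original character back.
restored = surrogates.encode('utf-16le', 'surrogatepass').decode('utf-16le')
print(restored == ch)
```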
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On Mon, Oct 17, 2016 at 8:35 PM, Random832 wrote:
> On Mon, Oct 17, 2016, at 14:20, eryk sun wrote:
>> You can patch print() to transcode non-BMP characters as surrogate
>> pairs. For example:
>>
>> On Windows this should allow printing non-BMP characters such as
>> emojis (e.g. U+0001F44C).
>
> I thought there was some reason this wouldn't work with tk, or else
> tkinter would do it already?

I don't know whether it causes problems elsewhere in Tk, but it has no
problem passing along a UTF-16 string to Windows. For example, see the
following with a breakpoint set on TextOut [1]:

    >>> root = tkinter.Tk()
    >>> w = tkinter.Label(root, text='test: \ud83d\udc4c')
    >>> w.pack()

    Breakpoint 0 hit
    GDI32!TextOutW:
    7fff`6d6c61d0 ff2532a10200    jmp     qword ptr [GDI32!_imp_TextOutW (7fff`6d6f0308)] ds:7fff`6d6f0308={gdi32full!TextOutW (7fff`6a3143c0)}
    0:000> du @r9
    00d6`dfdeea50  "test: .."
    0:000> dw @r9 l8
    00d6`dfdeea50  0074 0065 0073 0074 003a 0020 d83d dc4c

The lpString parameter (x64 register r9) is the label's text,
including the surrogate pair "\ud83d\udc4c" (i.e. U+0001F44C).

[1]: https://msdn.microsoft.com/en-us/library/dd145133
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On Mon, Oct 17, 2016, at 14:20, eryk sun wrote:
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C).

I thought there was some reason this wouldn't work with tk, or else
tkinter would do it already?
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk wrote:
> I'm using IDLE 3 (with python 3.5.2) to work interactively with
> Twitter data, which of course contains emojis.  Whenever the running
> program tries to print the text of a tweet with an emoji, it barfs
> this & stops running:
>
>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>   position 102-102: Non-BMP character not supported in Tk
>
> Is there any way to set IDLE to ignore these characters (either drop
> them or replace them with something else) instead of throwing the
> exception?
>
> If not, what's the best way to strip them out of the string before
> printing?

You can patch print() to transcode non-BMP characters as surrogate
pairs. For example:

    import builtins

    def print_ucs2(*args, print=builtins.print, **kwds):
        args2 = []
        for a in args:
            a = str(a)
            # Only transcode when the string has a character above U+FFFF;
            # the "a and" guard keeps max() from failing on an empty string.
            if a and max(a) > '\uffff':
                # Encode to UTF-16, then decode each 16-bit unit on its
                # own so non-BMP characters become surrogate pairs.
                b = a.encode('utf-16le', 'surrogatepass')
                chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
                         for i in range(0, len(b), 2)]
                a = ''.join(chars)
            args2.append(a)
        print(*args2, **kwds)

    # Keep the original print() reachable, then install the wrapper.
    builtins._print = builtins.print
    builtins.print = print_ucs2

On Windows this should allow printing non-BMP characters such as
emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
pair of empty boxes. If you're not using Windows you can modify this
to print something else for non-BMP characters, such as a replacement
character or \U literals.
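For the non-Windows case mentioned above, one possible variation is to print \U literals instead of surrogate pairs. The following is only a sketch under that assumption; escape_non_bmp and print_bmp_safe are hypothetical names, not part of IDLE or tkinter, and the wrapper is deliberately not installed over builtins.print here:

```python
import builtins

def escape_non_bmp(s):
    """Return s with each non-BMP character replaced by a \\UXXXXXXXX literal."""
    return ''.join(
        c if ord(c) <= 0xFFFF else '\\U{:08x}'.format(ord(c))
        for c in s
    )

def print_bmp_safe(*args, print=builtins.print, **kwds):
    # Convert every argument to str and escape anything above U+FFFF
    # before handing it to the real print().
    print(*(escape_non_bmp(str(a)) for a in args), **kwds)

print_bmp_safe('ok \U0001F44C')
```

To use it everywhere in an interactive session, it could be assigned to builtins.print in the same way as print_ucs2.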
Re: Making IDLE3 ignore non-BMP characters instead of throwing an exception?
On 2016-10-17, Adam Funk wrote:

> I'm using IDLE 3 (with python 3.5.2) to work interactively with
> Twitter data, which of course contains emojis.  Whenever the running
> program tries to print the text of a tweet with an emoji, it barfs
> this & stops running:
>
>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>   position 102-102: Non-BMP character not supported in Tk
>
> Is there any way to set IDLE to ignore these characters (either drop
> them or replace them with something else) instead of throwing the
> exception?
>
> If not, what's the best way to strip them out of the string before
> printing?

Well, to answer part of my own question, this works for stripping them
out:

    s = ''.join([c for c in s if ord(c) <= 0xFFFF])

-- 
Master Foo said: "A man who mistakes secrets for knowledge is like
a man who, seeking light, hugs a candle so closely that he smothers
it and burns his hand."                            --- Eric Raymond
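The same filtering can also be written with a regular expression, which makes it easy to substitute a placeholder instead of dropping the character. This is a minimal sketch, not anything IDLE provides; replace_non_bmp and the default U+FFFD replacement are assumptions chosen for illustration:

```python
import re

# Matches any code point above the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(s, replacement='\ufffd'):
    """Replace every non-BMP character in s with `replacement`.

    Pass replacement='' to strip the characters instead.
    """
    return _NON_BMP.sub(replacement, s)

cleaned = replace_non_bmp('a\U0001F44Cb')
stripped = replace_non_bmp('a\U0001F44Cb', '')
```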
Making IDLE3 ignore non-BMP characters instead of throwing an exception?
I'm using IDLE 3 (with python 3.5.2) to work interactively with
Twitter data, which of course contains emojis.  Whenever the running
program tries to print the text of a tweet with an emoji, it barfs
this & stops running:

  UnicodeEncodeError: 'UCS-2' codec can't encode characters in
  position 102-102: Non-BMP character not supported in Tk

Is there any way to set IDLE to ignore these characters (either drop
them or replace them with something else) instead of throwing the
exception?

If not, what's the best way to strip them out of the string before
printing?

Thanks,
Adam