Re: unicode question
On Wed, Jan 28, 2015 8:21 AM CET Terry Reedy wrote:
On 1/27/2015 12:17 AM, Rehab Habeeb wrote:
Hi there python staff, does python support the Arabic language for texts? And what to do if it supports it? I wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared (a syntax error in unicode)!!

I do not know how complete the support is, but this is copied from 3.4.2, which uses tcl/tk 8.6.

t = "الحركات"
for c in t: print(c)  # Prints rightmost char above first
ا ل ح ر ك ا ت

Wow, I never knew this was so clever. Is that with or without an RTL marker?

The following StackOverflow question and response indicate that there may be more issues, but it was asked before tcl/tk 8.6 was available, so the answer may be partially obsolete.
-- Terry Jan Reedy
-- https://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On 01/28/2015 03:17 PM, Albert-Jan Roskam wrote: I do not know how complete the support is, but this is copied from 3.4.2, which uses tcl/tk 8.6. t = الحركات for c in t: print(c) # Prints rightmost char above first ا ل ح ر ك ا ت Wow, I never knew this was so clever. Is that with or without an RTL marker? I don't think this has anything to do with Python. Python is simply spitting out unicode characters as it sees them, starting at string position 0 and working to the end. The magic is done by whatever is displaying the utf-8 output from Python. If I copy this text to the clipboard, t = hi there, الحركات! and paste it in my terminal (say to Python's shell), which is not BIDI aware, I get the Arabic letters in reverse order. I tried to paste it here but no matter what I do thunderbird goes into BIDI mode and makes them appear right. -- https://mail.python.org/mailman/listinfo/python-list
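The point that Python hands out characters in logical order, leaving the right-to-left layout to whatever displays them, can be checked with a short sketch. It assumes Python 3; the word is the one from the thread, written with escapes so the listing stays ASCII, and the `rows` list is only an illustrative name.

```python
import unicodedata

# The Arabic word from the thread, spelled with escapes (logical order).
t = "\u0627\u0644\u062d\u0631\u0643\u0627\u062a"

rows = []
for index, ch in enumerate(t):
    # bidirectional() reports the Unicode bidi class; "AL" is Arabic Letter.
    rows.append((index, "U+%04X" % ord(ch),
                 unicodedata.bidirectional(ch), unicodedata.name(ch)))

# Index 0 is the letter a BIDI-aware display draws at the far right.
for row in rows:
    print(*row)
```

Python itself never reorders anything here; the bidi class is just data that a renderer may use when it lays the text out.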
Re: unicode question
On Tue, Jan 27, 2015, at 12:25, Mark Lawrence wrote: People might find this http://bugs.python.org/issue1602 and hence this https://github.com/Drekin/win-unicode-console useful. The latter is available on pypi. However, Arabic is one of those scripts that runs up against the real limitations of the windows console. At least on non-Arabic versions of Windows, you'll just get a sequence of boxes, and it won't do any bidirectional processing either. I have no idea what, if anything, it would do differently on Arabic versions of Windows. -- https://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On 1/27/2015 12:17 AM, Rehab Habeeb wrote: Hi there python staff does python support arabic language for texts ? and what to do if it support it? i wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared( a sytanx error in unicode)!! I do not know how complete the support is, but this is copied from 3.4.2, which uses tcl/tk 8.6. t = الحركات for c in t: print(c) # Prints rightmost char above first ا ل ح ر ك ا ت The following StackOverflow question and response indicate that there may b more issue, but it was asked before tcl/tk 8.6 was available, so the answer may be partially obsolete. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On Tue, Jan 27, 2015, at 00:17, Rehab Habeeb wrote: Hi there python staff does python support arabic language for texts ? and what to do if it support it? i wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared( a sytanx error in unicode)!! Python itself supports arabic just fine, but the MS Windows console in general, and Python's implementation of it in particular, have poor support for many aspects of unicode, so it's important to define exactly what you are trying to do. -- https://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On 27/01/2015 16:13, random...@fastmail.us wrote: On Tue, Jan 27, 2015, at 00:17, Rehab Habeeb wrote: Hi there python staff does python support arabic language for texts ? and what to do if it support it? i wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared( a sytanx error in unicode)!! Python itself supports arabic just fine, but the MS Windows console in general, and Python's implementation of it in particular, have poor support for many aspects of unicode, so it's important to define exactly what you are trying to do. People might find this http://bugs.python.org/issue1602 and hence this https://github.com/Drekin/win-unicode-console useful. The latter is available on pypi. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On Tue, Jan 27, 2015 at 4:17 PM, Rehab Habeeb moonlight06082...@gmail.com wrote: Hi there python staff does python support arabic language for texts ? and what to do if it support it? i wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared( a sytanx error in unicode)!! If you're using Python 3, you have very good support for non-ASCII text, including Arabic. In Python 2, you can work with Unicode data, but your variable/function names all have to be in ASCII. What was your code, and what was the error? Copy and paste them into the email, and we'll be better able to help you. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
unicode question
Hi there python staff, does python support the Arabic language for texts? And what to do if it supports it? I wrote hello in Arabic using codeskulptor and the powershell just for testing and the same error appeared (a syntax error in unicode)!!
Beginner python 3 unicode question
Example interactive:

$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> import base64
>>> base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
b'zsz653co6ii6hgjejqhw42ncgy'

But when I put the same thing into a source file I get this:

Traceback (most recent call last):
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/widget.py", line 94, in __init__
    self.eid = uniqueid()
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/__init__.py", line 34, in uniqueid
    base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
TypeError: Can't convert 'bytes' object to str implicitly

Why is it behaving differently on the command line? What should I do to fix this?

-- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
Re: Beginner python 3 unicode question
On 16-11-2013 20:12, Laszlo Nagy wrote: Example interactive: $ python3 Python 3.3.1 (default, Sep 25 2013, 19:29:01) [GCC 4.7.3] on linux Type help, copyright, credits or license for more information. import uuid import base64 base64.b32encode(uuid.uuid1().bytes)[:-6].lower() b'zsz653co6ii6hgjejqhw42ncgy' But when I put the same thing into a source file I get this: Traceback (most recent call last): File /home/gandalf/Python/Lib/shopzeus/yaaf/ui/widget.py, line 94, in __init__ self.eid = uniqueid() File /home/gandalf/Python/Lib/shopzeus/yaaf/ui/__init__.py, line 34, in uniqueid base64.b32encode(uuid.uuid1().bytes)[:-6].lower() TypeError: Can't convert 'bytes' object to str implicitly Why it is behaving differently on the command line? What should I do to fix this? the error is in one of the lines you did not copy here because this works without problems: BEGIN-of script #!/usr/bin/python import uuid import base64 print base64.b32encode(uuid.uuid1().bytes)[:-6].lower() END-of script But, i need to say, i'm also a beginner ;) -- https://mail.python.org/mailman/listinfo/python-list
Re: Beginner python 3 unicode question
the error is in one of the lines you did not copy here because this works without problems: BEGIN-of script #!/usr/bin/python Most probably, your /usr/bin/python program is python version 2, and not python version 3 Try the same program with /usr/bin/python3. And also try the interactive mode with the same program and I think you will see the same phenomenon. -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -- https://mail.python.org/mailman/listinfo/python-list
Re: Beginner python 3 unicode question
Why is it behaving differently on the command line? What should I do to fix this?

I was experimenting with this a bit more and found some more confusing things. Can somebody please enlighten me? Here is a test function:

def password_hash(self,password):
    public = bytearray([random.randint(0,255) for _ in range(5)])
    private = bytearray([random.randint(0,255)])
    pwd = bytearray(password.encode())
    digest = hashlib.sha1(public+pwd+private).digest()
    print("digest",digest,type(digest))
    print("de",digest.encode())
    # and some more stuff here...

This function was called inside a script, and gave me this:

('digest', '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b', <type 'str'>)
Traceback (most recent call last):
  File "/home/gandalf/Python/Lib/shopzeus/scripts/yaaf_pwmgr.py", line 478, in <module>
    pwmgr.run(parser,args)
  File "/home/gandalf/Python/Lib/shopzeus/scripts/yaaf_pwmgr.py", line 241, in run
    self.authdb.user_create(name,password,propvalues)
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/db/authdb.py", line 205, in user_create
    password:(password and Binary(self.password_hash(password))) or None,
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/db/authdb.py", line 134, in password_hash
    print("de",digest.encode())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

Then I have tried the very same thing from the interactive shell:

gandalf@gandalf-HP-G62-Notebook-PC:~/Python/Projects/appserver$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> digest = '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b'
>>> digest.encode()
b'\xc2\xa0\xc2\x98\xc2\x8b\xc3\xbf\x04\xc3\xb9V;\xc2\xbd\x1eIHzh\x10-\xc3\x85!\x14\x1b'

WHAT??? Seems like the default value of the encoding parameter of the str.encode method is different if I start it interactively. But this contradicts its documentation:

>>> print(digest.encode.__doc__)
S.encode(encoding='utf-8', errors='strict') -> bytes

Encode S using the codec registered for encoding. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.

So is the default utf-8 or not? Should the documentation be updated? Or do we have a bug in the interactive shell?
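For what it is worth, the interactive result is what Python 3 documents. A minimal sketch, assuming Python 3 and using only the first four characters of the string from the session (the `b'example'` input is a made-up placeholder):

```python
import hashlib

# First four characters of the str typed at the Python 3 prompt above.
text = '\xa0\x98\x8b\xff'

# With no argument, str.encode() uses UTF-8, so each character in the
# U+0080..U+00FF range becomes two bytes.
encoded = text.encode()
print(encoded)
print(encoded == text.encode('utf-8'))

# In Python 3 a SHA-1 digest is already bytes, and bytes has no .encode().
digest = hashlib.sha1(b'example').digest()
print(type(digest).__name__, hasattr(digest, 'encode'))
```

The `UnicodeDecodeError` mentioning the 'ascii' codec in the script run is typical of Python 2, where calling `.encode()` on a byte string first decodes it implicitly as ASCII.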
Re: Beginner python 3 unicode question
On 16-11-2013 21:57, Laszlo Nagy wrote: the error is in one of the lines you did not copy here because this works without problems: BEGIN-of script #!/usr/bin/python Most probably, your /usr/bin/python program is python version 2, and not python version 3 Try the same program with /usr/bin/python3. And also try the interactive mode with the same program and I think you will see the same phenomenon. adding some '()' helped: BEGIN-of script #!/usr/bin/python3 import uuid import base64 print (base64.b32encode(uuid.uuid1().bytes)[:-6].lower()) END-of script ~/temp python3 --version Python 3.3.0 -- https://mail.python.org/mailman/listinfo/python-list
Re: Beginner python 3 unicode question [SOLVED]
So is the default utf-8 or not? Should the documentation be updated? Or do we have a bug in the interactive shell? It was my fault, sorry. The other program used os.system at some places, and it accidentally used python2 instead of python 3. :-( -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -- https://mail.python.org/mailman/listinfo/python-list
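One way to avoid this kind of mix-up, sketched here under the assumption that the child process is meant to run under the same interpreter as the parent, is to launch it through `sys.executable` rather than a bare `python` command. The inline `-c` program is only a stand-in for the real script.

```python
import subprocess
import sys

# sys.executable is the path of the interpreter running this code, so the
# child cannot silently fall back to a different "python" found on PATH.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info[0])"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
child_major = int(result.stdout.strip())
print("child runs Python", child_major)
```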
Re: Beginner python 3 unicode question
On Sun, Nov 17, 2013 at 8:19 AM, Laszlo Nagy gand...@shopzeus.com wrote: print(digest,digest,type(digest)) This function was called inside a script, and gave me this: ('digest', '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b', type 'str') This looks very much like you're running under Python 2. Take care of which interpreter you're running; that might be because of your shebang (as Luuk mentioned), or because of what you're typing to invoke the script; either way, it makes a huge difference. The easiest solution is probably to invoke the interpreter explicitly: Interactive mode: $ python3 Script mode: $ python3 scriptname.py But you seem to have something WAY more complex than a single script. What's the setup? How is Python getting invoked? If your code is getting imported by something else, no shebang will help you - you need the other code to be being executed by the other interpreter. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
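When it is unclear which interpreter ends up running a module, a small guard near the entry point can make the answer visible. This is only a sketch; the helper name `describe_interpreter` is made up for illustration.

```python
import sys


def describe_interpreter():
    """Return (major version, interpreter path) for the running Python."""
    return sys.version_info[0], sys.executable


major, executable = describe_interpreter()
print("Python major version:", major)
print("Interpreter path:", executable)

# Fail early and loudly instead of producing confusing str/bytes errors later.
if major < 3:
    raise SystemExit("This code expects Python 3, got %s" % sys.version)
```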
Re: Beginner python 3 unicode question [SOLVED]
On Sun, Nov 17, 2013 at 8:44 AM, Laszlo Nagy gand...@shopzeus.com wrote: So is the default utf-8 or not? Should the documentation be updated? Or do we have a bug in the interactive shell? It was my fault, sorry. The other program used os.system at some places, and it accidentally used python2 instead of python 3. :-( Oh! Didn't see this post before responding. Oh well. Maybe someone else one day will make use of the other. :) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
tkinter unicode question
Just curious if anyone could shed some light on this? I'm using tkinter, but I can't seem to get certain unicode characters to show in the label for Python 3. In my test, the label and button will contain the same 3 characters - a Greek Alpha, a Greek Omega with a circumflex and soft breathing accent, and then a Greek Alpha with a soft breathing accent.

For Python 2.6, this works great:

# -*- coding: utf-8 -*-
from Tkinter import *
root = Tk()
Label(root, text=u'\u03B1 \u1F66 \u1F00').pack()
Button(root, text=u'\u03B1 \u1F66 \u1F00').pack()
root.mainloop()

However, for Python 3.1.2, the button gets the correct characters, but the label only displays the first Greek Alpha character. The other 2 characters look like Chinese characters followed by an empty box. Here's the code for Python 3:

# -*- coding: utf-8 -*-
from tkinter import *
root = Tk()
Label(root, text='\u03B1 \u1F66 \u1F00').pack()
Button(root, text='\u03B1 \u1F66 \u1F00').pack()
root.mainloop()

I've done some research and am wondering if it is because Python 2.6 comes with tk version 8.5, while Python 3.1.2 comes with tk version 8.4? I'm running this on OS X 10.6.4. Here's a link I found that mentions this same problem: http://www.mofeel.net/871-comp-lang-python/5879.aspx

If I need to upgrade tk to 8.5, is it best to upgrade it or just install 'tiles'? From my readings it looks like upgrading to 8.5 can be a pain due to OS X still pointing back to 8.4. I haven't tried it yet in case someone might have an easier solution.

Thanks for looking at my question.
Jay
-- http://mail.python.org/mailman/listinfo/python-list
Re: tkinter unicode question
In article 20100727204532.r7gmz.27213.r...@cdptpa-web20-z02, jyoun...@kc.rr.com wrote: Just curious if anyone could shed some light on this? I'm using tkinter, but I can't seem to get certain unicode characters to show in the label for Python 3. In my test, the label and button will contain the same 3 characters - a Greek Alpha, a Greek Omega with a circumflex and soft breathing accent, and then a Greek Alpha with a soft breathing accent. For Python 2.6, this works great: # -*- coding: utf-8 -*- from Tkinter import * root = Tk() Label(root, text=u'\u03B1 \u1F66 \u1F00').pack() Button(root, text=u'\u03B1 \u1F66 \u1F00').pack() root.mainloop() However, for Python 3.1.2, the button gets the correct characters, but the label only displays the first Greek Alpha character. The other 2 characters look like Chinese characters followed by an empty box. Here's the code for Python 3: # -*- coding: utf-8 -*- from tkinter import * root = Tk() Label(root, text='\u03B1 \u1F66 \u1F00').pack() Button(root, text='\u03B1 \u1F66 \u1F00').pack() root.mainloop() I've done some research and am wondering if it is because Python 2.6 comes with tk version 8.5, while Python 3.1.2 comes with tk version 8.4? I'm running this on OS X 10.6.4. Most likely. Apparently you're using the Apple-supplied Python 2.6 which, as you say, uses Tk 8.5. If you had installed the python.org 2.6, it would likely fail for you in the same way as 3.1, since both use Tk 8.4. (They both fail for me.) If I need to upgrade tk to 8.5, is it best to upgrade it or just install 'tiles'? From my readings it looks like upgrading to 8.5 can be a pain due to OS X still pointing back to 8.4. I haven't tried it yet in case someone might have an easier solution. OS X 10.6 comes with both Tk 8.4 and 8.5. The problem is that the Python Tkinter(2.6) or tkinter(3.1) is linked at build time, not install time, to one or the other. 
You would need to at least rebuild and relink tkinter for 3.1 to use Tk 8.5, which means downloading and building Python from source. New releases of python.org installers are now coming in two varieties: the second will be only for 10.6 or later and will link with Tk 8.5. The next new release of Python 3 is likely months away, though. In the meantime, a simpler solution might be to download and install the ActiveState Python 3.1 for OS X which does use Tk 8.5. And your test case works for me with it. -- Ned Deily, n...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
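Since the Tk version is fixed when tkinter is built, it can help to ask the running interpreter which one it was linked against. A small sketch, assuming Python 3 module naming; the `try`/`except` is there only because some Python builds ship without Tk support.

```python
try:
    import tkinter
except ImportError:
    # Some installations are built without Tk at all.
    tkinter = None

# TkVersion is a float such as 8.5 or 8.6, set when tkinter was built.
tk_version = tkinter.TkVersion if tkinter else None
print("Tk version linked into this Python:", tk_version)
```

No window is opened by this check, so it can be run from a plain terminal session.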
Another (simple) unicode question
Construct http://construct.wikispaces.com/ is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this? -- http://mail.python.org/mailman/listinfo/python-list
Re: Another (simple) unicode question
On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:
Construct http://construct.wikispaces.com/ is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this?

"unicode related stuff" is rather vague. Have you read the Python Unicode HOWTO? Joel Spolsky's article?

http://www.amk.ca/python/howto/unicode
http://www.joelonsoftware.com/articles/Unicode.html

In any case, it's a debugging problem, isn't it? Could you possibly consider telling us the error message, the traceback, a few lines of the 3.x code around where the problem is, and the corresponding 2.x lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?
Re: Another (simple) unicode question
On Oct 29, 4:02 am, Rustom Mody rustompm...@gmail.com wrote:
Construct http://construct.wikispaces.com/ is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around.

2to3 isn't a general Python 2 to Python 3 translator. You can't pass any old Python 2.x code through 2to3 and expect it to work. Rather, you have to write the Python 2.x code in a subset of Python that I call "transitional dialect". In order to port to Python 3 using 2to3, you first have to port it to this transitional dialect.

If Unicode is the issue, one thing you should do is to explicitly classify all strings as binary or text in Python 2.x. This means changing str() to unicode() or bytes(), whichever is appropriate, and changing "" to u"" or b"".

Carl Banks
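As an illustration of what classifying every string as binary or text can look like, here is a minimal sketch. The sample values are made up; the `u""` and `b""` prefixes are accepted by Python 2.6+ and again by Python 3.3+.

```python
# Text: a sequence of code points (unicode in 2.x, str in 3.x).
text = u"caf\u00e9"

# Binary: the same word as UTF-8 bytes, as it would appear in a file.
data = b"caf\xc3\xa9"

# Conversions between the two are always explicit and name the codec.
decoded = data.decode("utf-8")
encoded = text.encode("utf-8")
print(decoded == text, encoded == data)
```

Once every literal carries an explicit prefix, an automated translation has far less guessing to do about which kind of string was intended.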
Re: Another (simple) unicode question
John Machin wrote:
On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:... I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this?

to which John replied with some good links to start, and then:

In any case, it's a debugging problem, isn't it? Could you possibly consider telling us the error message, the traceback, a few lines of the 3.x code around where the problem is, and the corresponding 2.x lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?

Also consider how 2to3 translates the problem section(s).

--Scott David Daniels scott.dani...@acm.org
Re: a simple unicode question
En Wed, 28 Oct 2009 02:28:01 -0300, Chris Jones cjns1...@gmail.com escribió: On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote: Chris Jones wrote: Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; RFC 3629: ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible? Start reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky. http://www.joelonsoftware.com/articles/Unicode.html -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
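The distinction being argued over, one code point versus several encoding forms of it, can be seen directly. A short sketch assuming Python 3; the choice of DEGREE SIGN and of the big-endian codec variants (which avoid a byte order mark) is mine.

```python
ch = "\N{DEGREE SIGN}"

# The code point is a number assigned by Unicode, independent of any bytes.
code_point = ord(ch)
print("code point: U+%04X" % code_point)

# Each encoding form turns that same number into a different byte sequence.
forms = {codec: ch.encode(codec)
         for codec in ("utf-8", "utf-16-be", "utf-32-be")}
for codec, raw in forms.items():
    print(codec, raw)
```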
Re: a simple unicode question
Chris Jones cjns1...@gmail.com wrote in message news:mailman.2149.1256707687.2807.python-l...@python.org... On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote: Chris Jones wrote: [..] Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; RFC 3629: ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible? CJ When I first saw it, my first thought was that the subjectline was an oxymoron. --Tim Arnold -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote: Chris Jones wrote: [..] Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; RFC 3629: ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible? CJ -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
Chris Jones wrote: On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote: [..] Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively: I knew nothing about UTF-16 friends before this thread. Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió: On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote: beSTEfar a écrit : (snip) When parsing strings, use Regular Expressions. And now you have _two_ problems g For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser. But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough. I don't think so. Nesting isn't the only problem. RE's cannot handle comments, by example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote: [..] Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively: I knew nothing about UTF-16 friends before this thread. Best part of Unicode is that there are multiple encodings, right? ;-) Moot point on xterm anyway, since you'd be hard put to it to find a decent terminal font that covers anything outside the BMP. Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32 unichr(0x1) Traceback (most recent call last): File stdin, line 1, in module ValueError: unichr() arg not in range(0x1) (narrow Python build) Note that narrow builds do understand names outside of the BMP, and generate surrogate pairs for them: u'\N{LINEAR B SYLLABLE B008 A}' u'\U0001' len(_) 2 Whether or not using surrogates in this context is a good idea is open to debate. What's the advantage of a multi-wchar string over a multi-byte string? I don't understand this last remark, but since I'm only a GNU/Linux hobbyist, I guess it doesn't make much difference. Thanks for the code snippet and comments. CJ -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On 10/22/2009 03:23 AM, Gabriel Genellina wrote: En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió: On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote: beSTEfar a écrit : (snip) When parsing strings, use Regular Expressions. And now you have _two_ problems g For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser. But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough. I don't think so. Nesting isn't the only problem. RE's cannot handle comments, by example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't. OK, then let me rephrase my point as: in the real world it is often not necessary to parse XML in it's full generality; parsing, as you put it, a specific file format that resembles XML is all that is really needed. -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
En Thu, 22 Oct 2009 17:08:21 -0300, ru...@yahoo.com escribió: On 10/22/2009 03:23 AM, Gabriel Genellina wrote: En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió: On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote: beSTEfar a écrit : (snip) When parsing strings, use Regular Expressions. And now you have _two_ problems g For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser. But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough. I don't think so. Nesting isn't the only problem. RE's cannot handle comments, by example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't. OK, then let me rephrase my point as: in the real world it is often not necessary to parse XML in it's full generality; parsing, as you put it, a specific file format that resembles XML is all that is really needed. Given that using a real XML parser like ElementTree is as easy as (or even easier than) building a regular expression, and more robust, and more likely to survive small changes in the input format, why use the worse solution? RE's are good in solving some problems, but parsing XML isn't one of those. -- Gabriel Genellina -- http://mail.python.org/mailman/listinfo/python-list
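To make the comparison concrete, here is a minimal ElementTree sketch. The document, element names and attribute names are invented for illustration; it mixes a comment, both quote styles, varying attribute order and an empty-element tag, which are some of the variations listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with a comment, mixed quotes and attribute order.
doc = """<readings>
  <!-- a comment a naive regex could trip over -->
  <reading unit='C' value="21.5"/>
  <reading value="19.0"   unit="C"></reading>
</readings>"""

root = ET.fromstring(doc)

# iter() walks matching elements; get() reads an attribute by name,
# regardless of quoting style or position within the tag.
values = [float(r.get("value")) for r in root.iter("reading")]
print(values)
```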
Re: a simple unicode question
George Trojan george.tro...@noaa.gov wrote in message news:hbktk6$8b...@news.nems.noaa.gov... Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? George Scott David Daniels wrote: Mark Tolonen wrote: Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try: IDLE 2.6.3 s = '''48\xc2\xb0 13' 16.80 N''' q = s.decode('utf-8') degrees, rest = q.split(u'\N{DEGREE SIGN}') print degrees 48 print rest 13' 16.80 N And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' It wouldn't be your favorite way if you were typing Chinese: x = u'我是美国人。' vs. x = u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}' ;^) Mark -- http://mail.python.org/mailman/listinfo/python-list
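For readers on Python 3, the approaches above might look roughly like this. It is a sketch, not a transcription: the sample coordinate is the one from the thread as it appears in this archive, and the variable names are mine.

```python
import re

DEGREE = "\N{DEGREE SIGN}"          # pure-ASCII spelling of the character

# The raw data as UTF-8 bytes; \xc2\xb0 is the encoded degree sign.
raw = b"48\xc2\xb0 13' 16.80 N"
q = raw.decode("utf-8")

# Option 1: split on the named character.
degrees, rest = q.split(DEGREE)

# Option 2: a regular expression anchored on the same character.
match = re.search(r"(\d+)" + DEGREE, q)

print(int(degrees), repr(rest), match.group(1))
```

Building the pattern by concatenation sidesteps the question of whether the `re` module itself understands `\N{...}` inside a raw string on a given Python version.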
Re: a simple unicode question
George Trojan wrote: Scott David Daniels wrote: ... And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? I thought the mention of unicodedata would make it clear. for n in xrange(sys.maxunicode+1): try: nm = unicodedata.name(unichr(n)) except ValueError: pass else: if 'tortoise' in nm.lower(): print n, nm --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
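For readers on Python 3, Scott's loop could be written roughly as below. This is a sketch, not his code: `find_names` is a name chosen here for illustration, `chr` and `range` replace `unichr` and `xrange`, and the second argument to `unicodedata.name` is used in place of the try/except.

```python
import sys
import unicodedata


def find_names(fragment):
    """Return (code point, name) pairs whose Unicode name contains `fragment`."""
    fragment = fragment.upper()          # character names are upper case
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, ...) yield "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append((cp, name))
    return hits


for cp, name in find_names("degree sign"):
    print(hex(cp), name)
```

Which names are found depends on the Unicode database version shipped with the interpreter, as Martin notes elsewhere in the thread.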
Re: a simple unicode question
On Wed, Oct 21, 2009 at 12:20:35AM EDT, Nobody wrote: On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote: [..] Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? You can get them from the unicodedata module, e.g.: import unicodedata for i in xrange(0x1): n = unicodedata.name(unichr(i),None) if n is not None: print i, n Python rocks! Just curious, why did you choose to set the upper boundary at 0x? CJ -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
beSTEfar wrote:
(snip)
> When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.
Re: a simple unicode question
On Wed, 21 Oct 2009 05:16:56 -0400, Chris Jones wrote:
>>> Where are the literals (i.e. u'\N{DEGREE SIGN}') defined?
>>
>> You can get them from the unicodedata module, e.g.:
>>
>>     import unicodedata
>>     for i in xrange(0x10000):
>>         n = unicodedata.name(unichr(i), None)
>>         if n is not None:
>>             print i, n
>
> Python rocks! Just curious, why did you choose to set the upper boundary at 0x?

Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively:

    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
    >>> unichr(0x10000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and generate surrogate pairs for them:

    >>> u'\N{LINEAR B SYLLABLE B008 A}'
    u'\U00010000'
    >>> len(_)
    2

Whether or not using surrogates in this context is a good idea is open to debate. What's the advantage of a multi-wchar string over a multi-byte string?
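As an aside on what a narrow build does with such a character: the surrogate pair is plain arithmetic on the code point. The sketch below is Python 3, where every build stores full code points, so the pair is computed by hand and checked against the UTF-16 codec. `surrogate_pair` and `utf16_units` are illustrative names, not standard library functions.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000            # 20 significant bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_units(char):
    """Split a one-character string into its 16-bit UTF-16 code units."""
    data = char.encode("utf-16-be")
    return tuple(int.from_bytes(data[i:i + 2], "big")
                 for i in range(0, len(data), 2))


linear_b_a = "\N{LINEAR B SYLLABLE B008 A}"   # U+10000, the example from the post
print([hex(unit) for unit in surrogate_pair(ord(linear_b_a))])
print(utf16_units(linear_b_a) == surrogate_pair(ord(linear_b_a)))
```

The two code units are why `len(_)` reports 2 on the narrow build shown above.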
Re: a simple unicode question
On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote: beSTEfar a écrit : (snip) When parsing strings, use Regular Expressions. And now you have _two_ problems g For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser. But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough. -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
Nobody wrote: Just curious, why did you choose to set the upper boundary at 0x? Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively: Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32 unichr(0x1) Traceback (most recent call last): File stdin, line 1, in module ValueError: unichr() arg not in range(0x1) (narrow Python build) In Python 3, if not 2.6, chr(0x1) (what used to be unichr()) works fine on Windows, and generates the appropriate surrogate pair. -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
Mark Tolonen wrote: Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try: IDLE 2.6.3 s = '''48\xc2\xb0 13' 16.80 N''' q = s.decode('utf-8') degrees, rest = q.split(u'\N{DEGREE SIGN}') print degrees 48 print rest 13' 16.80 N And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
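For comparison, a rough Python 3 rendering of the same session, kept in pure ASCII source as Scott suggests. It assumes, as Mark diagnosed, that the input bytes are UTF-8; the variable names are chosen here for illustration. It also shows why decoding as iso-8859-1 produced the two-character `\xc2\xb0` in George's output.

```python
import unicodedata

raw = b"48\xc2\xb0 13' 16.80 N"       # the bytes from the thread, written in ASCII

wrong = raw.decode("iso-8859-1")      # degree sign becomes two characters: \xc2 \xb0
text = raw.decode("utf-8")            # degree sign becomes one character: U+00B0

degrees, rest = text.split("\N{DEGREE SIGN}")
print(int(degrees), rest.strip())
print(unicodedata.name(text[2]))
```

In Python 3 there is no `u` prefix to worry about; `str` literals accept `\N{...}` escapes directly.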
Re: a simple unicode question
Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? George Scott David Daniels wrote: Mark Tolonen wrote: Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try: IDLE 2.6.3 s = '''48\xc2\xb0 13' 16.80 N''' q = s.decode('utf-8') degrees, rest = q.split(u'\N{DEGREE SIGN}') print degrees 48 print rest 13' 16.80 N And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote: Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? You can get them from the unicodedata module, e.g.: import unicodedata for i in xrange(0x1): n = unicodedata.name(unichr(i),None) if n is not None: print i, n -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? Correct - you are supposed to fill in a Unicode character name into the \N escape. The specific list of names depends on the version of the UCD which was used in the specific Python version, but the characters you are likely interested in probably had been defined forever. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
a simple unicode question
A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did

    >>> encoding = 'iso-8859-1'
    >>> q = s.decode(encoding)
    >>> q.split()
    [u'48\xc2\xb0', u"13'", u'16.80', u'N']
    >>> r = q.split()[0]
    >>> int(r[:r.find(unichr(ord('\xc2')))])
    48

Is there a better way of getting the degrees?

George
Re: a simple unicode question
George Trojan wrote:
> A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did
>
>     >>> encoding = 'iso-8859-1'
>     >>> q = s.decode(encoding)
>     >>> q.split()
>     [u'48\xc2\xb0', u"13'", u'16.80', u'N']
>     >>> r = q.split()[0]
>     >>> int(r[:r.find(unichr(ord('\xc2')))])
>     48
>
> Is there a better way of getting the degrees?

Instead of this rather convoluted way to specify a degree sign, better do

    # -*- coding: utf-8 -*-
    ...
    int(r[:r.find(u'°')])

Please note that the utf-8 encoding has *nothing* to do with your string; it's just the source-file encoding. Of course your editor must use utf-8 when saving the file. Or you can use any other encoding you like.

Diez
Re: a simple unicode question
On 19 Okt, 21:07, George Trojan george.tro...@noaa.gov wrote: A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did encoding='iso-8859-1' q=s.decode(encoding) q.split() [u'48\xc2\xb0', u13', u'16.80', u'N'] r=q.split()[0] int(r[:r.find(unichr(ord('\xc2')))]) 48 Is there a better way of getting the degrees? George When parsing strings, use Regular Expressions. If you don't know how to, spend some time teaching yourself how to - well spent time! A great tool for playing around with REs is KODOS. For the problem at hand you can e.g.: import re degrees = int(re.findall('\d+', s)[0]) that in essence will group together all groups of consecutive digits, return the first group and int() it. No need to care/know about the fact that the string is Unicode and the underlying coding of the charset. -- http://mail.python.org/mailman/listinfo/python-list
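If the regular-expression route is taken, the same idea stretches to the whole coordinate. The following is only a sketch in Python 3: the pattern, the `DMS` name and the `to_decimal_degrees` helper are assumptions made here, and the pattern deliberately treats whatever sits between the numbers as "non-digits", so it does not care which character (or byte sequence) stands in for the degree sign.

```python
import re

# degrees, minutes, seconds, then a hemisphere letter; separators are "anything non-digit"
DMS = re.compile(r"(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*?([NSEW])")


def to_decimal_degrees(text):
    """Convert a string like 48<deg> 13' 16.80 N to signed decimal degrees."""
    match = DMS.search(text)
    if match is None:
        raise ValueError("no coordinate found in %r" % text)
    deg, minutes, seconds, hemisphere = match.groups()
    value = int(deg) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in "SW" else value


print(round(to_decimal_degrees("48\N{DEGREE SIGN} 13' 16.80 N"), 6))
```

This tolerance is also its weakness: input that merely looks like a coordinate will be accepted, so it suits the "simple parsing problem" case discussed later in the thread rather than untrusted data.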
Re: a simple unicode question
George Trojan george.tro...@noaa.gov wrote in message news:hbidd7$i9...@news.nems.noaa.gov... A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did encoding='iso-8859-1' q=s.decode(encoding) q.split() [u'48\xc2\xb0', u13', u'16.80', u'N'] r=q.split()[0] int(r[:r.find(unichr(ord('\xc2')))]) 48 Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN: -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
George Trojan george.tro...@noaa.gov wrote in message news:hbidd7$i9...@news.nems.noaa.gov... A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did encoding='iso-8859-1' q=s.decode(encoding) q.split() [u'48\xc2\xb0', u13', u'16.80', u'N'] r=q.split()[0] int(r[:r.find(unichr(ord('\xc2')))]) 48 Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) -Mark -- http://mail.python.org/mailman/listinfo/python-list
Re: python 3.1 unicode question
jeffunit j...@jeffunit.com wrote: That looks like a surrogate escape (See PEP 383) http://www.python.org/dev/peps/pep-0383/. It indicates the wrong encoding was used to decode the filename. That seems likely. How do I set the encoding to something correct to decode the filename? Clearly windows knows how to display it. I suspect since I complied python with cygwin, that it is using a POSIX standard, rather than a windows specific standard. Of course ideally, I would like my code to work on linux as well as windows, as I back up all of my data to a linux machine with samba. If you are running on a Linux system then the filenames are stored encoded as bytes but the system does not store the encoding. In fact different files in the same directory could use different encodings. That's why Python 3.1 uses the surrogate escapes so that you can at least work with the files even if you can't display the filenames. If you are running on Windows and using the native Python to access an NTFS formatted partition then there shouldn't be a problem: the filenames are stored as unicode and Python uses the unicode apis. Of course you may still not be able to display the filenames if they contain characters not available in your output codepage. If you use cygwin a quick search on Google turned up some old discussions implying that it uses the 8 bit apis which convert characters using the current codepage and converts characters it cannot handle to '?' but I have no idea if that still applies. -- http://mail.python.org/mailman/listinfo/python-list
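A small Python 3 sketch of the mechanism described here. It assumes (this is a guess, not something the thread confirms) that the file name was stored as Latin-1 bytes and decoded as UTF-8 with the `surrogateescape` error handler from PEP 383; only the tail of the name from the post is used.

```python
raw_name = b"Qu\xe9_no_se_rompa_la_noche.mp3"   # assumed Latin-1 bytes; not valid UTF-8

# Undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))

try:
    name.encode("ascii")                        # roughly what printing to an ASCII stdout does
except UnicodeEncodeError as exc:
    print("cannot print as-is:", exc.reason)

# The same handler restores the original bytes, which can then be
# decoded with the encoding the name was really written in.
round_trip = name.encode("utf-8", errors="surrogateescape")
readable = round_trip.decode("iso-8859-1")
print(ascii(readable))
```

So the escaped name is still usable for opening or deleting the file; it just cannot be displayed until the right encoding is known.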
Re: python 3.1 unicode question
jeffunit j...@jeffunit.com wrote in message news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com... I wrote a program that diffs files and prints out matching file names. I will be executing the output with sh, to delete select files. Most of the files names are plain ascii, but about 10% of them have unicode characters in them. When I try to print the string containing the name, I get an exception: 'ascii' codec can't encode character '\udce9' in position 37: ordinal not in range(128) The string is: './Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3' This is on a windows xp system, using python 3.1 which I compiled with the cygwin linux compatability layer tool. Can you tell me what encoding I need to print \udce9 and how to set python to that encoding mode? That looks like a surrogate escape (See PEP 383) http://www.python.org/dev/peps/pep-0383/. It indicates the wrong encoding was used to decode the filename. -Mark -- http://mail.python.org/mailman/listinfo/python-list
Re: python 3.1 unicode question
At 09:25 PM 9/15/2009, Mark Tolonen wrote: jeffunit j...@jeffunit.com wrote in message news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com... I wrote a program that diffs files and prints out matching file names. I will be executing the output with sh, to delete select files. Most of the files names are plain ascii, but about 10% of them have unicode characters in them. When I try to print the string containing the name, I get an exception: 'ascii' codec can't encode character '\udce9' in position 37: ordinal not in range(128) The string is: './Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3' This is on a windows xp system, using python 3.1 which I compiled with the cygwin linux compatability layer tool. Can you tell me what encoding I need to print \udce9 and how to set python to that encoding mode? That looks like a surrogate escape (See PEP 383) http://www.python.org/dev/peps/pep-0383/. It indicates the wrong encoding was used to decode the filename. That seems likely. How do I set the encoding to something correct to decode the filename? Clearly windows knows how to display it. I suspect since I complied python with cygwin, that it is using a POSIX standard, rather than a windows specific standard. Of course ideally, I would like my code to work on linux as well as windows, as I back up all of my data to a linux machine with samba. thanks, jeff -- http://mail.python.org/mailman/listinfo/python-list
Re: python 3.1 unicode question
On Tue, Sep 15, 2009 at 9:48 PM, jeffunit j...@jeffunit.com wrote: At 09:25 PM 9/15/2009, Mark Tolonen wrote: jeffunit j...@jeffunit.com wrote in message news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com... I wrote a program that diffs files and prints out matching file names. I will be executing the output with sh, to delete select files. Most of the files names are plain ascii, but about 10% of them have unicode characters in them. When I try to print the string containing the name, I get an exception: 'ascii' codec can't encode character '\udce9' in position 37: ordinal not in range(128) The string is: './Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3' This is on a windows xp system, using python 3.1 which I compiled with the cygwin linux compatability layer tool. Can you tell me what encoding I need to print \udce9 and how to set python to that encoding mode? That looks like a surrogate escape (See PEP 383) http://www.python.org/dev/peps/pep-0383/. It indicates the wrong encoding was used to decode the filename. That seems likely. How do I set the encoding to something correct to decode the filename? Clearly windows knows how to display it. I suspect since I complied python with cygwin, that it is using a POSIX standard, rather than a windows specific standard. Of course ideally, I would like my code to work on linux as well as windows, as I back up all of my data to a linux machine with samba. Have you perhaps tried using the native Windows version of Python? Cheers, Chris -- http://blog.rebertia.com -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote: So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. Did I mention UTF-8? Out of curiosity, why do you say that UTF-8 isn't sensible for terminals? I don't think I've ever seen a terminal (whether an emulator running on a PC or a hardware terminal) which supports anything like the entire Unicode repertoire, along with right-to-left writing, complex scripts, etc. Even support for double-width characters is uncommon. If your terminal can't handle anything outside of ISO-8859-1, there isn't any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix tty driver will delete the last *byte* from the input buffer when you press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard). Historically, terminal I/O has tended to revolve around unibyte encodings, with everything except the endpoints being encoding-agnostic. Anything which falls outside of that is a dog's breakfast; it's no coincidence that the word for messed-up text (arising from an encoding mismatch) was borrowed from Japanese (mojibake). Life is simpler if you can use a unibyte encoding. Apart from anything else, the failure modes tend to be harmless. E.g. you get the wrong glyph rather than two glyphs where you expected one. On a 7-bit channel, you get the wrong printable character rather than a control character (this is why ISO-8859-* reserves \x80-\x9F as control codes rather than using them as printable characters). And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I never mentioned Unicode font either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style. 
Consistency between unrelated scripts is neither realistic nor desirable. E.g. Latin fonts tend to use uniform stroke widths unless they're specifically designed to look like handwriting, whereas Han fonts tend to prefer variable-width strokes which reflect the direction. The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII? Nothing stops you from using other encodings, or from using multiple encodings. But using multiple encodings means keeping track of the encodings. This isn't impossible, and it may produce better results (e.g. no information loss from Han unification), but it can be a lot more work. -- http://mail.python.org/mailman/listinfo/python-list
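The backspace problem mentioned above is easy to reproduce away from any terminal. Below is a Python 3 sketch; `naive_backspace` and `utf8_backspace` are illustrative names invented here, not how any particular tty driver is implemented. The first drops one byte, the second drops one whole UTF-8 sequence by skipping back over continuation bytes (those of the form 10xxxxxx).

```python
def naive_backspace(buf):
    """Delete the last byte, as a byte-oriented line editor would."""
    return buf[:-1]


def utf8_backspace(buf):
    """Delete the last complete UTF-8 encoded character from a byte buffer."""
    end = len(buf) - 1
    # Walk back over continuation bytes to the lead byte of the sequence.
    while end > 0 and (buf[end] & 0xC0) == 0x80:
        end -= 1
    return buf[:max(end, 0)]


line = "caf\xe9".encode("utf-8")          # b'caf\xc3\xa9'

try:
    naive_backspace(line).decode("utf-8")
except UnicodeDecodeError:
    print("naive backspace left half a character behind")

print(utf8_backspace(line))
```

With a single-byte encoding the two functions are equivalent, which is the "harmless failure mode" argument made in the post.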
Re: (Simple?) Unicode Question
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700) Further, does anything, except a printing device need to know the encoding of a piece of text? Python needs to know if you are processing the text. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. Thorsten -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote: * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700) Further, does anything, except a printing device need to know the encoding of a piece of text? Python needs to know if you are processing the text. Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. If all your text contains nothing but ASCII characters, you should never need to worry about encodings at all. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote: Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. Why would you generate text strings and not output them anywhere? The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote: On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote: Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. Did I mention UTF-8? Out of curiosity, why do you say that UTF-8 isn't sensible for terminals? And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I never mentioned Unicode font either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. Why would you generate text strings and not output them anywhere? Who knows? It doesn't matter -- the point is that you can if you want to. You only need to worry about encodings at input and output, therefore logically if you don't do I/O you can process strings all day long and never worry about encodings at all. 
The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII? -- Steven -- http://mail.python.org/mailman/listinfo/python-list
(Simple?) Unicode Question
Hi All! I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed? To make my question clear consider reading a file. While reading a file, all I get is basically an array of bytes. Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, 2nd bytes and write all the bytes back to a new file. Do i need the character encoding mumbo jumbo anywhere in this? Further, does anything, except a printing device need to know the encoding of a piece of text? I mean, as long as we are not trying to get a symbolic representation of a text or get ith character of it, all we need to do is to carry the intended encoding as an auxiliary information to the data stored as byte array. Right? --shashank -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
Further, does anything, except a printing device need to know the encoding of a piece of text? I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) If you're using Python 2.x, though, I believe if you simply set the file opening mode to binary then data you read() should still be treated as an array of bytes, although you may encounter issues trying to access the n'th character. Please do correct me if I'm wrong, anyone. On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh shashank.sunny.si...@gmail.com wrote: Hi All! I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed? To make my question clear consider reading a file. While reading a file, all I get is basically an array of bytes. Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, 2nd bytes and write all the bytes back to a new file. Do i need the character encoding mumbo jumbo anywhere in this? Further, does anything, except a printing device need to know the encoding of a piece of text? I mean, as long as we are not trying to get a symbolic representation of a text or get ith character of it, all we need to do is to carry the intended encoding as an auxiliary information to the data stored as byte array. Right? --shashank -- Rami Chowdhury Never attribute to malice that which can be attributed to stupidity -- Hanlon's Razor 408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD) -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
> Hi All! I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed?
>
> To make my question clear consider reading a file. While reading a file, all I get is basically an array of bytes. Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, 2nd bytes and write all the bytes back to a new file. Do i need the character encoding mumbo jumbo anywhere in this?
>
> Further, does anything, except a printing device need to know the encoding of a piece of text? I mean, as long as we are not trying to get a symbolic representation of a text or get ith character of it, all we need to do is to carry the intended encoding as an auxiliary information to the data stored as byte array.

If you are just reading and writing bytes then you are just reading and writing bytes. Where you need to worry about Unicode, etc. is when you start treating a series of bytes as TEXT (e.g. how many *characters* are in this byte array).*

This is no different, IMO, from treating an image file as a byte stream. You don't need to worry about resolution, palette, bit-depth, etc. if you are only treating it as a stream of bytes. The only difference between the two is that in Python unicode is a built-in type and image isn't ;)

* Just make sure that if you are manipulating byte streams independently of their textual representation, you open files in binary mode.

-a
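The distinction can be shown without touching a file. A Python 3 sketch, using a 10-byte payload made up here to match Shashank's example: replacing the second byte needs no encoding at all, while asking "how many characters?" does.

```python
raw = b"48\xc2\xb0 13' N"              # 10 bytes, contents invented for illustration

# Byte-level edit: replace the 2nd byte. No encoding involved.
patched = bytearray(raw)
patched[1] = ord("9")
patched = bytes(patched)

# Text-level question: only answerable once an encoding is chosen.
n_bytes = len(raw)
n_chars_utf8 = len(raw.decode("utf-8"))         # degree sign counts once
n_chars_latin1 = len(raw.decode("iso-8859-1"))  # every byte counts as a character

print(n_bytes, n_chars_utf8, n_chars_latin1)
```

Had the byte being replaced been the middle of the two-byte degree sign, the edit would still "work" at the byte level but the result would no longer be valid UTF-8, which is where the encoding does start to matter.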
Unicode question
I am using Python 2.4 on Ubuntu Dapper and working through Dive Into Python. There are a couple of inconsistencies. Firstly,

    sys.setdefaultencoding('iso-8859-1')

does not work; I have to do

    sys.setdefaultencoding = 'iso-8859-1'

Secondly, the following does not give a 'UnicodeError: ASCII encoding error:', and I would expect it to. In fact it prints out the n with ~ above it fine:

    sys.setdefaultencoding = 'ascii'
    s = u'La Pe\xf1a'
    print s

Any insight?

Ben
Re: Unicode question
Ben Edwards (lists) [EMAIL PROTECTED] wrote: I am using python 2.4 on Ubuntu dapper, I am working through Dive into Python. ... Any insight? Ben Did you follow all the instructions, or did you try to call sys.setdefaultencoding interactively? See: http://diveintopython.org/xml_processing/unicode.html#kgp.unicode.4.1 hope this helps, max -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode question
Ben Edwards (lists) wrote:
> I am using python 2.4 on Ubuntu dapper, I am working through Dive into Python. There are a couple of inconsistencies. Firstly sys.setdefaultencoding('iso-8859-1') does not work, I have to do sys.setdefaultencoding = 'iso-8859-1'

When you run a Python script, the interpreter does some of its own stuff before executing your script. One of the things it does is to delete the name sys.setdefaultencoding. This means that by the time even your first line of code runs that name no longer exists and so you will be unable to invoke the function as in your first attempt.

The second attempt,

    sys.setdefaultencoding = 'iso-8859-1'

is creating a new name under the sys namespace and assigning it a string. This will not have the desired effect, or probably any effect at all.

I have found that in order to change the default encoding with that function, you can put the command in a file called sitecustomize.py which, when placed in the appropriate location (which is platform-dependent), will be called in time to have the desired effect. So the order of events is something like:

1. Invoke Python on myscript.py.
2. Python does some stuff and then executes sitecustomize.py.
3. Python deletes the name sys.setdefaultencoding, thereby making the function that was so-named inaccessible.
4. Python then begins executing myscript.py.

Regarding the location of sitecustomize.py, on Windows it is C:\Python24\Lib\sitecustomize.py. My guess is that you should put it in the same directory as the bulk of the Python standard library files. (Also in that directory is a subdirectory called site-packages, where you can put custom modules that will be available for import from any of your scripts.)
Re: Unicode question
Ben Edwards (lists) wrote:
> Firstly sys.setdefaultencoding('iso-8859-1') does not work, I have to do sys.setdefaultencoding = 'iso-8859-1'

That works, but has no effect. You bind the variable sys.setdefaultencoding to some value, but that value is never used for anything (do sys.getdefaultencoding() to see what I mean). You could just as well write

    sys.standardkodierung = 'iso-8859-1'

> secondly the following does not give a 'UnicodeError: ASCII encoding error:', and I would expect it to. In fact it prints out the n with ~ above it fine:
>
>     sys.setdefaultencoding = 'ascii'
>     s = u'La Pe\xf1a'
>     print s
>
> Any insight?

The print statement uses sys.stdout.encoding, not the default encoding.

Regards, Martin
-- http://mail.python.org/mailman/listinfo/python-list
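Martin's point can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the thread itself uses Python 2.4, where the same check applies): assigning a string to the name sys.setdefaultencoding just creates an attribute on the module, and sys.getdefaultencoding() reports the same value before and after.

```python
import sys

# Record the default encoding before touching anything.
before = sys.getdefaultencoding()

# This only binds a string to a name on the sys module; nothing reads it.
sys.setdefaultencoding = 'iso-8859-1'

# The interpreter's actual default encoding is unchanged.
after = sys.getdefaultencoding()

print(before, after)
```

The two printed values are identical, which is what "works, but has no effect" means in practice.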
[OT] Re: a unicode question?
John Machin wrote:
> ... and yes Peter, info travels faster also from China than it does from Armenia :-())

Q: Can info travel faster from Armenia than from China?
Radio Yerevan: In principle, yes. Just make sure that it doesn't go the other way round the globe or meets some friends on the way...
-- http://mail.python.org/mailman/listinfo/python-list
Re: a unicode question?
[EMAIL PROTECTED] wrote:
> Mr. John Machin
> This question comes from the following code. I use PyXml to build a DOM tree.
>
>     from xml.dom.ext.reader import HtmlLib
>     doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
>     title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
>     title_string = title_elem.firstChild.data
>     print title_string
>
> # the title_string is unicode, but it is not latin1 code, so I want to change it.

Errr, but the title of the page is written in Chinese and it is not supposed to be crammed into latin1 encoding. What are you trying to do with the string after you squeezed Chinese into latin1?
-- http://mail.python.org/mailman/listinfo/python-list
Re: a unicode question?
E, it gets worse: not only is the title written in Chinese, it is encoded as gb2312 -- here is the repr() of the first few chunks:

    "<html>\n<head>\n<title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

and here is what you get after that_guff.decode('gb2312'):

    u"<html>\n<head>\n<title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n"

The first 2 characters of the title are recognisable both visually on the browser title and in the unicode as "zhong guo", i.e. China.

BUT the OP's first message is interpreting that gb2312-encoded stuff as Unicode:

    s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge :-)

... and yes Peter, info travels faster also from China than it does from Armenia :-())
-- http://mail.python.org/mailman/listinfo/python-list
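To make the diagnosis concrete, here is a small sketch in Python 3 (an assumption; the thread uses Python 2, where the byte string would be a plain str). It decodes the title bytes as gb2312, then shows how the OP's mis-decoded s1 can be turned back into bytes via latin1 and decoded properly. The variable names are illustrative only.

```python
# The raw title bytes as they arrive from the page (gb2312-encoded).
raw = b'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

# Decoding with the encoding declared in the page gives real Chinese text.
title = raw.decode('gb2312')

# What the OP had: each byte wrongly mapped to the code point of the same value.
s1 = raw.decode('latin1')

# latin1 maps code points 0-255 back to the same byte values, so the
# damage is reversible: re-encode as latin1, then decode as gb2312.
recovered = s1.encode('latin1').decode('gb2312')

print(ascii(title))
print(recovered == title)
```

The latin1 step is only a way of getting the original bytes back; it says nothing about the text actually being Latin-1.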
a unicode question?
Hello,

There is a unicode string that I want to change to an ansi string, but it raises an exception. Could you help me?

    ## I want to change s1 to s2.
    s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
    s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
-- http://mail.python.org/mailman/listinfo/python-list
Re: a unicode question?
What do you mean by "ansi string"?

Here is a superficially not-unreasonable answer to your more specific question:

    # >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
    # >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
    # >>> s3 = s1.encode('latin1')
    # >>> s2 == s3
    # True

But what are you really trying to achieve? Where does your Unicode data come from? What ranges of characters do you expect it to contain? You need to crunch it into an 8-bit representation because ... what?
-- http://mail.python.org/mailman/listinfo/python-list
Re: a unicode question?
Mr. John Machin, Thank you very much! -- http://mail.python.org/mailman/listinfo/python-list
Re: a unicode question?
Mr. John Machin

This question comes from the following code. I use PyXml to build a DOM tree.

    from xml.dom.ext.reader import HtmlLib
    doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
    title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
    title_string = title_elem.firstChild.data
    print title_string

# the title_string is unicode, but it is not latin1 code, so I want to change it.
-- http://mail.python.org/mailman/listinfo/python-list
Unicode question : turn "José" into u"José"
This is probably stupid and/or misguided but supposing I'm passed a byte-string value that I want to be unicode, this is what I do. I'm sure I'm missing something very important.

Short version :

    >>> s = "José" #Start with non-unicode string
    >>> unicoded = eval("u'%s'" % "José")

Long version :

    >>> s = "José" #Start with non-unicode string
    >>> s #Lets look at it
    'Jos\xe9'
    >>> escaped = s.encode('string_escape')
    >>> escaped
    'Jos\\xe9'
    >>> unicoded = eval("u'%s'" % escaped)
    >>> unicoded
    u'Jos\xe9'
    >>> test = u"José" #What they should have passed me
    >>> test == unicoded #Am I really getting the same thing?
    True #Yay!
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode question : turn "José" into u"José"
First of all, if you run this on the console, find out your console's encoding. In my case it is English Windows XP. It uses 'cp437'.

    C:\>chcp
    Active code page: 437

Then

    >>> s = "José"
    >>> u = u"Jos\u00e9" # same thing in unicode escape
    >>> s.decode('cp437') == u # use the encoding that matches your console
    True

wy

> This is probably stupid and/or misguided but supposing I'm passed a byte-string value that I want to be unicode, this is what I do. I'm sure I'm missing something very important.
>
> Short version :
>
>     >>> s = "José" #Start with non-unicode string
>     >>> unicoded = eval("u'%s'" % "José")
>
> Long version :
>
>     >>> s = "José" #Start with non-unicode string
>     >>> s #Lets look at it
>     'Jos\xe9'
>     >>> escaped = s.encode('string_escape')
>     >>> escaped
>     'Jos\\xe9'
>     >>> unicoded = eval("u'%s'" % escaped)
>     >>> unicoded
>     u'Jos\xe9'
>     >>> test = u"José" #What they should have passed me
>     >>> test == unicoded #Am I really getting the same thing?
>     True #Yay!
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode question : turn "José" into u"José"
maybe a bit off topic, but how does one find the console's encoding from within python? -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode question : turn "José" into u"José"
The most important thing that you are missing is that you need to know the encoding used for the 8-bit-character string. Let's guess that it's Latin1. Then all you have to do is use the unicode() builtin function, or the string decode method.

    # >>> s = 'Jos\xe9'
    # >>> s
    # 'Jos\xe9'
    # >>> u = unicode(s, 'latin1')
    # >>> u
    # u'Jos\xe9'
    # >>> u2 = s.decode('latin1')
    # >>> u2
    # u'Jos\xe9'

Other important things:
(1) Using eval() is not usually the best way to do things.
(2) If your code is not entirely in ASCII, put a coding declaration at the top of the source file.
-- http://mail.python.org/mailman/listinfo/python-list
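Why the encoding has to be known, rather than guessed, can be sketched as follows. This is Python 3 (an assumption; in the thread's Python 2 the byte strings would be plain str and the text would be unicode). The same name is two different byte sequences under Latin-1 and under the cp437 console code page mentioned elsewhere in the thread.

```python
# The byte string from the question: 0xe9 is e-acute in Latin-1.
data = b'Jos\xe9'

# Decoding with the right encoding yields the intended text.
as_latin1 = data.decode('latin1')

# The same text typed at a cp437 console is a different byte sequence.
console_bytes = as_latin1.encode('cp437')

# Each byte string round-trips only with its own encoding.
same_text = console_bytes.decode('cp437') == as_latin1

print(ascii(as_latin1), console_bytes, same_text)
```

Neither byte string carries any marker saying which encoding produced it, which is why the decode call has to be told.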
Re: Unicode question : turn "José" into u"José"
ianaré wrote:
> maybe a bit off topic, but how does one find the console's encoding from within python?

    In [1]: import sys

    In [3]: sys.stdout.encoding
    Out[3]: 'cp437'

    In [4]: sys.stdin.encoding
    Out[4]: 'cp437'

Kent
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode question : turn "José" into u"José"
Ian Sparks [EMAIL PROTECTED] writes:
> This is probably stupid and/or misguided but supposing I'm passed a byte-string value that I want to be unicode, this is what I do. I'm sure I'm missing something very important.

Perhaps you need to read one of the good Python Unicode tutorials, such as:

    <URL:http://effbot.org/zone/unicode-objects.htm>

> Short version :
>
>     s = "José" #Start with non-unicode string

In what encoding? Once you step outside the ASCII character set, you *must* be explicit about the encoding used for the text. Because there is no sure way to infer it, Python refuses to guess.

If you're going to include literal non-ASCII characters in the code (which is the simplest and most readable way), you must also tell Python what encoding to use when it reads the source file.

    <URL:http://docs.python.org/ref/encodings.html>

>     unicoded = eval("u'%s'" % "José")

Once you know the encoding, you can simply say::

    str_encoding = "iso-8859-1"
    str = "José"
    unicode_str = str.decode(str_encoding)

(Note that I didn't type this using the iso-8859-1 encoding, so it's likely to be wrong in that respect; you'll need to change it to match your situation.)

--
"To me, boxing is like a ballet, except there's no music, no choreography, and the dancers hit each other." -- Jack Handey
Ben Finney
-- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Edward Loper wrote:
> Walter Dörwald wrote:
>> Edward Loper wrote:
>>> [...] Surely there's a better way than converting back and forth 3 times? Is there a reason that the 'backslashreplace' error mode can't be used with codecs.decode?
>>>
>>>     >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>>>     Traceback (most recent call last):
>>>       File "<stdin>", line 1, in ?
>>>     TypeError: don't know how to handle UnicodeDecodeError in error callback
>>
>> The backslashreplace error handler is an *error* *handler*, i.e. it gives you a replacement text if an input character can't be encoded. But a backslash character in an 8bit string is no error, so it won't get replaced on decoding.
>
> I'm not sure I follow exactly -- the input string I gave as an example did not contain any backslash characters. Unless by "backslash character" you mean a character c such that ord(c)>127. I guess it depends on which class of errors you think the error handler should be handling. :) The codec system's pretty complex, so I'm willing to accept on faith that there may be a good reason to have error handlers only make replacements in the encode direction, and not in the decode direction.

Both directions are completely non-symmetric. On encoding an error can only happen when the character is unencodable (e.g. for charmap codecs anything outside the set of 256 characters). On decoding an error means that the byte stream violates the internal format of the encoding. But a 0x5c byte (i.e. a backslash) in e.g. a latin-1 byte sequence doesn't violate the internal format of the latin-1 encoding (nothing does), so the error handler never kicks in.

>> What you want is a different codec (try e.g. string-escape or unicode-escape).
>
> This is very close, but unfortunately won't quite work for my purposes, because it also puts backslashes before ' and \\ and maybe a few other characters. :-/

OK, seems you're stuck with your decode/encode/decode call.

>     >>> print "test: '\xff'".encode('string-escape').decode('ascii')
>     test: \'\xff\'
>     >>> print do_what_i_want("test: '\xff'")
>     test: '\xff'
>
> I think I'll just have to stick with rolling my own.

Bye, Walter Dörwald
-- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Edward Loper wrote:
> [...] Surely there's a better way than converting back and forth 3 times? Is there a reason that the 'backslashreplace' error mode can't be used with codecs.decode?
>
>     >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     TypeError: don't know how to handle UnicodeDecodeError in error callback

The backslashreplace error handler is an *error* *handler*, i.e. it gives you a replacement text if an input character can't be encoded. But a backslash character in an 8bit string is no error, so it won't get replaced on decoding.

What you want is a different codec (try e.g. string-escape or unicode-escape).

Bye, Walter Dörwald
-- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Walter Dörwald wrote:
> Edward Loper wrote:
>> [...] Surely there's a better way than converting back and forth 3 times? Is there a reason that the 'backslashreplace' error mode can't be used with codecs.decode?
>>
>>     >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>>     Traceback (most recent call last):
>>       File "<stdin>", line 1, in ?
>>     TypeError: don't know how to handle UnicodeDecodeError in error callback
>
> The backslashreplace error handler is an *error* *handler*, i.e. it gives you a replacement text if an input character can't be encoded. But a backslash character in an 8bit string is no error, so it won't get replaced on decoding.

I'm not sure I follow exactly -- the input string I gave as an example did not contain any backslash characters. Unless by "backslash character" you mean a character c such that ord(c)>127. I guess it depends on which class of errors you think the error handler should be handling. :) The codec system's pretty complex, so I'm willing to accept on faith that there may be a good reason to have error handlers only make replacements in the encode direction, and not in the decode direction.

> What you want is a different codec (try e.g. string-escape or unicode-escape).

This is very close, but unfortunately won't quite work for my purposes, because it also puts backslashes before ' and \\ and maybe a few other characters. :-/

    >>> print "test: '\xff'".encode('string-escape').decode('ascii')
    test: \'\xff\'
    >>> print do_what_i_want("test: '\xff'")
    test: '\xff'

I think I'll just have to stick with rolling my own.

-Edward
-- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Edward Loper [EMAIL PROTECTED] wrote:
> I would like to convert an 8-bit string (i.e., a str) into unicode, treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff into a backslashed escape sequences. I.e., I want something like this:
>
>     >>> decode_with_backslashreplace('abc \xff\xe8 def')
>     u'abc \\xff\\xe8 def'
>
> The best I could come up with was:
>
>     def decode_with_backslashreplace(s):
>         "str -> unicode"
>         return (s.decode('latin1')
>                 .encode('ascii', 'backslashreplace')
>                 .decode('ascii'))
>
> Surely there's a better way than converting back and forth 3 times?

I didn't check whether this was faster, although I rather suspect it is not:

    # '%02x' gives two bare hex digits; hex() would add its own '0x' prefix.
    cvt = lambda x: ord(x) < 0x80 and x or '\\x%02x' % ord(x)

    def decode_with_backslashreplace(s):
        # Escape every non-ASCII byte, then convert the pure-ASCII result.
        return unicode(''.join(map(cvt, s)))

--
- Tim Roberts, [EMAIL PROTECTED]
Providenza & Boekelheide, Inc.
-- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Edward Loper wrote:
> I would like to convert an 8-bit string (i.e., a str) into unicode, treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff into a backslashed escape sequences. I.e., I want something like this:
>
>     >>> decode_with_backslashreplace('abc \xff\xe8 def')
>     u'abc \\xff\\xe8 def'

    >>> s='abc \xff\xe8 def'
    >>> s.encode('string_escape')
    'abc \\xff\\xe8 def'
    >>> unicode(s.encode('string_escape'))
    u'abc \\xff\\xe8 def'

Kent
-- http://mail.python.org/mailman/listinfo/python-list
unicode question
I would like to convert an 8-bit string (i.e., a str) into unicode, treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff into a backslashed escape sequences. I.e., I want something like this:

    >>> decode_with_backslashreplace('abc \xff\xe8 def')
    u'abc \\xff\\xe8 def'

The best I could come up with was:

    def decode_with_backslashreplace(s):
        "str -> unicode"
        return (s.decode('latin1')
                .encode('ascii', 'backslashreplace')
                .decode('ascii'))

Surely there's a better way than converting back and forth 3 times? Is there a reason that the 'backslashreplace' error mode can't be used with codecs.decode?

    >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: don't know how to handle UnicodeDecodeError in error callback

-Edward
-- http://mail.python.org/mailman/listinfo/python-list
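For readers on a current interpreter: the behaviour asked for here did eventually arrive. The sketch below assumes Python 3.5 or later, where the 'backslashreplace' error handler is also accepted when decoding; it does not apply to the Python 2 versions discussed in this thread. The function name is reused from the question.

```python
def decode_with_backslashreplace(data):
    """bytes -> str: ASCII bytes pass through, others become \\xNN escapes."""
    # On Python 3.5+ the handler replaces each undecodable byte with a
    # literal backslash escape instead of raising UnicodeDecodeError.
    return data.decode('ascii', 'backslashreplace')


result = decode_with_backslashreplace(b'abc \xff\xe8 def')
quoted = decode_with_backslashreplace(b"test: '\xff'")

print(result)
print(quoted)
```

Because only undecodable bytes are touched, quote characters and existing backslashes are left alone, which is the property that string-escape lacked.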
Re: Unicode Question
David Pratt wrote: This is not working for me. Can someone explain why. Many thanks. Because '\xbe' isn't UTF-8 for the character you want, '\xc2\xbe' is, as you just showed yourself in the code snippet. -- Erik Max Francis [EMAIL PROTECTED] http://www.alcyone.com/max/ San Jose, CA, USA 37 20 N 121 53 W AIM erikmaxfrancis Where are they? -- Enrico Fermi, 1901-1954 -- http://mail.python.org/mailman/listinfo/python-list
Unicode Question
Hi. I am working through some tutorials on unicode and am hoping that someone can help explain this for me. I am on mac platform using python 2.4.1 at the moment. I am experimenting with unicode with the 3/4 symbol.

I want to prepare strings for db storage that come from a normal Windows machine (cp1252), so my understanding is that I should convert to unicode and encode to utf-8 to store them properly. Since the data will be used on the web I would not have to change my encoding when extracting from the database.

This first example I believe simulates this with the 3/4 symbol. Here I want to store '\xc2\xbe' in my database.

    >>> tq = u'\xbe'
    >>> tq_utf = tq.encode('utf8')
    >>> tq, tq_utf
    (u'\xbe', '\xc2\xbe')

To unicode with a variable, my understanding is that I can unicode and encode at the same time:

    >>> tq = '\xbe'
    >>> tq_utf = unicode(tq, 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0: unexpected code byte

This is not working for me. Can someone explain why. Many thanks.

Regards, David
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode Question
The encoding argument to unicode() is used to specify the encoding of the string that you want to translate into unicode. The interpreter stores unicode as unicode, it isn't encoded...

    >>> unicode('\xbe','cp1252')
    u'\xbe'
    >>> unicode('\xbe','cp1252').encode('utf-8')
    '\xc2\xbe'

max
-- http://mail.python.org/mailman/listinfo/python-list
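The two steps Max describes, decode from the source encoding and then encode to the storage encoding, look like this in Python 3 (an assumption; the thread's Python 2 spelling is unicode(s, 'cp1252').encode('utf-8')). The sketch also reproduces the original failure: the lone byte 0xbe is not valid UTF-8.

```python
# One byte as it would arrive from a cp1252 Windows machine: the 3/4 sign.
raw = b'\xbe'

# Step 1: decode, naming the encoding the bytes are actually in.
text = raw.decode('cp1252')

# Step 2: encode to the representation wanted for storage.
stored = text.encode('utf-8')

# Naming the *target* encoding in step 1 fails, as in the original post.
try:
    raw.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True

print(ascii(text), stored, failed)
```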
Re: Unicode Question
Hi Martin. Many thanks for your reply. What I am really after, the following accomplishes.

> If you are looking for "at the same time", perhaps this is also interesting:
>
>     py> unicode('\xbe', 'windows-1252').encode('utf-8')
>     '\xc2\xbe'

Your answer really helped quite a bit to clarify this for me. I am using sqlite3 so it is very happy to have utf-8 encoded unicode. The examples you provided were the additional help I needed. Thank you.

Regards, David
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode Question
Hi Erik. Thank you for your reply. The advice has helped clarify this for me.

Regards, David

Erik Max Francis wrote:
> David Pratt wrote:
>> This is not working for me. Can someone explain why. Many thanks.
>
> Because '\xbe' isn't UTF-8 for the character you want, '\xc2\xbe' is, as you just showed yourself in the code snippet.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode Question
Hi Max. Many thanks for helping to realize where I was missing the point and making this clearer.

Regards, David

Max Erickson wrote:
> The encoding argument to unicode() is used to specify the encoding of the string that you want to translate into unicode. The interpreter stores unicode as unicode, it isn't encoded...
>
>     >>> unicode('\xbe','cp1252')
>     u'\xbe'
>     >>> unicode('\xbe','cp1252').encode('utf-8')
>     '\xc2\xbe'
>
> max
-- http://mail.python.org/mailman/listinfo/python-list
Once again a unicode question
Hello,

I'm puzzled by this test I made while trying to transform a page in html to plain text. I cannot send unicode to feed, nor str, so how can I do this?

    [EMAIL PROTECTED]:~$ python2.4
    Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19)
    [GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import formatter
    >>> import htmllib
    >>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
    >>> html2txt.feed(u'D\xe9but')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
        self.goahead(0)
      File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
        self.handle_data(rawdata[i:j])
      File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
        self.formatter.add_flowing_data(data)
      File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
        self.writer.send_flowing_data(data)
      File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
        write(word)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
    >>> html2txt.feed(u'D\xe9but'.encode('latin1'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
        self.rawdata = self.rawdata + data
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
    >>> html2txt.feed('Début')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
        self.rawdata = self.rawdata + data
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

-- (° Nicolas Évrard / ) Liège - Belgique ^^
-- http://mail.python.org/mailman/listinfo/python-list
Re: Once again a unicode question
Nicolas Evrard wrote: Hello, I'm puzzled by this test I made while trying to transform a page in html to plain text. Because I cannot send unicode to feed, nor str so how can I do this ? Seems like the parser is in the broken state after the first exception. Feed only binary strings to it. Serge. -- http://mail.python.org/mailman/listinfo/python-list
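The htmllib and formatter modules used in this thread no longer exist in current Python, so the advice cannot be tried verbatim there. As a rough modern counterpart, and explicitly a swapped-in technique rather than what Serge describes, the sketch below uses Python 3's html.parser: bytes are decoded once at the boundary and the parser only ever sees text. TextExtractor and html_to_text are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data of a document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called with each run of text found between tags.
        self.chunks.append(data)


def html_to_text(raw, encoding):
    """Decode raw bytes with the given encoding and return the plain text."""
    parser = TextExtractor()
    parser.feed(raw.decode(encoding))
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


page = '<p>D\xe9but</p>'.encode('latin1')
print(html_to_text(page, 'latin1'))
```

A fresh parser is created per document, which also sidesteps the "broken state after the first exception" problem noted in the reply.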
Re: Once again a unicode question
* Serge Orlov [23:45 26/03/05 CET]: Nicolas Evrard wrote: Hello, I'm puzzled by this test I made while trying to transform a page in html to plain text. Because I cannot send unicode to feed, nor str so how can I do this ? Seems like the parser is in the broken state after the first exception. Feed only binary strings to it. That was that thank you very much. -- (° Nicolas Évrard / ) Liège - Belgique ^^ -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
On Tue, 23 Nov 2004 20:37:04 +0100, "Martin v. Löwis" [EMAIL PROTECTED] wrote:
> Steve Holden wrote:
>> Am I the only person who found it scary that Bengt could apparently casually drop on a polynomial that would decode to Löwis?

Well, don't give me too much credit, though I admit enjoying a little unearned flattered-ego buzz ;-)

But it's not a big deal if you had recently implemented an automatic lambda-printer-outer to solve for a polynomial function f such that f(0)==k0, f(1)==k1, .. f(n)==kn. For a single number k0 that will be

    lambda x: k0

and for two numbers k0, k1 will be

    lambda x: k0 + x*(k1-k0)

etc. It's a matter of solving some simultaneous equations for the coefficient values, which I had done in response to a previous thread. For that, I happened to have had some experience from the '60s writing variations on an equation solver (back when we congratulated ourselves on getting all (software-implemented) floating point ops other than divide to execute in under a millisecond ;-)

Here I was using an exact decimal module I happened to have (also built in response to previous thread discussion ;-), so I didn't even have to look for maximum abs pivot elements in the matrix for this one. And it didn't have to be fast. So it was kind of a fun exercise.

But anyway, it was all ready to go at this point, so all I had to do was run coeffsx.py with the character ord values as args on the command line. The opportunity to use it in a fun way to fake casual wizardry was just dumb luck ;-)

> I'm not scared, but honored, of course.

A bit late responding, but I couldn't think of a clever followup to that ;-) But Just to play fair, print ''.join([chr((lambda x: ( -6244372133*x**31 +3013910052086*x**30 -695396351572920*x**29 +102105752307741620*x**28 -10715303804974659632*x**27 +855734314951919397204*x**26 -54067713339116101354860*x**25 +2774121296568607137441900*x**24 -117725625258165396333623970*x**23 +4187405270602160539007125440*x**22 -126060225187601954901807327900*x**21 +3234908736910295469078183101700*x**20 -71121878980966418114205095297640*x**19 +1344268902923717571167117226451980*x**18 -21886601404074660751245403749948900*x**17 +307180698948793841846368910776059300*x**16 -3714719218772170154406066269371644945*x**15 +38641327091060849304069885597725238090*x**14 -344757809926306996671359721670334393500*x**13 +2627069115710241704477921121071756668600*x**12 -16998869426095431823754237370045113150352*x**11 +92697362475995606001274610327169882407584*x**10 -421837211162827653880286870838716820642880*x**9 +1581695033356657201434736494281105646218880*x**8 -4805817748883837636614530805204695373091328*x**7 +11572394080794032785251889126742747327087616*x**6 -2141782094441901308037452513456003159040*x**5 +29141767437911436346798089144038222112768000*x**4 -2718608642882609434610843144764478140416*x**3 +1533994355659295223664305312404777140224*x**2 -388225373807829537910251710026682204160*x +23023948231698183889631576064000) /274094621805930760590852096000 )(x)) for x in xrange(32)]) Not-ready-to-be-mythologized-though-plenty-flatterable-ly y'rs Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
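The idea of a polynomial f with f(0)==k0, f(1)==k1, ... can be sketched without solving simultaneous equations. The code below is not Bengt's coeffsx.py (which is not shown in the thread); it swaps in Lagrange interpolation and uses fractions.Fraction for exact arithmetic, in Python 3. The name interpolate is made up for this illustration.

```python
from fractions import Fraction


def interpolate(values):
    """Return f such that f(i) == values[i] for i in range(len(values))."""
    n = len(values)

    def f(x):
        total = Fraction(0)
        for i, k in enumerate(values):
            # Lagrange basis term: equals k at x == i, 0 at every other node.
            term = Fraction(k)
            for j in range(n):
                if j != i:
                    term *= Fraction(x - j, i - j)
            total += term
        return total

    return f


# Encode a short string as a polynomial over its character ordinals.
word = 'L\xf6wis'
f = interpolate([ord(c) for c in word])
decoded = ''.join(chr(int(f(x))) for x in range(len(word)))
print(decoded)
```

With two values the result reduces to the k0 + x*(k1-k0) form given in the message; exact rationals play the role of the exact decimal module, so no pivoting or rounding concerns arise.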