subject:"Unicode Question"

Re: unicode question

2015-01-28 Thread Albert-Jan Roskam



On Wed, Jan 28, 2015 8:21 AM CET Terry Reedy wrote:

On 1/27/2015 12:17 AM, Rehab Habeeb wrote:
 Hi there python staff
 does python support arabic language for texts ? and what to do if it
 support it?
 i wrote hello in Arabic using codeskulptor and the powershell just for
 testing and the same error appeared (a syntax error in unicode)!!

I do not know how complete the support is, but this is copied from 3.4.2, 
which uses tcl/tk 8.6.
>>> t = "الحركات"
>>> for c in t: print(c)  # Prints rightmost char above first
...
ا
ل
ح
ر
ك
ا
ت

Wow, I never knew this was so clever. Is that with or without an RTL marker?


The following StackOverflow question and response indicate that there may be
more issues, but it was asked before tcl/tk 8.6 was available, so the answer
may be partially obsolete.


-- Terry Jan Reedy


-- https://mail.python.org/mailman/listinfo/python-list


Re: unicode question

2015-01-28 Thread Michael Torrie

On 01/28/2015 03:17 PM, Albert-Jan Roskam wrote:
 I do not know how complete the support is, but this is copied from 3.4.2, 
 which uses tcl/tk 8.6.
 >>> t = "الحركات"
 >>> for c in t: print(c)  # Prints rightmost char above first
 ا
 ل
 ح
 ر
 ك
 ا
 ت
 
 Wow, I never knew this was so clever. Is that with or without an RTL marker?

I don't think this has anything to do with Python. Python is simply
spitting out unicode characters as it sees them, starting at string
position 0 and working to the end.  The magic is done by whatever is
displaying the utf-8 output from Python.  If I copy this text to the
clipboard,

t = "hi there, الحركات!"

and paste it in my terminal (say to Python's shell), which is not BIDI
aware, I get the Arabic letters in reverse order. I tried to paste it
here but no matter what I do thunderbird goes into BIDI mode and makes
them appear right.
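
A minimal sketch of the point above, assuming Python 3: the string is stored and iterated in logical (typing) order, and the right-to-left layout is left to whatever displays it. The helper name `logical_order` is made up for this illustration.

```python
import unicodedata


def logical_order(text):
    """Return (character, bidi class) pairs in storage order, index 0 first."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# "hi " followed by three Arabic letters (alef, lam, hah), written as escapes
# so that the source file itself contains no right-to-left text.
sample = "hi \u0627\u0644\u062d"

for ch, bidi in logical_order(sample):
    # 'L' is left-to-right, 'AL' is an Arabic letter, 'WS' is whitespace.
    print("U+%04X %s" % (ord(ch), bidi))
```

Python only reports the bidi class of each character; reordering for display is the job of the terminal, editor, or mail client.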


Re: unicode question

2015-01-27 Thread random832

On Tue, Jan 27, 2015, at 12:25, Mark Lawrence wrote:
 People might find this http://bugs.python.org/issue1602 and hence this 
 https://github.com/Drekin/win-unicode-console useful.  The latter is 
 available on pypi.

However, Arabic is one of those scripts that runs up against the real
limitations of the windows console. At least on non-Arabic versions of
Windows, you'll just get a sequence of boxes, and it won't do any
bidirectional processing either. I have no idea what, if anything, it
would do differently on Arabic versions of Windows.
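
One way to sidestep the console entirely, sketched here under the assumption that the goal is to store or exchange Arabic text rather than to display it in the console window: write it to a UTF-8 file. The helper names and the file name are hypothetical.

```python
import os
import sys
import tempfile


def save_utf8(text, path):
    """Write text to path using an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_utf8(path):
    """Read text back from path, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return f.read()


greeting = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic greeting, as escapes
path = os.path.join(tempfile.gettempdir(), "greeting_demo.txt")
save_utf8(greeting, path)

# The console's own encoding is platform dependent; the file is not.
print("console encoding:", sys.stdout.encoding)
print("round trip ok:", load_utf8(path) == greeting)
```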

Re: unicode question

2015-01-27 Thread Terry Reedy


On 1/27/2015 12:17 AM, Rehab Habeeb wrote:

Hi there python staff
does python support arabic language for texts ? and what to do if it
support it?
i wrote hello in Arabic using codeskulptor and the powershell just for
testing and the same error appeared (a syntax error in unicode)!!


I do not know how complete the support is, but this is copied from 
3.4.2, which uses tcl/tk 8.6.

>>> t = "الحركات"
>>> for c in t: print(c)  # Prints rightmost char above first
...
ا
ل
ح
ر
ك
ا
ت

The following StackOverflow question and response indicate that there
may be more issues, but it was asked before tcl/tk 8.6 was available, so
the answer may be partially obsolete.



--
Terry Jan Reedy



Re: unicode question

2015-01-27 Thread random832

On Tue, Jan 27, 2015, at 00:17, Rehab Habeeb wrote:
 Hi there python staff
 does python support arabic language for texts ? and what to do if it
 support it?
 i wrote hello in Arabic using codeskulptor and the powershell just for
 testing and the same error appeared (a syntax error in unicode)!!

Python itself supports arabic just fine, but the MS Windows console in
general, and Python's implementation of it in particular, have poor
support for many aspects of unicode, so it's important to define exactly
what you are trying to do.

Re: unicode question

2015-01-27 Thread Mark Lawrence


On 27/01/2015 16:13, random...@fastmail.us wrote:

On Tue, Jan 27, 2015, at 00:17, Rehab Habeeb wrote:

Hi there python staff
does python support arabic language for texts ? and what to do if it
support it?
i wrote hello in Arabic using codeskulptor and the powershell just for
testing and the same error appeared (a syntax error in unicode)!!


Python itself supports arabic just fine, but the MS Windows console in
general, and Python's implementation of it in particular, have poor
support for many aspects of unicode, so it's important to define exactly
what you are trying to do.



People might find this http://bugs.python.org/issue1602 and hence this 
https://github.com/Drekin/win-unicode-console useful.  The latter is 
available on pypi.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: unicode question

2015-01-26 Thread Chris Angelico

On Tue, Jan 27, 2015 at 4:17 PM, Rehab Habeeb
moonlight06082...@gmail.com wrote:
 Hi there python staff
 does python support arabic language for texts ? and what to do if it support
 it?
 i wrote hello in Arabic using codeskulptor and the powershell just for
 testing and the same error appeared (a syntax error in unicode)!!

If you're using Python 3, you have very good support for non-ASCII
text, including Arabic. In Python 2, you can work with Unicode data,
but your variable/function names all have to be in ASCII.
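
A small sketch of what that support looks like in Python 3. The sample word is written with escapes so the snippet does not depend on the editor's encoding; the variable names are illustrative only.

```python
# An Arabic greeting: meem, reh, hah, beh, alef.
greeting = "\u0645\u0631\u062d\u0628\u0627"

# A Python 3 str is a sequence of code points, so len() counts letters.
print(len(greeting))            # 5

# Arabic letters are also legal in Python 3 identifiers.
print(greeting.isidentifier())  # True

# Bytes only appear once an encoding is chosen explicitly.
encoded = greeting.encode("utf-8")
print(len(encoded))             # 10, two bytes per letter in UTF-8
print(encoded.decode("utf-8") == greeting)
```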

What was your code, and what was the error? Copy and paste them into
the email, and we'll be better able to help you.

ChrisA

unicode question

2015-01-26 Thread Rehab Habeeb

Hi there python staff
does python support arabic language for texts ? and what to do if it
support it?
i wrote hello in Arabic using codeskulptor and the powershell just for
testing and the same error appeared (a syntax error in unicode)!!

Beginner python 3 unicode question

2013-11-16 Thread Laszlo Nagy


Example interactive:

$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> import base64
>>> base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
b'zsz653co6ii6hgjejqhw42ncgy'


But when I put the same thing into a source file I get this:

Traceback (most recent call last):
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/widget.py", line 94, in __init__
    self.eid = uniqueid()
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/__init__.py", line 34, in uniqueid
    base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
TypeError: Can't convert 'bytes' object to str implicitly


Why is it behaving differently on the command line? What should I do to 
fix this?
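
For reference, a hedged sketch of how that TypeError typically arises in Python 3: `base64.b32encode` returns `bytes`, and mixing that with `str` fails. The body of `uniqueid` below is an assumption (the original function was not posted); only its name comes from the traceback.

```python
import base64
import uuid


def uniqueid():
    """Return a 26-character lowercase id as str (hypothetical body)."""
    raw = base64.b32encode(uuid.uuid1().bytes)[:-6].lower()  # bytes in Python 3
    # Base32 output is pure ASCII, so decoding it is always safe.
    return raw.decode("ascii")


# With a str result, concatenation with other text works.
print("widget-" + uniqueid())

# Without the decode step, the same concatenation raises TypeError.
try:
    "widget-" + base64.b32encode(b"demo")
except TypeError as exc:
    print("TypeError:", exc)
```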



--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


Re: Beginner python 3 unicode question

2013-11-16 Thread Luuk


On 16-11-2013 20:12, Laszlo Nagy wrote:

Example interactive:

$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uuid
>>> import base64
>>> base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
b'zsz653co6ii6hgjejqhw42ncgy'
 

But when I put the same thing into a source file I get this:

Traceback (most recent call last):
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/widget.py", line 94, in __init__
    self.eid = uniqueid()
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/ui/__init__.py", line 34, in uniqueid
    base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
TypeError: Can't convert 'bytes' object to str implicitly


Why is it behaving differently on the command line? What should I do to
fix this?




the error is in one of the lines you did not copy here

because this works without problems:
BEGIN-of script
#!/usr/bin/python

import uuid
import base64
print base64.b32encode(uuid.uuid1().bytes)[:-6].lower()
END-of script

But, i need to say, i'm also a beginner ;)

Re: Beginner python 3 unicode question

2013-11-16 Thread Laszlo Nagy




the error is in one of the lines you did not copy here

because this works without problems:
BEGIN-of script
#!/usr/bin/python

Most probably, your /usr/bin/python program is python version 2, and not 
python version 3


Try the same program with /usr/bin/python3. And also try the interactive 
mode with the same program and I think you will see the same phenomenon.



Re: Beginner python 3 unicode question

2013-11-16 Thread Laszlo Nagy



Why is it behaving differently on the command line? What should I do 
to fix this?


I was experimenting with this a bit more and found some more confusing 
things. Can somebody please enlighten me?


Here is a test function:


def password_hash(self,password):
    public = bytearray([random.randint(0,255) for _ in range(5)])
    private = bytearray([random.randint(0,255)])
    pwd = bytearray(password.encode())
    digest = hashlib.sha1(public+pwd+private).digest()
    print("digest",digest,type(digest))
    print("de",digest.encode())
    # and some more stuff here...

This function was called inside a script, and gave me this:

('digest', '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b', <type 'str'>)

Traceback (most recent call last):
  File "/home/gandalf/Python/Lib/shopzeus/scripts/yaaf_pwmgr.py", line 478, in <module>
    pwmgr.run(parser,args)
  File "/home/gandalf/Python/Lib/shopzeus/scripts/yaaf_pwmgr.py", line 241, in run
    self.authdb.user_create(name,password,propvalues)
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/db/authdb.py", line 205, in user_create
    "password":(password and Binary(self.password_hash(password))) or None,
  File "/home/gandalf/Python/Lib/shopzeus/yaaf/db/authdb.py", line 134, in password_hash
    print("de",digest.encode())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)


Then I have tried the very same thing from the interactive shell:

gandalf@gandalf-HP-G62-Notebook-PC:~/Python/Projects/appserver$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> digest = '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b'
>>> digest.encode()
b'\xc2\xa0\xc2\x98\xc2\x8b\xc3\xbf\x04\xc3\xb9V;\xc2\xbd\x1eIHzh\x10-\xc3\x85!\x14\x1b'



WHAT??? Seems like the default value of the encoding parameter of the 
str.encode method is different if I start it interactively. But this 
contradicts its documentation:


>>> print(digest.encode.__doc__)
S.encode(encoding='utf-8', errors='strict') -> bytes

Encode S using the codec registered for encoding. Default encoding
is 'utf-8'. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that can handle UnicodeEncodeErrors.


So is the default utf-8 or not? Should the documentation be updated? Or 
do we have a bug in the interactive shell?
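
For reference, a short sketch of the Python 3 side of this, assuming Python 3.3 or later. With no argument, `str.encode()` does use UTF-8, which is exactly what the interactive session shows: each code point from U+0080 to U+00FF becomes two bytes. A real SHA-1 digest in Python 3 is already `bytes` and has no `.encode()` method at all. The three-byte `digest` below is just the start of the value from the session.

```python
import binascii

digest = b"\xa0\x98\x8b"          # first three bytes of the digest shown above

# Mapping each byte to the code point of the same value (Latin-1) gives the
# str that was typed into the interactive session.
text = digest.decode("latin-1")

print(text.encode())              # b'\xc2\xa0\xc2\x98\xc2\x8b' (UTF-8 default)
print(text.encode("latin-1"))     # b'\xa0\x98\x8b', the original bytes again

# Python 3 bytes objects have decode() but no encode().
print(hasattr(digest, "encode"))  # False

# To turn a digest into printable text, hex is the usual choice.
print(binascii.hexlify(digest).decode("ascii"))  # a0988b
```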





Re: Beginner python 3 unicode question

2013-11-16 Thread Luuk


On 16-11-2013 21:57, Laszlo Nagy wrote:



the error is in one of the lines you did not copy here

because this works without problems:
BEGIN-of script
#!/usr/bin/python


Most probably, your /usr/bin/python program is python version 2, and not
python version 3

Try the same program with /usr/bin/python3. And also try the interactive
mode with the same program and I think you will see the same phenomenon.



adding some '()' helped:
BEGIN-of script
#!/usr/bin/python3

import uuid
import base64
print (base64.b32encode(uuid.uuid1().bytes)[:-6].lower())
END-of script

~/temp python3 --version
Python 3.3.0


Re: Beginner python 3 unicode question [SOLVED]

2013-11-16 Thread Laszlo Nagy





So is the default utf-8 or not? Should the documentation be updated? 
Or do we have a bug in the interactive shell?


It was my fault, sorry. The other program used os.system at some places, 
and it accidentally used python2 instead of python 3. :-(
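
One way to avoid that kind of mix-up, sketched under the assumption that the child process should run under the same interpreter as the parent: pass `sys.executable` to `subprocess` instead of relying on whatever `python` resolves to in a shell. The helper name is hypothetical, and `capture_output` requires Python 3.7 or later.

```python
import subprocess
import sys


def run_with_same_interpreter(code):
    """Run a code snippet in a child process using the current interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", code],  # no shell, no PATH lookup for "python"
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child reports the same major version as the parent.
print(run_with_same_interpreter("import sys; print(sys.version_info[0])"))
```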



Re: Beginner python 3 unicode question

2013-11-16 Thread Chris Angelico

On Sun, Nov 17, 2013 at 8:19 AM, Laszlo Nagy gand...@shopzeus.com wrote:
 print("digest",digest,type(digest))

 This function was called inside a script, and gave me this:

 ('digest', '\xa0\x98\x8b\xff\x04\xf9V;\xbd\x1eIHzh\x10-\xc5!\x14\x1b', <type 'str'>)


This looks very much like you're running under Python 2. Take care of
which interpreter you're running; that might be because of your
shebang (as Luuk mentioned), or because of what you're typing to
invoke the script; either way, it makes a huge difference. The easiest
solution is probably to invoke the interpreter explicitly:

Interactive mode:
$ python3
Script mode:
$ python3 scriptname.py

But you seem to have something WAY more complex than a single script.
What's the setup? How is Python getting invoked? If your code is
getting imported by something else, no shebang will help you - you
need the other code to be being executed by the other interpreter.
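
A defensive sketch along the same lines: a guard that fails loudly when a module ends up under the wrong interpreter, whether through a shebang, an import, or a subprocess call. The function name `require_python3` is made up for this example, and the code is written so that it also parses under Python 2.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Raise RuntimeError unless running under Python 3 or later."""
    if version_info[0] < 3:
        raise RuntimeError(
            "this code needs Python 3, but is running under %d.%d"
            % (version_info[0], version_info[1])
        )
    return True


# Call this once near the top of the entry-point script or package.
require_python3()
```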

ChrisA

Re: Beginner python 3 unicode question [SOLVED]

2013-11-16 Thread Chris Angelico

On Sun, Nov 17, 2013 at 8:44 AM, Laszlo Nagy gand...@shopzeus.com wrote:


 So is the default utf-8 or not? Should the documentation be updated? Or do
 we have a bug in the interactive shell?

 It was my fault, sorry. The other program used os.system at some places, and
 it accidentally used python2 instead of python 3. :-(

Oh! Didn't see this post before responding. Oh well. Maybe someone
else one day will make use of the other. :)

ChrisA

tkinter unicode question

2010-07-27 Thread jyoung79

Just curious if anyone could shed some light on this?  I'm using 
tkinter, but I can't seem to get certain unicode characters to 
show in the label for Python 3.  

In my test, the label and button will contain the same 3 
characters - a Greek Alpha, a Greek Omega with a circumflex and 
soft breathing accent, and then a Greek Alpha with a soft 
breathing accent.

For Python 2.6, this works great:

# -*- coding: utf-8 -*-
from Tkinter import *
root = Tk()
Label(root, text=u'\u03B1 \u1F66 \u1F00').pack()
Button(root, text=u'\u03B1 \u1F66 \u1F00').pack()
root.mainloop()

However, for Python 3.1.2, the button gets the correct characters, 
but the label only displays the first Greek Alpha character.  
The other 2 characters look like Chinese characters followed by 
an empty box.  Here's the code for Python 3:

# -*- coding: utf-8 -*-
from tkinter import *
root = Tk()
Label(root, text='\u03B1 \u1F66 \u1F00').pack()
Button(root, text='\u03B1 \u1F66 \u1F00').pack()
root.mainloop()

I've done some research and am wondering if it is 
because Python 2.6 comes with tk version 8.5, while Python 3.1.2 
comes with tk version 8.4?  I'm running this on OS X 10.6.4.

Here's a link I found that mentions this same problem:
http://www.mofeel.net/871-comp-lang-python/5879.aspx

If I need to upgrade tk to 8.5, is it best to upgrade it or just
install 'tiles'?  From my readings it looks like upgrading to
8.5 can be a pain due to OS X still pointing back to 8.4.  I
haven't tried it yet in case someone might have an easier
solution.

Thanks for looking at my question.

Jay

Re: tkinter unicode question

2010-07-27 Thread Ned Deily

In article 20100727204532.r7gmz.27213.r...@cdptpa-web20-z02,
 jyoun...@kc.rr.com wrote:
 Just curious if anyone could shed some light on this?  I'm using 
 tkinter, but I can't seem to get certain unicode characters to 
 show in the label for Python 3.  
 
 In my test, the label and button will contain the same 3 
 characters - a Greek Alpha, a Greek Omega with a circumflex and 
 soft breathing accent, and then a Greek Alpha with a soft 
 breathing accent.
 
 For Python 2.6, this works great:
 
 # -*- coding: utf-8 -*-
 from Tkinter import *
 root = Tk()
 Label(root, text=u'\u03B1 \u1F66 \u1F00').pack()
 Button(root, text=u'\u03B1 \u1F66 \u1F00').pack()
 root.mainloop()
 
 However, for Python 3.1.2, the button gets the correct characters, 
 but the label only displays the first Greek Alpha character.  
 The other 2 characters look like Chinese characters followed by 
 an empty box.  Here's the code for Python 3:
 
 # -*- coding: utf-8 -*-
 from tkinter import *
 root = Tk()
 Label(root, text='\u03B1 \u1F66 \u1F00').pack()
 Button(root, text='\u03B1 \u1F66 \u1F00').pack()
 root.mainloop()
 
 I've done some research and am wondering if it is 
 because Python 2.6 comes with tk version 8.5, while Python 3.1.2 
 comes with tk version 8.4?  I'm running this on OS X 10.6.4.

Most likely.  Apparently you're using the Apple-supplied Python 2.6 
which, as you say, uses Tk 8.5.  If you had installed the python.org 
2.6, it would likely fail for you in the same way as 3.1, since both use 
Tk 8.4.  (They both fail for me.)

 If I need to upgrade tk to 8.5, is it best to upgrade it or just
 install 'tiles'?  From my readings it looks like upgrading to
 8.5 can be a pain due to OS X still pointing back to 8.4.  I
 haven't tried it yet in case someone might have an easier
 solution.

OS X 10.6 comes with both Tk 8.4 and 8.5.  The problem is that the 
Python Tkinter(2.6) or tkinter(3.1) is linked at build time, not install 
time, to one or the other.   You would need to at least rebuild and 
relink tkinter for 3.1 to use Tk 8.5, which means downloading and 
building Python from source.  New releases of python.org installers are 
now coming in two varieties: the second will be only for 10.6 or later 
and will link with Tk 8.5.  The next new release of Python 3 is likely 
months away, though.  In the meantime, a simpler solution might be to 
download and install the ActiveState Python 3.1 for OS X which does use 
Tk 8.5.  And your test case works for me with it.
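
To see which Tcl/Tk a given Python was linked against, a quick check along these lines should work, assuming Python 3 (the module is spelled `Tkinter` in Python 2). The function name `tk_versions` is illustrative.

```python
def tk_versions():
    """Return (Tcl version, Tk version) as floats, or None without tkinter."""
    try:
        import tkinter
    except ImportError:
        # Python was built without Tk support.
        return None
    # Module-level constants; reading them does not open a window.
    return tkinter.TclVersion, tkinter.TkVersion


print(tk_versions())
```

A result starting with 8.4 would be consistent with the label problem described above.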

-- 
 Ned Deily,
 n...@acm.org


Another (simple) unicode question

2009-10-29 Thread Rustom Mody

Construct http://construct.wikispaces.com/ is a kick-ass binary file
structurer (written by a 21 year old!)
I thought of trying to port it to python3 but it barfs on some unicode
related stuff (after running 2to3) which I am unable to wrap my head
around.

Can anyone direct me to what I should read to try to understand this?

Re: Another (simple) unicode question

2009-10-29 Thread John Machin

On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:
 Construct http://construct.wikispaces.com/ is a kick-ass binary file
 structurer (written by a 21 year old!)
 I thought of trying to port it to python3 but it barfs on some unicode
 related stuff (after running 2to3) which I am unable to wrap my head
 around.

 Can anyone direct me to what I should read to try to understand this?

"unicode related stuff" is rather vague. Have you read the Python
Unicode HOWTO? Joel Spolsky's article?

http://www.amk.ca/python/howto/unicode
http://www.joelonsoftware.com/articles/Unicode.html

In any case, it's a debugging problem, isn't it? Could you possibly
consider telling us the error message, the traceback, a few lines of
the 3.x code around where the problem is, and the corresponding 2.x
lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?

Re: Another (simple) unicode question

2009-10-29 Thread Carl Banks

On Oct 29, 4:02 am, Rustom Mody rustompm...@gmail.com wrote:
 Construct http://construct.wikispaces.com/ is a kick-ass binary file
 structurer (written by a 21 year old!)
 I thought of trying to port it to python3 but it barfs on some unicode
 related stuff (after running 2to3) which I am unable to wrap my head
 around.

2to3 isn't a general Python 2 to Python 3 translator.  You can't pass
any old Python 2.x code through 2to3 and expect it to work.  Rather,
you have to write the Python 2.x code in a subset of Python that I
call "transitional dialect".  In order to port to Python 3 using 2to3,
you first have to port it to this transitional dialect.

If Unicode is the issue, one thing you should do is to explicitly
classify all strings as binary or text in Python 2.x.  This means to
change str() to unicode() or bytes(), whichever is appropriate, and to
change "" to u"" or b"".
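
A small sketch of what explicitly classified strings look like. The `b""` and `u""` prefixes below are accepted by Python 2.6 and later and by Python 3.3 and later, so the same lines run unchanged on both sides of a port. The helper names `to_bytes` and `to_text` are made up for this example.

```python
text = u"caf\u00e9"    # text: a sequence of code points
data = b"caf\xc3\xa9"  # binary: the same word as UTF-8 bytes


def to_bytes(value, encoding="utf-8"):
    """Cross the text-to-binary boundary explicitly."""
    return value.encode(encoding)


def to_text(value, encoding="utf-8"):
    """Cross the binary-to-text boundary explicitly."""
    return value.decode(encoding)


print(to_bytes(text) == data)  # True
print(to_text(data) == text)   # True
print(len(text), len(data))    # 4 code points, 5 bytes
```

Keeping every conversion in helpers like these makes the remaining implicit conversions easy to find.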


Carl Banks

Re: Another (simple) unicode question

2009-10-29 Thread Scott David Daniels


John Machin wrote:

On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:...

I thought of trying to port it to python3 but it barfs on some unicode
related stuff (after running 2to3) which I am unable to wrap my head
around.

Can anyone direct me to what I should read to try to understand this?


to which John replied with some good links to start, and then:


In any case, it's a debugging problem, isn't it? Could you possibly
consider telling us the error message, the traceback, a few lines of
the 3.x code around where the problem is, and the corresponding 2.x
lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?


Also consider how 2to3 translates the problem section(s).

--Scott David Daniels
scott.dani...@acm.org

Re: a simple unicode question

2009-10-28 Thread Gabriel Genellina

En Wed, 28 Oct 2009 02:28:01 -0300, Chris Jones cjns1...@gmail.com  
escribió:

On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:

Chris Jones wrote:

Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!
Unicode does not define any encoding;


RFC 3629:
ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.


what it defines is code-points for  characters which is not related to
how characters are encoded in files or network transmission.


In other words, Unicode is not related to any encoding .. and yet the
UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

How is that possible?


Start reading "The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)", by
Joel Spolsky.

http://www.joelonsoftware.com/articles/Unicode.html

--
Gabriel Genellina


Re: a simple unicode question

2009-10-28 Thread Tim Arnold

Chris Jones cjns1...@gmail.com wrote in message 
news:mailman.2149.1256707687.2807.python-l...@python.org...
 On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:
 Chris Jones wrote:

 [..]

 Best part of Unicode is that there are multiple encodings, right? ;-)

 No, the best part about Unicode is there is no encoding!

 Unicode does not define any encoding;

 RFC 3629:

 ISO/IEC 10646 and Unicode define several encoding forms of their
 common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.

 what it defines is code-points for  characters which is not related to
 how characters are encoded in files or network transmission.

 In other words, Unicode is not related to any encoding .. and yet the
 UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

 How is that possible?

 CJ

When I first saw it, my first thought was that the subjectline was an 
oxymoron.

--Tim Arnold



Re: a simple unicode question

2009-10-27 Thread Chris Jones

On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:
 Chris Jones wrote:

[..]

 Best part of Unicode is that there are multiple encodings, right? ;-)

 No, the best part about Unicode is there is no encoding!

 Unicode does not define any encoding; 

RFC 3629:

ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.

 what it defines is code-points for  characters which is not related to
 how characters are encoded in files or network transmission.

In other words, Unicode is not related to any encoding .. and yet the
UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

How is that possible?

CJ

Re: a simple unicode question

2009-10-27 Thread Lie Ryan


Chris Jones wrote:

On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]


Characters outside the 16-bit range aren't supported on all builds.
They won't be supported on most Windows builds, as Windows uses 16-bit
Unicode extensively:


I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)


No, the best part about Unicode is there is no encoding!

Unicode does not define any encoding; what it defines is code-points for 
characters which is not related to how characters are encoded in files 
or network transmission.
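
The distinction can be sketched in a few lines, assuming Python 3: a character has one code point, and each encoding form turns that same code point into a different byte sequence. The variable names are illustrative.

```python
ch = "\u03b1"  # GREEK SMALL LETTER ALPHA

# The code point is a number, independent of any encoding.
print(hex(ord(ch)))  # 0x3b1

# Each encoding form serialises that one number differently.
utf8 = ch.encode("utf-8")       # b'\xce\xb1'
utf16 = ch.encode("utf-16-be")  # b'\x03\xb1'
utf32 = ch.encode("utf-32-be")  # b'\x00\x00\x03\xb1'
print(utf8, utf16, utf32)

# Decoding any of them yields the same code point again.
print(utf8.decode("utf-8") == utf16.decode("utf-16-be") == ch)
```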


Re: a simple unicode question

2009-10-22 Thread Gabriel Genellina


En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió:


On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:

beSTEfar a écrit :
(snip)
  When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.


But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.


I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...  
If you don't, you are not reading XML (or HTML), only a specific file  
format that resembles XML but actually isn't.


--
Gabriel Genellina


Re: a simple unicode question

2009-10-22 Thread Chris Jones

On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]

 Characters outside the 16-bit range aren't supported on all builds.
 They won't be supported on most Windows builds, as Windows uses 16-bit
 Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

Moot point on xterm anyway, since you'd be hard put to it to find a
decent terminal font that covers anything outside the BMP.

   Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
   (Intel)] on win32

   >>> unichr(0x10000)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: unichr() arg not in range(0x10000) (narrow Python build)
 
 Note that narrow builds do understand names outside of the BMP, and
 generate surrogate pairs for them:
 
   >>> u'\N{LINEAR B SYLLABLE B008 A}'
   u'\U00010000'
   >>> len(_)
   2
 
 Whether or not using surrogates in this context is a good idea is open to
 debate. What's the advantage of a multi-wchar string over a multi-byte
 string?

I don't understand this last remark, but since I'm only a GNU/Linux
hobbyist, I guess it doesn't make much difference.

Thanks for the code snippet and comments.
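
For readers on a current interpreter, a sketch of the same experiment, assuming Python 3.3 or later, where narrow builds no longer exist: the string holds one code point, and the surrogate pair only appears once the text is encoded as UTF-16.

```python
import unicodedata

ch = "\U00010000"  # first code point beyond the 16-bit range (the BMP)

print(unicodedata.name(ch))  # LINEAR B SYLLABLE B008 A
print(len(ch))               # 1 on Python 3.3+, not 2

# UTF-16 needs two 16-bit code units (a surrogate pair) for this code point.
utf16 = ch.encode("utf-16-be")
print(utf16)                 # b'\xd8\x00\xdc\x00'
print(len(utf16) // 2)       # 2 code units

# UTF-8 needs four bytes for the same code point.
utf8 = ch.encode("utf-8")
print(utf8)                  # b'\xf0\x90\x80\x80'
```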

CJ

Re: a simple unicode question

2009-10-22 Thread rurpy

On 10/22/2009 03:23 AM, Gabriel Genellina wrote:
 En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió:

 On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
 42.desthuilli...@websiteburo.invalid wrote:
 beSTEfar a écrit :
 (snip)
   When parsing strings, use Regular Expressions.

 And now you have _two_ problems <g>

 For some simple parsing problems, Python's string methods are powerful
 enough to make REs overkill. And for any complex enough parsing (any
 recursive construct for example - think XML, HTML, any programming
 language etc), REs are just NOT enough by themselves - you need a full
 blown parser.

 But keep in mind that many XML, HTML, etc parsing problems
 are restricted to a subset where you know the nesting depth
 is limited (often to 0 or 1), and for that large set of
 problems, RE's *are* enough.

 I don't think so. Nesting isn't the only problem. RE's cannot handle
 comments, by example. And you must support unquoted attributes, single and
 double quotes, any attribute ordering, empty tags, arbitrary whitespace...
 If you don't, you are not reading XML (or HTML), only a specific file
 format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.

Re: a simple unicode question

2009-10-22 Thread Gabriel Genellina


En Thu, 22 Oct 2009 17:08:21 -0300, ru...@yahoo.com escribió:


On 10/22/2009 03:23 AM, Gabriel Genellina wrote:

En Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com escribió:


On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:

beSTEfar a écrit :
(snip)
  When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.


But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.


I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, by example. And you must support unquoted attributes, single  
and
double quotes, any attribute ordering, empty tags, arbitrary  
whitespace...

If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.


OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.


Given that using a real XML parser like ElementTree is as easy as (or even  
easier than) building a regular expression, and more robust, and more  
likely to survive small changes in the input format, why use the worse  
solution?

RE's are good in solving some problems, but parsing XML isn't one of those.
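
A minimal sketch of that claim, using the standard library's ElementTree on a made-up document. It contains a comment with a tag-like string inside it, mixed quote styles, different attribute orders and irregular whitespace, all of which a simple regular expression would have to handle by hand. The document and the function name `place_names` are invented for this example.

```python
import xml.etree.ElementTree as ET

doc = """<places>
  <!-- a comment that a naive regex could trip over: <place name="fake"/> -->
  <place name='Vienna' lat="48.22"/>
  <place lat="52.52"   name="Berlin" />
</places>"""


def place_names(xml_text):
    """Return the name attribute of every real <place> element."""
    root = ET.fromstring(xml_text)
    # The parser skips the comment and normalises quotes and attribute order.
    return [place.get("name") for place in root.iter("place")]


print(place_names(doc))  # ['Vienna', 'Berlin']
```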

--
Gabriel Genellina


Re: a simple unicode question

2009-10-21 Thread Mark Tolonen



George Trojan george.tro...@noaa.gov wrote in message 
news:hbktk6$8b...@news.nems.noaa.gov...

Thanks for all suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

George

Scott David Daniels wrote:

Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
  s = '''48\xc2\xb0 13' 16.80 N'''
  q = s.decode('utf-8')
  degrees, rest = q.split(u'\N{DEGREE SIGN}')
  print degrees
48
  print rest
 13' 16.80 N

And if you are unsure of the name to use:
  import unicodedata
  unicodedata.name(u'\xb0')
'DEGREE SIGN'


It wouldn't be your favorite way if you were typing Chinese:

x = u'我是美国人。'

vs.

x = (u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}'
     u'\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}'
     u'\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}')


;^) Mark






Re: a simple unicode question

2009-10-21 Thread Scott David Daniels


George Trojan wrote:

Scott David Daniels wrote:

...

And if you are unsure of the name to use:
  import unicodedata
  unicodedata.name(u'\xb0')
'DEGREE SIGN'


 Thanks for all suggestions. It took me a while to find out how to
 configure my keyboard to be able to type the degree sign. I prefer to
 stick with pure ASCII if possible.
 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

I thought the mention of unicodedata would make it clear.

 import sys, unicodedata
 # Python 2: print every named code point whose name mentions 'tortoise'.
 for n in xrange(sys.maxunicode+1):
     try:
         nm = unicodedata.name(unichr(n))
     except ValueError:
         pass
     else:
         if 'tortoise' in nm.lower():
             print n, nm


--Scott David Daniels
scott.dani...@acm.org
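
For readers on Python 3, the same search might look like the sketch below.
The helper name `find_names` is made up for illustration; `unicodedata.name`
accepts a default as its second argument, which replaces the try/except.

```python
import sys
import unicodedata

def find_names(keyword):
    """Return (code point, name) pairs whose name contains keyword."""
    keyword = keyword.upper()
    matches = []
    for cp in range(sys.maxunicode + 1):
        # The second argument is returned for code points without a name.
        name = unicodedata.name(chr(cp), "")
        if keyword in name:
            matches.append((cp, name))
    return matches

for cp, name in find_names("degree"):
    print("U+%04X %s" % (cp, name))
```

The exact list printed depends on the Unicode database version bundled with
the interpreter.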

Re: a simple unicode question

2009-10-21 Thread Chris Jones

On Wed, Oct 21, 2009 at 12:20:35AM EDT, Nobody wrote:
 On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote:

[..]

  Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? 
 
 You can get them from the unicodedata module, e.g.:
 
   import unicodedata
   for i in xrange(0x1):
     n = unicodedata.name(unichr(i),None)
     if n is not None:
       print i, n

Python rocks!

Just curious, why did you choose to set the upper boundary at 0x?

CJ

Re: a simple unicode question

2009-10-21 Thread Bruno Desthuilliers


beSTEfar wrote:
(snip)
 When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful 
enough to make REs overkill. And for any complex enough parsing (any 
recursive construct for example - think XML, HTML, any programming 
language etc), REs are just NOT enough by themselves - you need a full 
blown parser.



Re: a simple unicode question

2009-10-21 Thread Nobody

On Wed, 21 Oct 2009 05:16:56 -0400, Chris Jones wrote:

  Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? 
 
 You can get them from the unicodedata module, e.g.:
 
  import unicodedata
  for i in xrange(0x1):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
      print i, n
 
 Python rocks!
 
 Just curious, why did you choose to set the upper boundary at 0x?

Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit 
(Intel)] on
win32
 unichr(0x1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x1) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

 u'\N{LINEAR B SYLLABLE B008 A}'
u'\U0001'
 len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?
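
A small sketch of what the surrogate pair looks like, assuming Python 3.3 or
later, where the narrow/wide build distinction no longer exists and a string
holds whole code points. The pair only appears once the character is encoded
as UTF-16:

```python
import unicodedata

ch = unicodedata.lookup("LINEAR B SYLLABLE B008 A")
print(hex(ord(ch)), len(ch))        # a single code point on Python 3.3+

# Encode to UTF-16 (big-endian, no BOM) and split into 16-bit units.
raw = ch.encode("utf-16-be")
units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
print([hex(u) for u in units])
```

On a narrow Python 2 build the string itself held these two 16-bit units,
which is why len() reported 2 in the session above.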


Re: a simple unicode question

2009-10-21 Thread rurpy

On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:
 beSTEfar wrote:
 (snip)
   When parsing strings, use Regular Expressions.

 And now you have _two_ problems <g>

 For some simple parsing problems, Python's string methods are powerful
 enough to make REs overkill. And for any complex enough parsing (any
 recursive construct for example - think XML, HTML, any programming
 language etc), REs are just NOT enough by themselves - you need a full
 blown parser.

But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.

Re: a simple unicode question

2009-10-21 Thread Terry Reedy


Nobody wrote:


Just curious, why did you choose to set the upper boundary at 0x?


Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit 
(Intel)] on
win32
 unichr(0x1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x1) (narrow Python build)


In Python 3, if not 2.6, chr(0x1) (what used to be unichr()) works 
fine on Windows, and generates the appropriate surrogate pair.



Re: a simple unicode question

2009-10-20 Thread Scott David Daniels


Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
 s = '''48\xc2\xb0 13' 16.80 N'''
 q = s.decode('utf-8')
 degrees, rest = q.split(u'\N{DEGREE SIGN}')
 print degrees
48
 print rest
 13' 16.80 N

And if you are unsure of the name to use:
 import unicodedata
 unicodedata.name(u'\xb0')
'DEGREE SIGN'

--Scott David Daniels
scott.dani...@acm.org

Re: a simple unicode question

2009-10-20 Thread George Trojan

Thanks for all suggestions. It took me a while to find out how to 
configure my keyboard to be able to type the degree sign. I prefer to 
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found 
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt

Is that the place to look?

George

Scott David Daniels wrote:

Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
  s = '''48\xc2\xb0 13' 16.80 N'''
  q = s.decode('utf-8')
  degrees, rest = q.split(u'\N{DEGREE SIGN}')
  print degrees
48
  print rest
 13' 16.80 N

And if you are unsure of the name to use:
  import unicodedata
  unicodedata.name(u'\xb0')
'DEGREE SIGN'

--Scott David Daniels
scott.dani...@acm.org


Re: a simple unicode question

2009-10-20 Thread Nobody

On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote:

 Thanks for all suggestions. It took me a while to find out how to 
 configure my keyboard to be able to type the degree sign. I prefer to 
 stick with pure ASCII if possible.
 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found 
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x1):
  n = unicodedata.name(unichr(i),None)
  if n is not None:
    print i, n


Re: a simple unicode question

2009-10-20 Thread Martin v. Löwis

 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

Correct - you are supposed to fill in a Unicode character name into
the \N escape. The specific list of names depends on the version of
the UCD which was used in the specific Python version, but the
characters you are likely interested in probably had been defined
forever.

Regards,
Martin
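
To check which UCD version a given interpreter carries, something like the
following sketch should do (shown for Python 3; the printed version string
varies between releases):

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
print(unicodedata.unidata_version)

# A character name resolves the same way in a \N escape and in lookup().
degree = unicodedata.lookup("DEGREE SIGN")
print(degree == "\N{DEGREE SIGN}", unicodedata.name(degree))
```

`unicodedata.lookup` raises KeyError for a name the bundled database does not
know, which is a quick way to test whether a newer name is available.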

a simple unicode question

2009-10-19 Thread George Trojan

A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

 encoding='iso-8859-1'
 q=s.decode(encoding)
 q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
 r=q.split()[0]
 int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George

Re: a simple unicode question

2009-10-19 Thread Diez B. Roggisch


George Trojan wrote:
A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

  encoding='iso-8859-1'
  q=s.decode(encoding)
  q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
  r=q.split()[0]
  int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?


Instead of this rather convoluted way to specify a degree-sign, better do

 # -*- coding: utf-8 -*-
 ...
 int(r[:r.find(u"°")])


Please note that the utf-8 encoding has *nothing* to do with your string
- it's just the source-file encoding. Of course your editor must actually
save the file as utf-8. Or you can use any other encoding you like, as
long as the coding line matches it.


Diez

Re: a simple unicode question

2009-10-19 Thread beSTEfar

On 19 Oct, 21:07, George Trojan george.tro...@noaa.gov wrote:
 A trivial one, this is the first time I have to deal with Unicode. I am
 trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is
 iso-8859-1. To get the degrees I did
   encoding='iso-8859-1'
   q=s.decode(encoding)
   q.split()
 [u'48\xc2\xb0', u"13'", u'16.80', u'N']
   r=q.split()[0]
   int(r[:r.find(unichr(ord('\xc2')))])
 48

 Is there a better way of getting the degrees?

 George

When parsing strings, use Regular Expressions. If you don't know how
to, spend some time teaching yourself how to - well spent time! A
great tool for playing around with REs is KODOS.

For the problem at hand you can e.g.:

  import re
  degrees = int(re.findall(r'\d+', s)[0])

In essence, that finds every run of consecutive digits, takes the
first run and int()s it. No need to know or care that the string is
Unicode, or about the underlying encoding of the
charset.
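
Extending the same idea to all three fields, a sketch assuming Python 3 and
the coordinate string as it appears in this thread (written with a \N escape
so the source stays pure ASCII):

```python
import re

s = "48\N{DEGREE SIGN} 13' 16.80 N"

# Grab every run of digits, optionally followed by a decimal part.
fields = re.findall(r"\d+(?:\.\d+)?", s)
degrees, minutes, seconds = int(fields[0]), int(fields[1]), float(fields[2])
print(degrees, minutes, seconds)
```

This works on the decoded text and on the wrongly decoded text alike, since
the digits are ASCII either way; it says nothing about whether the degree
sign itself was decoded correctly.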

Re: a simple unicode question

2009-10-19 Thread Mark Tolonen



George Trojan george.tro...@noaa.gov wrote in message 
news:hbidd7$i9...@news.nems.noaa.gov...
A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

  encoding='iso-8859-1'
  q=s.decode(encoding)
  q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
  r=q.split()[0]
  int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If you 
type non-ASCII characters in source code, make sure to declare the encoding 
the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)

-Mark
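
The difference between the two decodings can be seen directly. This is a
sketch in Python 3, where the undecoded data is a bytes object; the byte
values are the ones shown in the session quoted above:

```python
raw = b"48\xc2\xb0 13' 16.80 N"    # bytes as they appear in the thread

wrong = raw.decode("iso-8859-1")   # every byte becomes one character
right = raw.decode("utf-8")        # \xc2\xb0 becomes one DEGREE SIGN

print(ascii(wrong))
print(ascii(right))

degrees = int(right.split("\N{DEGREE SIGN}")[0])
print(degrees)
```

Decoding as iso-8859-1 never fails, because every byte maps to some
character, so the mistake only shows up as a stray extra character in front
of the degree sign.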



Re: python 3.1 unicode question

2009-09-16 Thread Duncan Booth

jeffunit j...@jeffunit.com wrote:

That looks like a surrogate escape (See PEP 383) 
http://www.python.org/dev/peps/pep-0383/.  It indicates the wrong 
encoding was used to decode the filename.
 
 That seems likely. How do I set the encoding to something correct to 
 decode the filename?
 
 Clearly windows knows how to display it.
 I suspect since I compiled python with cygwin, that it is using a POSIX
 standard, rather than a windows specific standard. Of course ideally, I
 would like my code to work on linux as well as windows, as I back up all
 of my data to a linux machine with samba.
 
If you are running on a Linux system then the filenames are stored encoded 
as bytes but the system does not store the encoding. In fact different 
files in the same directory could use different encodings. That's why 
Python 3.1 uses the surrogate escapes so that you can at least work with 
the files even if you can't display the filenames.

If you are running on Windows and using the native Python to access an NTFS 
formatted partition then there shouldn't be a problem: the filenames are 
stored as unicode and Python uses the unicode apis. Of course you may still 
not be able to display the filenames if they contain characters not 
available in your output codepage.

If you use cygwin: a quick search on Google turned up some old discussions
implying that it uses the 8-bit APIs, which convert characters using the
current codepage and replace any they cannot handle with '?', but I have
no idea if that still applies.
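
A sketch of how the surrogate escape arises and how to get the original
bytes back, assuming Python 3.1 or later. The byte string and the guess that
the name was stored as iso-8859-1 are assumptions made for illustration:

```python
# Hypothetical raw filename bytes: Latin-1 data read under a UTF-8 locale.
raw = b"Qu\xe9_no_se_rompa_la_noche.mp3"

# PEP 383: undecodable bytes are carried through as lone surrogates.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# The escape round-trips, so the original bytes can be recovered ...
recovered = name.encode("utf-8", "surrogateescape")
# ... and decoded again once the real encoding is known (assumed here).
print(ascii(recovered.decode("iso-8859-1")))
```

The lone surrogate cannot be encoded by a normal codec, which is why print()
raised the error reported at the start of this thread.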

Re: python 3.1 unicode question

2009-09-15 Thread Mark Tolonen

jeffunit j...@jeffunit.com wrote in message 
news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com...

I wrote a program that diffs files and prints out matching file names.
I will be executing the output with sh, to delete select files.

Most of the file names are plain ascii, but about 10% of them have unicode
characters in them. When I try to print the string containing the name, I
get an exception:

'ascii' codec can't encode character '\udce9'
in position 37: ordinal not in range(128)

The string is:

'./Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3'

This is on a windows xp system, using python 3.1 which I compiled
with the cygwin linux compatibility layer tool.

Can you tell me what encoding I need to print \udce9 and how to set python
to that encoding mode?


That looks like a surrogate escape (See PEP 383) 
http://www.python.org/dev/peps/pep-0383/.  It indicates the wrong encoding 
was used to decode the filename.


-Mark



Re: python 3.1 unicode question

2009-09-15 Thread jeffunit


At 09:25 PM 9/15/2009, Mark Tolonen wrote:
jeffunit j...@jeffunit.com wrote in message 
news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com...

I wrote a program that diffs files and prints out matching file names.
I will be executing the output with sh, to delete select files.

Most of the file names are plain ascii, but about 10% of them have unicode
characters in them. When I try to print the string containing the name, I get
an exception:

'ascii' codec can't encode character '\udce9'
in position 37: ordinal not in range(128)

The string is:

'./Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3'

This is on a windows xp system, using python 3.1 which I compiled
with the cygwin linux compatibility layer tool.

Can you tell me what encoding I need to print \udce9 and how to set python to
that encoding mode?


That looks like a surrogate escape (See PEP 383) 
http://www.python.org/dev/peps/pep-0383/.  It indicates the wrong 
encoding was used to decode the filename.


That seems likely. How do I set the encoding to something correct to 
decode the filename?


Clearly windows knows how to display it.
I suspect since I compiled python with cygwin, that it is using a POSIX
standard, rather than a windows specific standard. Of course ideally, I
would like my code to work on linux as well as windows, as I back up all
of my data to a linux machine with samba.

thanks,
jeff


Re: python 3.1 unicode question

2009-09-15 Thread Chris Rebert

On Tue, Sep 15, 2009 at 9:48 PM, jeffunit j...@jeffunit.com wrote:
 At 09:25 PM 9/15/2009, Mark Tolonen wrote:

 jeffunit j...@jeffunit.com wrote in message
 news:20090915144123964.ljka6...@cdptpa-omta01.mail.rr.com...

 I wrote a program that diffs files and prints out matching file names.
 I will be executing the output with sh, to delete select files.

 Most of the file names are plain ascii, but about 10% of them have
 unicode characters in them. When I try to print the string containing
 the name, I get an exception:

 'ascii' codec can't encode character '\udce9'
 in position 37: ordinal not in range(128)

 The string is:

 './Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3'

 This is on a windows xp system, using python 3.1 which I compiled
 with the cygwin linux compatibility layer tool.

 Can you tell me what encoding I need to print \udce9 and how to set
 python to
 that encoding mode?

 That looks like a surrogate escape (See PEP 383)
 http://www.python.org/dev/peps/pep-0383/.  It indicates the wrong encoding
 was used to decode the filename.

 That seems likely. How do I set the encoding to something correct to decode
 the filename?

 Clearly windows knows how to display it.
 I suspect since I compiled python with cygwin, that it is using a POSIX
 standard, rather than a windows specific standard. Of course ideally, I
 would like my code to work on linux as well as windows, as I back up all
 of my data to a linux machine with samba.

Have you perhaps tried using the native Windows version of Python?

Cheers,
Chris
--
http://blog.rebertia.com

Re: (Simple?) Unicode Question

2009-08-30 Thread Nobody

On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote:

 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.
 
 Did I mention UTF-8?
 
 Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

I don't think I've ever seen a terminal (whether an emulator running on a
PC or a hardware terminal) which supports anything like the entire Unicode
repertoire, along with right-to-left writing, complex scripts, etc. Even
support for double-width characters is uncommon.

If your terminal can't handle anything outside of ISO-8859-1, there isn't
any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix
tty driver will delete the last *byte* from the input buffer when you
press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard).

Historically, terminal I/O has tended to revolve around unibyte encodings,
with everything except the endpoints being encoding-agnostic. Anything
which falls outside of that is a dog's breakfast; it's no coincidence
that the word for messed-up text (arising from an encoding mismatch)
was borrowed from Japanese (mojibake).

Life is simpler if you can use a unibyte encoding. Apart from anything
else, the failure modes tend to be harmless. E.g. you get the wrong glyph
rather than two glyphs where you expected one. On a 7-bit channel, you get
the wrong printable character rather than a control character (this is why
ISO-8859-* reserves \x80-\x9F as control codes rather than using them as
printable characters).
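
The backspace problem can be reproduced in a few lines. This is a sketch in
Python 3 with an arbitrary example word; it only mimics a driver that drops
the last byte and is not a model of any real tty driver:

```python
text = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"

# Unibyte encoding: dropping one byte drops exactly one character.
latin = text.encode("iso-8859-1")[:-1]
print(latin.decode("iso-8859-1"))

# UTF-8: dropping one byte leaves a truncated two-byte sequence.
utf8 = text.encode("utf-8")[:-1]
try:
    utf8.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```

In the unibyte case the result is still valid text; in the UTF-8 case the
remaining bytes no longer decode at all.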

 And Unicode font is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.
 
 I never mentioned Unicode font either. In any case, there's no reason 
 why a skillful designer can't make a single font which covers the entire 
 Unicode range in a consistent style.

Consistency between unrelated scripts is neither realistic nor
desirable.

E.g. Latin fonts tend to use uniform stroke widths unless they're
specifically designed to look like handwriting, whereas Han fonts tend to
prefer variable-width strokes which reflect the direction.

 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.
 
 Surely the main advantage of Unicode is that it gives you a full and 
 consistent range of characters not limited to the 128 characters provided 
 by ASCII?

Nothing stops you from using other encodings, or from using multiple
encodings. But using multiple encodings means keeping track of the
encodings. This isn't impossible, and it may produce better results (e.g.
no information loss from Han unification), but it can be a lot more work.


Re: (Simple?) Unicode Question

2009-08-29 Thread Thorsten Kampe

* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?

Python needs to know if you are processing the text.
 
 I may be wrong, but I believe that's part of the idea behind the separation
 of string and bytes types in Python 3.x. I believe, if you are using  
 Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode 
text and for that you have to know the encoding.

Thorsten

Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano

On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote:

 * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?
 
 Python needs to know if you are processing the text.

Python only needs to know when you convert the text to or from bytes. I 
can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings.

So long as your terminal has a sensible encoding, and you have a good 
quality font, you should be able to print any string you can create.



 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to 
text, and vice versa. Admittedly, the most common time you need to do 
that is when reading input from files, but if all your text strings are 
generated by Python, and not output anywhere, you shouldn't need to care 
about encodings.

If all your text contains nothing but ASCII characters, you should never 
need to worry about encodings at all.


-- 
Steven
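
A sketch of that boundary rule in Python 3. The sample bytes and the choice
of iso-8859-1 for output are assumptions made for illustration:

```python
raw_in = b"na\xc3\xafve caf\xc3\xa9"   # bytes as they might arrive from a file

# Input boundary: the one place the source encoding has to be known.
text = raw_in.decode("utf-8")

# In between, ordinary string operations need no encoding at all.
shouted = " ".join(word.upper() for word in text.split())

# Output boundary: pick whatever encoding the destination expects.
raw_out = shouted.encode("iso-8859-1")
print(ascii(shouted), raw_out)
```

Only the first and last steps mention an encoding; everything between them
works on characters.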

Re: (Simple?) Unicode Question

2009-08-29 Thread Nobody

On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:

 Python only needs to know when you convert the text to or from bytes. I 
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good 
 quality font, you should be able to print any string you can create.

UTF-8 isn't a particularly sensible encoding for terminals.

And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
together and stuff them into a TTF file; that doesn't make them a "font",
though.

 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to 
 text, and vice versa. Admittedly, the most common time you need to do 
 that is when reading input from files, but if all your text strings are 
 generated by Python, and not output anywhere, you shouldn't need to care 
 about encodings.

Why would you generate text strings and not output them anywhere?

The main advantage of using Unicode internally is that you can associate
encodings with the specific points where data needs to be converted
to/from bytes, rather than having to carry the encoding details around the
program.


Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano

On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote:

 On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:
 
 Python only needs to know when you convert the text to or from bytes. I
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.

Did I mention UTF-8?

Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?


 And Unicode font is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.

I never mentioned "Unicode font" either. In any case, there's no reason 
why a skillful designer can't make a single font which covers the entire 
Unicode range in a consistent style.


 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo
 jumbo at all ;-)
 
 Nothing has changed in that regard. You still need to decode and
 encode text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to
 text, and vice versa. Admittedly, the most common time you need to do
 that is when reading input from files, but if all your text strings are
 generated by Python, and not output anywhere, you shouldn't need to
 care about encodings.
 
 Why would you generate text strings and not output them anywhere?

Who knows? It doesn't matter -- the point is that you can if you want to. 
You only need to worry about encodings at input and output, therefore 
logically if you don't do I/O you can process strings all day long and 
never worry about encodings at all.


 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.

Surely the main advantage of Unicode is that it gives you a full and 
consistent range of characters not limited to the 128 characters provided 
by ASCII?



-- 
Steven

(Simple?) Unicode Question

2009-08-27 Thread Shashank Singh

Hi All!

I have a very simple (and probably stupid) question eluding me.
When exactly is the char-set information needed?

To make my question clear consider reading a file.
While reading a file, all I get is basically an array of bytes.

Now suppose a file has 10 bytes in it (all is data, no metadata,
forget the BOM and stuff for a little while). I read it into an array of 10
bytes, replace, say, the 2nd byte and write all the bytes back to a new
file.

Do I need the character encoding mumbo jumbo anywhere in this?

Further, does anything except a printing device need to know the
encoding of a piece of text? I mean, as long as we are not trying
to get a symbolic representation of a text or get the i-th character
of it, all we need to do is to carry the intended encoding as
auxiliary information alongside the data stored as a byte array.

Right?

--shashank

Re: (Simple?) Unicode Question

2009-08-27 Thread Rami Chowdhury


Further, does anything, except a printing device need to know the
encoding of a piece of text?


I may be wrong, but I believe that's part of the idea behind the separation
of string and bytes types in Python 3.x. I believe, if you are using  
Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)


If you're using Python 2.x, though, I believe if you simply set the file  
opening mode to binary then data you read() should still be treated as an  
array of bytes, although you may encounter issues trying to access the  
n'th character.
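
For the ten-byte case in the question, a sketch in Python 3 (the file names
and byte values are made up for illustration). Nothing in it mentions an
encoding, because the data is never treated as text:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.bin")    # hypothetical file names
dst = os.path.join(workdir, "out.bin")

with open(src, "wb") as f:
    f.write(bytes(range(10)))            # ten arbitrary bytes

with open(src, "rb") as f:               # binary mode: no decoding happens
    data = bytearray(f.read())

data[1] = 0xFF                           # replace the second byte
with open(dst, "wb") as f:
    f.write(data)

with open(dst, "rb") as f:
    result = f.read()
print(result)
```

The encoding only becomes relevant if "the second byte" was really meant as
"the second character", since a character may occupy more than one byte.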


Please do correct me if I'm wrong, anyone.

On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh  
shashank.sunny.si...@gmail.com wrote:



Hi All!

I have a very simple (and probably stupid) question eluding me.
When exactly is the char-set information needed?

To make my question clear consider reading a file.
While reading a file, all I get is basically an array of bytes.

Now suppose a file has 10 bytes in it (all is data, no metadata,
forget the BOM and stuff for a little while). I read it into an array of 10
bytes, replace, say, the 2nd byte and write all the bytes back to a new
file.

Do I need the character encoding mumbo jumbo anywhere in this?

Further, does anything except a printing device need to know the
encoding of a piece of text? I mean, as long as we are not trying
to get a symbolic representation of a text or get the i-th character
of it, all we need to do is to carry the intended encoding as
auxiliary information alongside the data stored as a byte array.

Right?

--shashank




--
Rami Chowdhury
Never attribute to malice that which can be attributed to stupidity --  
Hanlon's Razor

408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)

Re: (Simple?) Unicode Question

2009-08-27 Thread Albert Hopkins

On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
 Hi All!
 
 I have a very simple (and probably stupid) question eluding me.
 When exactly is the char-set information needed?
 
 To make my question clear consider reading a file.
 While reading a file, all I get is basically an array of bytes.
 
 Now suppose a file has 10 bytes in it (all is data, no metadata,
 forget the BOM and stuff for a little while). I read it into an array
 of 10
 bytes, replace, say, 2nd bytes and write all the bytes back to a new
 file. 
 
 Do i need the character encoding mumbo jumbo anywhere in this?
 
 Further, does anything, except a printing device need to know the
 encoding of a piece of text? I mean, as long as we are not trying
 to get a symbolic representation of a text or get ith character
 of it, all we need to do is to carry the intended encoding as
 an auxiliary information to the data stored as byte array.

If you are just reading and writing bytes then you are just reading and
writing bytes.  Where you need to worry about unicode, etc. is when you
start treating a series of bytes as TEXT (e.g. how many *characters* are
in this byte array).* 

This is no different, IMO, than treating a byte stream vs. an image file.
You don't need to worry about resolution, palette, bit-depth, etc. if
you are only treating it as a stream of bytes.  The only difference between
the two is that in Python unicode is a built-in type and image
isn't ;)

* Just make sure that if you are manipulating byte streams independent
of their textual representation that you open files, e.g., in binary
mode.
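
A minimal Python 3 sketch of the byte-level round trip from the question, assuming a throwaway temporary directory and made-up file names (in.bin, out.bin). Both files are opened in binary mode, so no codec is involved anywhere:

```python
import os
import tempfile

# Ten arbitrary bytes standing in for the file contents (hypothetical data).
original = bytes([0x48, 0xE9, 0x6C, 0x6C, 0x6F, 0x00, 0xFF, 0x10, 0x20, 0x30])

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")    # hypothetical file names
    dst = os.path.join(tmp, "out.bin")

    with open(src, "wb") as f:           # "b" = binary: no decoding, no newline translation
        f.write(original)

    with open(src, "rb") as f:
        data = bytearray(f.read())       # mutable sequence of ints in range(256)

    data[1] = 0x65                       # replace the 2nd byte; no encoding needed

    with open(dst, "wb") as f:
        f.write(data)

    with open(dst, "rb") as f:
        result = f.read()

print(result)
```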

-a



Unicode question

2006-07-28 Thread Ben Edwards (lists)

I am using python 2.4 on Ubuntu dapper, I am working through Dive into
Python.

There are a couple of inconsistencies.

Firstly sys.setdefaultencoding('iso-8859-1') does not work, I have to do
sys.setdefaultencoding = 'iso-8859-1'

secondly the following does not give a 'UnicodeError: ASCII encoding
error:', and I would expect it to.  In fact it prints out the n with ~
above it fine:

sys.setdefaultencoding = 'ascii'
s = u'La Pe\xf1a'
print s

Any insight?
Ben



Re: Unicode question

2006-07-28 Thread Max Erickson

Ben Edwards (lists) [EMAIL PROTECTED] wrote:

 I am using python 2.4 on Ubuntu dapper, I am working through Dive
 into Python.
...
 Any insight?
 Ben


Did you follow all the instructions, or did you try to call 
sys.setdefaultencoding interactively?

See:

http://diveintopython.org/xml_processing/unicode.html#kgp.unicode.4.1


hope this helps,
max


Re: Unicode question

2006-07-28 Thread Steve M

Ben Edwards (lists) wrote:
 I am using python 2.4 on Ubuntu dapper, I am working through Dive into
 Python.

 There are a couple of inconsistencies.

 Firstly sys.setdefaultencoding('iso-8859-1') does not work, I have to do
 sys.setdefaultencoding = 'iso-8859-1'

When you run a Python script, the interpreter does some of its own
stuff before executing your script. One of the things it does is to
delete the name sys.setdefaultencoding. This means that by the time
even your first line of code runs that name no longer exists and so you
will be unable to invoke the function as in your first attempt.

The second attempt sys.setdefaultencoding = 'iso-8859-1'  is creating
a new name under the sys namespace and assigning it a string. This will
not have the desired effect, or probably any effect at all.

I have found that in order to change the default encoding with that
function, you can put the command in a file called sitecustomize.py
which, when placed in the appropriate location (which is
platform-dependent), will be called in time to have the desired effect.

So the order of events is something like:
1. Invoke Python on myscript.py
2. Python does some stuff and then executes sitecustomize.py
3. Python deletes the name sys.setdefaultencoding, thereby making the
function that was so-named inaccessible.
4. Python then begins executing myscript.py.


Regarding the location of sitecustomize.py, on Windows it is
C:\Python24\Lib\sitecustomize.py.

My guess is that you should put it in the same directory as the bulk of
the Python standard library files. (Also in that directory is a
subdirectory called site-packages, where you can put custom modules
that will be available for import from any of your scripts.)


Re: Unicode question

2006-07-28 Thread Martin v. Löwis

Ben Edwards (lists) wrote:
 Firstly sys.setdefaultencoding('iso-8859-1') does not work, I have to do
 sys.setdefaultencoding = 'iso-8859-1'

That works, but has no effect. You bind the variable
sys.setdefaultencoding to some value, but that value is never used for
anything (do sys.getdefaultencoding() to see what I mean). You could
just as well write

sys.standardkodierung = 'iso-8859-1'
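
A small sketch of that point, written for Python 3 as an assumption (there sys.setdefaultencoding no longer exists at all, and sys.getdefaultencoding() reports 'utf-8'). Binding a made-up attribute on sys changes nothing the interpreter consults:

```python
import sys

before = sys.getdefaultencoding()

# Binding an arbitrary attribute on the sys module is allowed, but nothing reads it.
sys.standardkodierung = "iso-8859-1"
after = sys.getdefaultencoding()

unchanged = (before == after)
del sys.standardkodierung             # tidy up the made-up attribute

print(before, after, unchanged)
```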

 secondly the following does not give a 'UnicodeError: ASCII encoding
 error:', and I would expect it to.  In fact it prints out the n with ~
 above it fine:
 
 sys.setdefaultencoding = 'ascii'
 s = u'La Pe\xf1a'
 print s
 
 Any insight?

The print statement uses sys.stdout.encoding, not the default encoding.
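
To illustrate that it is the output stream's encoding that decides, here is a Python 3 sketch that uses io.TextIOWrapper over an in-memory buffer as a stand-in for sys.stdout. The helper name write_as is hypothetical:

```python
import io

s = "La Pe\xf1a"

def write_as(encoding):
    """Write s through a text stream that encodes with the given codec."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(s)
    stream.flush()
    return raw.getvalue()

latin1_bytes = write_as("iso-8859-1")   # the stream can represent the n with tilde

try:
    write_as("ascii")                   # an ASCII stream cannot
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(latin1_bytes, ascii_failed)
```

A terminal whose encoding covers the character prints it fine, which matches what Ben saw.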

Regards,
Martin

[OT] Re: a unicode question?

2006-04-11 Thread Peter Otten

John Machin wrote:

 ... and yes Peter, info travels faster also from China than it does
 from Armenia :-())

Q: Can info travel faster from Armenia than from China?
Radio Yerevan: In principle, yes. Just make sure that it doesn't go the
other way round the globe or meets some friends on the way...

Re: a unicode question?

2006-04-10 Thread Serge Orlov


[EMAIL PROTECTED] wrote:
 Mr. John Machin

 This question comes from the following code. I use PyXML to build a DOM
 tree.

 from xml.dom.ext.reader import HtmlLib
 doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
 title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
 title_string = title_elem.firstChild.data
 print title_string

 # the title_string is unicode, but it is not latin1 code, so I want to
 # change it.

Errr, but the title of the page is written in Chinese and it is not
supposed to be crammed into latin1 encoding. What are you trying to do
with the string after you squeezed Chinese into latin1?


Re: a unicode question?

2006-04-10 Thread John Machin

Errr, it gets worse: not only is the title written in Chinese, it
is encoded as gb2312 -- here is the repr() of the first few chunks:

"<html>\n<head>\n<title>\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) : \xc4\xda\xb2\xbf\xc8\xcb\xd4\xb1\xb3\xd6\xb9\xc9 - \xcb\xd1\xba\xfc\xb9\xc9\xc6\xb1</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n

and here is what you get after that_guff.decode('gb2312')

u"<html>\n<head>\n<title>\u4e2d\u56fd\u77f3\u5316(600028) : \u5185\u90e8\u4eba\u5458\u6301\u80a1 - \u641c\u72d0\u80a1\u7968</title>\n<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>\n

The first 2 characters of the title are recognisable both visually on
the browser title and in the unicode as zhong guo i.e. China.

BUT the OP's first message is interpreting that gb2312-encoded stuff as
Unicode:
s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '

*SOMEBODY* is seriously deluded, and it ain't me, and it ain't Serge
:-)
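
A Python 3 sketch of the mix-up and its repair, using the byte values from the thread. The variable names are made up for illustration:

```python
raw = b"\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) "

# What the original post effectively had: each byte read as one Latin-1 character.
mojibake = raw.decode("latin-1")

# Undo that step to get the bytes back, then decode them with the real encoding.
recovered_bytes = mojibake.encode("latin-1")
title = recovered_bytes.decode("gb2312")

print(title)
```

The encode('latin1') step is lossless here only because Latin-1 maps each code point below 256 to the byte with the same value.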

... and yes Peter, info travels faster also from China than it does
from Armenia :-())


a unicode question?

2006-04-09 Thread zdwang

Hello,
   There is a unicode string, and I want to change it to an ansi string, but
it raises an exception.
   Could you help me?

##  I want to change s1 to s2.

   s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
   
   s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '


Re: a unicode question?

2006-04-09 Thread John Machin

What do you mean by "ansi string"?

Here is a superficially not-unreasonable answer to your more specific
question:

# >>> s1 = u'\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s2 = '\xd6\xd0\xb9\xfa\xca\xaf\xbb\xaf(600028) '
# >>> s3 = s1.encode('latin1')
# >>> s2 == s3
# True

But what are you really trying to achieve? Where does your Unicode data
come from? What ranges of characters do you expect it to contain? You
need to crunch it into an 8-bit representation because ... what?


Re: a unicode question?

2006-04-09 Thread zdwang

Mr. John Machin, Thank you very much!


Re: a unicode question?

2006-04-09 Thread zdwang

Mr. John Machin

This question comes from the following code. I use PyXML to build a DOM
tree.

from xml.dom.ext.reader import HtmlLib
doc = HtmlLib.FromHtmlUrl('http://stock.business.sohu.com/q/nbcg.php?code=600028')
title_elem = doc.documentElement.getElementsByTagName("TITLE")[0]
title_string = title_elem.firstChild.data
print title_string

# the title_string is unicode, but it is not latin1 code, so I want to
# change it.


Unicode question : turn José into uJosé

2006-04-05 Thread Ian Sparks

This is probably stupid and/or misguided but supposing I'm passed a byte-string 
value that I want to be unicode, this is what I do. I'm sure I'm missing 
something very important.

Short version :

>>> s = "José" #Start with non-unicode string
>>> unicoded = eval("u'%s'" % "José")

Long version :

>>> s = "José" #Start with non-unicode string
>>> s  #Lets look at it
'Jos\xe9'
>>> escaped = s.encode('string_escape')
>>> escaped
'Jos\\xe9'
>>> unicoded = eval("u'%s'" % escaped)
>>> unicoded
u'Jos\xe9'

>>> test = u"José"   #What they should have passed me
>>> test == unicoded #Am I really getting the same thing?
True #Yay!





Re: Unicode question : turn José into uJosé

2006-04-05 Thread aurora

First of all, if you run this on the console, find out your console's  
encoding. In my case it is English Windows XP. It uses 'cp437'.

C:\>chcp
Active code page: 437

Then

>>> s = "José"
>>> u = u"Jos\u00e9" # same thing in unicode escape
>>> s.decode('cp437') == u   # use the encoding that matches your console
True


wy






Re: Unicode question : turn José into uJosé

2006-04-05 Thread ianaré

maybe a bit off topic, but how does one find the console's encoding
from within python?


Re: Unicode question : turn José into uJosé

2006-04-05 Thread John Machin

The most important thing that you are missing is that you need to know
the encoding used for the 8-bit-character string. Let's guess that it's
Latin1.
Then all you have to do is use the unicode() builtin function, or the
string decode method.
# >>> s = 'Jos\xe9'
# >>> s
# 'Jos\xe9'
# >>> u = unicode(s, 'latin1')
# >>> u
# u'Jos\xe9'
# >>> u2 = s.decode('latin1')
# >>> u2
# u'Jos\xe9'

Other important things:
(1) Using eval() is not usually the best way to do things.
(2) If your code is not entirely in ASCII, put a coding declaration
at the top of the source file.
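
A Python 3 sketch of why the encoding has to be known rather than guessed. The two codecs are the ones mentioned in this thread (Latin-1 and the cp437 console code page), and the variable names are made up:

```python
raw = b"Jos\xe9"

as_latin1 = raw.decode("latin-1")     # 'José'
as_cp437 = raw.decode("cp437")        # 0xE9 is a different character in code page 437

# A cp437 console would have produced a different byte for the same letter.
console_bytes = "Jos\xe9".encode("cp437")

print(repr(as_latin1), repr(as_cp437), console_bytes)
```

The same byte string yields different text under different codecs, and no eval() is needed in either case.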


Re: Unicode question : turn José into uJosé

2006-04-05 Thread Kent Johnson

ianaré wrote:
 maybe a bit off topic, but how does one find the console's encoding
 from within python?
 
In [1]: import sys

In [3]: sys.stdout.encoding
Out[3]: 'cp437'

In [4]: sys.stdin.encoding
Out[4]: 'cp437'

Kent

Re: Unicode question : turn José into uJosé

2006-04-05 Thread Ben Finney

Ian Sparks [EMAIL PROTECTED] writes:

 This is probably stupid and/or misguided but supposing I'm passed a
 byte-string value that I want to be unicode, this is what I do. I'm
 sure I'm missing something very important.

Perhaps you need to read one of the good Python Unicode tutorials,
such as:

URL:http://effbot.org/zone/unicode-objects.htm

 Short version :
 
  s = "José" #Start with non-unicode string

In what encoding? Once you step outside the ASCII character set, you
*must* be explicit about the encoding used for the text. Because there
is no sure way to infer it, Python refuses to guess.

If you're going to include literal non-ASCII characters in the code
(which is the simplest and most readable way), you must also tell
Python what encoding to use when it reads the source file.

URL:http://docs.python.org/ref/encodings.html

  unicoded = eval("u'%s'" % "José")

Once you know the encoding, you can simply say::

 >>> str_encoding = "iso-8859-1"
 >>> str = "José"
 >>> unicode_str = str.decode(str_encoding)

(Note that I didn't type this using the iso-8859-1 encoding, so it's
likely to be wrong in that respect; you'll need to change it to match
your situation.)

-- 
 \To me, boxing is like a ballet, except there's no music, no |
  `\choreography, and the dancers hit each other.  -- Jack Handey |
_o__)  |
Ben Finney


Re: unicode question

2006-03-01 Thread Walter Dörwald

Edward Loper wrote:

 Walter Dörwald wrote:
 Edward Loper wrote:

 [...]
 Surely there's a better way than converting back and forth 3 times?  Is
 there a reason that the 'backslashreplace' error mode can't be used 
 with codecs.decode?

   >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 TypeError: don't know how to handle UnicodeDecodeError in error callback

 The backslashreplace error handler is an *error* *handler*, i.e. it 
 gives you a replacement text if an input character can't be encoded. 
 But a backslash character in an 8bit string is no error, so it won't 
 get replaced on decoding.
 
 I'm not sure I follow exactly -- the input string I gave as an example 
 did not contain any backslash characters.  Unless by "backslash 
 character" you mean a character c such that ord(c)>127.  I guess it 
 depends on which class of errors you think the error handler should be 
 handling. :)  The codec system's pretty complex, so I'm willing to
 accept on faith that there may be a good reason to have error handlers 
 only make replacements in the encode direction, and not in the decode 
 direction.

Both directions are completely non-symmetric. On encoding an error can 
only happen when the character is unencodable (e.g. for charmap codecs 
anything outside the set of 256 characters). On decoding an error means 
that the byte stream violates the internal format of the encoding. But a 
0x5c byte (i.e. a backslash) in e.g. a latin-1 byte sequence doesn't 
violate the internal format of the latin-1 encoding (nothing does), so 
the error handler never kicks in.

 What you want is a different codec (try e.g. string-escape or 
 unicode-escape).
 
 This is very close, but unfortunately won't quite work for my purposes, 
 because it also puts backslashes before ' and \\ and maybe a few 
 other characters.  :-/

OK, seems you're stuck with your decode/encode/decode call.

   >>> print "test: '\xff'".encode('string-escape').decode('ascii')
 test: \'\xff\'
 
   >>> print do_what_i_want("test: '\xff'")
 test: '\xff'
 
 I think I'll just have to stick with rolling my own.

Bye,
Walter Dörwald

Re: unicode question

2006-02-27 Thread Walter Dörwald

Edward Loper wrote:

 [...]
 Surely there's a better way than converting back and forth 3 times?  Is
 there a reason that the 'backslashreplace' error mode can't be used with 
 codecs.decode?
 
   >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 TypeError: don't know how to handle UnicodeDecodeError in error callback

The backslashreplace error handler is an *error* *handler*, i.e. it 
gives you a replacement text if an input character can't be encoded. But 
a backslash character in an 8bit string is no error, so it won't get 
replaced on decoding.

What you want is a different codec (try e.g. string-escape or 
unicode-escape).

Bye,
Walter Dörwald


Re: unicode question

2006-02-27 Thread Edward Loper

Walter Dörwald wrote:
 Edward Loper wrote:
 
 [...]
 Surely there's a better way than converting back and forth 3 times?  Is
 there a reason that the 'backslashreplace' error mode can't be used 
 with codecs.decode?

   >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 TypeError: don't know how to handle UnicodeDecodeError in error callback
 
 The backslashreplace error handler is an *error* *handler*, i.e. it 
 gives you a replacement text if an input character can't be encoded. But 
 a backslash character in an 8bit string is no error, so it won't get 
 replaced on decoding.

I'm not sure I follow exactly -- the input string I gave as an example 
did not contain any backslash characters.  Unless by "backslash 
character" you mean a character c such that ord(c)>127.  I guess it 
depends on which class of errors you think the error handler should be 
handling. :)  The codec system's pretty complex, so I'm willing to 
accept on faith that there may be a good reason to have error handlers 
only make replacements in the encode direction, and not in the decode 
direction.

 What you want is a different codec (try e.g. string-escape or 
 unicode-escape).

This is very close, but unfortunately won't quite work for my purposes, 
because it also puts backslashes before ' and \\ and maybe a few 
other characters.  :-/

  >>> print "test: '\xff'".encode('string-escape').decode('ascii')
test: \'\xff\'

  >>> print do_what_i_want("test: '\xff'")
test: '\xff'

I think I'll just have to stick with rolling my own.

-Edward


Re: unicode question

2006-02-25 Thread Tim Roberts

Edward Loper [EMAIL PROTECTED] wrote:

I would like to convert an 8-bit string (i.e., a str) into unicode,
treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff
into backslashed escape sequences.  I.e., I want something like this:

  >>> decode_with_backslashreplace('abc \xff\xe8 def')
u'abc \\xff\\xe8 def'

The best I could come up with was:

   def decode_with_backslashreplace(s):
       "str -> unicode"
       return (s.decode('latin1')
               .encode('ascii', 'backslashreplace')
               .decode('ascii'))

Surely there's a better way than converting back and forth 3 times?

I didn't check whether this was faster, although I rather suspect it is
not:

  # Keep ASCII characters; write anything else as a two-digit \xNN escape.
  # (hex(ord(x)) would give '0xff', producing '\x0xff', hence the %02x.)
  cvt = lambda x: ord(x) < 0x80 and x or '\\x%02x' % ord(x)
  def decode_with_backslashreplace(s):
      return ''.join(map(cvt, s))
-- 
- Tim Roberts, [EMAIL PROTECTED]
  Providenza & Boekelheide, Inc.

Re: unicode question

2006-02-25 Thread Kent Johnson

Edward Loper wrote:
 I would like to convert an 8-bit string (i.e., a str) into unicode,
 treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff
 into backslashed escape sequences.  I.e., I want something like this:
 
   >>> decode_with_backslashreplace('abc \xff\xe8 def')
 u'abc \\xff\\xe8 def'

  >>> s='abc \xff\xe8 def'
  >>> s.encode('string_escape')
'abc \\xff\\xe8 def'
  >>> unicode(s.encode('string_escape'))
u'abc \\xff\\xe8 def'

Kent

unicode question

2006-02-24 Thread Edward Loper

I would like to convert an 8-bit string (i.e., a str) into unicode,
treating chars \x00-\x7f as ascii, and converting any chars \x80-\xff
into backslashed escape sequences.  I.e., I want something like this:

  >>> decode_with_backslashreplace('abc \xff\xe8 def')
u'abc \\xff\\xe8 def'

The best I could come up with was:

   >>> def decode_with_backslashreplace(s):
   ...     "str -> unicode"
   ...     return (s.decode('latin1')
   ...             .encode('ascii', 'backslashreplace')
   ...             .decode('ascii'))

Surely there's a better way than converting back and forth 3 times?  Is
there a reason that the 'backslashreplace' error mode can't be used with 
codecs.decode?

  >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
TypeError: don't know how to handle UnicodeDecodeError in error callback
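
For readers on a current interpreter: Python 3.5 and later do accept 'backslashreplace' when decoding, which was not the case in the Python 2.4 session above. A sketch comparing it with the three-step conversion from this post, rewritten for Python 3 bytes (the variable names are made up):

```python
data = b"abc \xff\xe8 def"

# The three-step conversion from the post, written for Python 3.
three_step = (data.decode("latin-1")
                  .encode("ascii", "backslashreplace")
                  .decode("ascii"))

# Python 3.5 and later accept 'backslashreplace' when decoding as well.
one_step = data.decode("ascii", "backslashreplace")

print(three_step)
print(one_step)
```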

-Edward


Re: Unicode Question

2006-01-09 Thread Erik Max Francis

David Pratt wrote:

 This is not working for me. Can someone explain why. Many thanks.

Because '\xbe' isn't UTF-8 for the character you want, '\xc2\xbe' is, as 
you just showed yourself in the code snippet.

-- 
Erik Max Francis  [EMAIL PROTECTED]  http://www.alcyone.com/max/
San Jose, CA, USA  37 20 N 121 53 W  AIM erikmaxfrancis
   Where are they?
   -- Enrico Fermi, 1901-1954

Unicode Question

2006-01-09 Thread David Pratt

Hi. I am working through some tutorials on unicode and am hoping that 
someone can help explain this for me.  I am on mac platform using python 
2.4.1 at the moment.  I am experimenting with unicode with the 3/4 symbol.

I want to prepare strings for db storage that come from a normal Windows
machine (cp1252), so my understanding is that I should convert them to
unicode and encode to utf-8 to store them properly. Since the data will be
used on the web I would not have to change my encoding when extracting
from the database. This first example I believe simulates this with the
3/4 symbol. Here I want to store '\xc2\xbe' in my database.

  >>> tq = u'\xbe'
  >>> tq_utf = tq.encode('utf8')
  >>> tq, tq_utf
(u'\xbe', '\xc2\xbe')

To unicode with a variable, my understanding is that I can unicode and 
encode at the same time

  >>> tq = '\xbe'
  >>> tq_utf = unicode(tq, 'utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0: 
unexpected code byte

This is not working for me. Can someone explain why. Many thanks.

Regards,
David

Re: Unicode Question

2006-01-09 Thread Max Erickson

The encoding argument to unicode() is used to specify the encoding of the 
string that you want to translate into unicode. The interpreter stores 
unicode as unicode, it isn't encoded...

>>> unicode('\xbe','cp1252')
u'\xbe'
>>> unicode('\xbe','cp1252').encode('utf-8')
'\xc2\xbe'
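
The same two steps in Python 3 terms, as a sketch (the variable names are made up). Decoding names the encoding the bytes are already in, and encoding names the one you want to store:

```python
raw = b"\xbe"                         # the 3/4 sign as stored by a cp1252 Windows machine

text = raw.decode("cp1252")           # bytes -> text: name the encoding the bytes are in
for_db = text.encode("utf-8")         # text -> bytes: name the encoding you want to store

try:
    raw.decode("utf-8")               # a lone 0xbe is a continuation byte, not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(text), for_db, utf8_ok)
```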
 


max


Re: Unicode Question

2006-01-09 Thread David Pratt

Hi Martin. Many thanks for your reply. What I am really after, the 
following accomplishes.
 
 If you are looking for at the same time, perhaps this is also
 interesting:
 
 py unicode('\xbe', 'windows-1252').encode('utf-8')
 '\xc2\xbe'
 

Your answer really helped quite a bit to clarify this for me. I am using 
sqlite3 so it is very happy to have utf-8 encoded unicode.

The examples you provided were the additional help I needed. Thank you.

Regards,
David

Re: Unicode Question

2006-01-09 Thread David Pratt

Hi Erik. Thank you for your reply. The advice has helped clarify this 
for me.

Regards,
David


Re: Unicode Question

2006-01-09 Thread David Pratt

Hi Max. Many thanks for helping to realize where I was missing the point 
and making this clearer.

Regards,
David


Once again a unicode question

2005-03-26 Thread Nicolas Evrard

Hello,
I'm puzzled by this test I made while trying to transform a page in
html to plain text. Because I cannot send unicode to feed, nor str so
how can I do this ?
[EMAIL PROTECTED]:~$ python2.4
.Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19) 
.[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
.Type "help", "copyright", "credits" or "license" for more information.
.>>> import formatter
.>>> import htmllib
.>>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
.>>> html2txt.feed(u'D\xe9but')
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
.    self.goahead(0)
.  File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
.    self.handle_data(rawdata[i:j])
.  File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
.    self.formatter.add_flowing_data(data)
.  File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
.    self.writer.send_flowing_data(data)
.  File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
.    write(word)
.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
.>>> html2txt.feed(u'D\xe9but'.encode('latin1'))
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.    self.rawdata = self.rawdata + data
.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
.>>> html2txt.feed('Début')
.Traceback (most recent call last):
.  File "<stdin>", line 1, in ?
.  File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.    self.rawdata = self.rawdata + data
.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
.>>>

--
(°  Nicolas Évrard
/ )  Liège - Belgique
^^

Re: Once again a unicode question

2005-03-26 Thread Serge Orlov

Nicolas Evrard wrote:
 Hello,

 I'm puzzled by this test I made while trying to transform a page in
 html to plain text. Because I cannot send unicode to feed, nor str so
 how can I do this ?

Seems like the parser is in a broken state after the first exception.
Feed only binary strings to it.
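
For comparison only: htmllib and sgmllib no longer exist in Python 3, and there the advice is reversed, because html.parser.HTMLParser.feed() expects text (str). A sketch under that assumption, with hypothetical names TextExtractor and html_to_text, that decodes the bytes first and uses a fresh parser per document so a failed feed cannot leave stale state behind:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(raw_bytes, encoding):
    parser = TextExtractor()                  # fresh parser for each document
    parser.feed(raw_bytes.decode(encoding))   # decode explicitly, then feed str
    parser.close()
    return "".join(parser.chunks)

page = "<p>D\xe9but</p>".encode("latin-1")
text = html_to_text(page, "latin-1")
print(text)
```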

  Serge.



Re: Once again a unicode question

2005-03-26 Thread Nicolas Evrard

* Serge Orlov  [23:45 26/03/05 CET]: 
 Nicolas Evrard wrote:
  Hello,
  I'm puzzled by this test I made while trying to transform a page in
  html to plain text. Because I cannot send unicode to feed, nor str so
  how can I do this ?

 Seems like the parser is in a broken state after the first exception.
 Feed only binary strings to it.

That was it, thank you very much.
--
(°  Nicolas Évrard
/ )  Liège - Belgique
^^

Re: unicode question

2004-11-29 Thread Bengt Richter

On Tue, 23 Nov 2004 20:37:04 +0100, "Martin v. Löwis" [EMAIL PROTECTED] wrote:

 Steve Holden wrote:
  Am I the only person who found it scary that Bengt could apparently
  casually drop in a polynomial that would decode to "Löwis"?

Well, don't give me too much credit, though I admit enjoying a little
unearned flattered-ego buzz ;-) But it's not a big deal if you had recently
implemented an automatic lambda-printer-outer to solve for a polynomial
function f such that f(0)==k0, f(1)==k1, .. f(n)==kn. For a single number
k0 that will be lambda x: k0 and for two numbers k0, k1 will be
lambda x: k0 + x*(k1-k0) etc. It's a matter of solving some simultaneous
equations for the coefficient values, which I had done in response to a
previous thread. For that, I happened to have had some experience from the
'60s writing variations on an equation solver (back when we congratulated
ourselves on getting all (software-implemented) floating point ops other
than divide to execute in under a millisecond ;-) Here I was using an exact
decimal module I happened to have (also built in response to previous
thread discussion ;-), so I didn't even have to look for maximum abs pivot
elements in the matrix for this one. And it didn't have to be fast.
So it was kind of a fun exercise. But anyway, it was all ready to go at
this point, so all I had to do was run coeffsx.py with the character ord
values as args on the command line. The opportunity to use it in a fun way
to fake casual wizardry was just dumb luck ;-)
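
The idea can be sketched in Python 3 without the coeffsx.py script, which is not shown in the thread. Note the swapped-in technique: instead of solving simultaneous equations for the coefficients, this sketch evaluates the interpolating polynomial directly with the Lagrange formula and exact fractions. The function name lagrange_eval is made up:

```python
from fractions import Fraction

def lagrange_eval(ys, x):
    """Value at x of the polynomial f with f(0)==ys[0], f(1)==ys[1], ..."""
    total = Fraction(0)
    n = len(ys)
    for i, yi in enumerate(ys):
        term = Fraction(yi)
        for j in range(n):
            if j != i:
                # Basis factor: equals 1 at x == i and 0 at x == j.
                term *= Fraction(x - j, i - j)
        total += term
    return total

message = "L\xf6wis"
codes = [ord(c) for c in message]     # the k0, k1, ... values

# Evaluating the polynomial at 0..n-1 reproduces the character codes exactly.
decoded = "".join(chr(int(lagrange_eval(codes, x))) for x in range(len(codes)))
print(decoded)
```

For two values k0, k1 the formula reduces to k0 + x*(k1-k0), the case described above. Exact rational arithmetic plays the role of the exact decimal module: no rounding, so no pivoting worries.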


I'm not scared, but honored, of course.

A bit late responding, but I couldn't think of a clever followup to that ;-)
But Just to play fair,

print ''.join([chr((lambda x: (
-6244372133*x**31 +3013910052086*x**30 -695396351572920*x**29
+102105752307741620*x**28 -10715303804974659632*x**27 
+855734314951919397204*x**26
-54067713339116101354860*x**25 +2774121296568607137441900*x**24
-117725625258165396333623970*x**23 +4187405270602160539007125440*x**22
-126060225187601954901807327900*x**21 +3234908736910295469078183101700*x**20
-71121878980966418114205095297640*x**19 
+1344268902923717571167117226451980*x**18
-21886601404074660751245403749948900*x**17 
+307180698948793841846368910776059300*x**16
-3714719218772170154406066269371644945*x**15 
+38641327091060849304069885597725238090*x**14
-344757809926306996671359721670334393500*x**13 
+2627069115710241704477921121071756668600*x**12
-16998869426095431823754237370045113150352*x**11 
+92697362475995606001274610327169882407584*x**10
-421837211162827653880286870838716820642880*x**9 
+1581695033356657201434736494281105646218880*x**8
-4805817748883837636614530805204695373091328*x**7 
+11572394080794032785251889126742747327087616*x**6
-2141782094441901308037452513456003159040*x**5 
+29141767437911436346798089144038222112768000*x**4
-2718608642882609434610843144764478140416*x**3 
+1533994355659295223664305312404777140224*x**2
-388225373807829537910251710026682204160*x 
+23023948231698183889631576064000)
/274094621805930760590852096000
)(x)) for x in xrange(32)])

Not-ready-to-be-mythologized-though-plenty-flatterable-ly y'rs

Regards,
Bengt Richter
