Re: Encoding problem in python
If you use Arabic frequently on your system, I suggest changing your Windows system locale: open Region and Language in Control Panel (Administrative tab) and set it to Arabic. -- http://mail.python.org/mailman/listinfo/python-list
Re: Encoding problem in python
On 2013-03-04 10:37, yomnasala...@gmail.com wrote:
> I have a problem with encoding in the Python 2.7 shell. When I write this in the Python shell:
>
>     w=u'العربى'
>
> It gives me the following error: "Unsupported characters in input". Any help?

Maybe it is not Python related. Did you get an exception? Can you send a full traceback? I suspect that the error comes from your terminal, not Python. Please make sure that your terminal supports UTF-8 encoding. Alternatively, try creating a file with this content:

    # -*- encoding: UTF-8 -*-
    w = u'العربى'

Save it as a UTF-8 encoded file test.py (with a UTF-8 compatible editor, for example Geany) and run it as a command:

    python test.py

If that works, then the problem is certainly with your terminal. It would be an OS limitation, not Python's limitation.

Best, Laszlo
Re: Encoding problem in python
On Mon, 04 Mar 2013 01:37:42 -0800, yomnasalah91 wrote:
> I have a problem with encoding in the Python 2.7 shell. When I write this in the Python shell:
>
>     w=u'العربى'
>
> It gives me the following error: "Unsupported characters in input". Any help?

Firstly, please show the COMPLETE error, including the full traceback. Python errors look like (for example):

    >>> x = ord(100)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected string of length 1, but int found

Copy and paste the complete traceback.

Secondly, please describe your environment:

- What operating system and version are you using? Linux, Windows, Mac OS, something else? Which version or distro?
- Which console or terminal application? E.g. cmd.exe (Windows), konsole, xterm, something else?
- Which shell? E.g. the standard Python interpreter, IDLE, bpython, something else?

My guess is that this is not a Python problem, but an issue with your console. You should always have your console set to use UTF-8, if possible. I expect that your console is set to use a different encoding. In that case, see if you can change it to UTF-8. For example, using Gnome Terminal on Linux, I can do this:

    >>> w = u'العربى'
    >>> print w
    العربى

and it works fine, but if I change the encoding to WINDOWS-1252 using the "Set character encoding" menu command, the terminal will not allow me to paste the string into the terminal.

-- Steven
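Steven's diagnostic can be checked directly. The sketch below (written in Python 3 spelling, though the thread is about Python 2) prints the encodings Python believes are in play; if sys.stdout.encoding is not UTF-8, the terminal rather than Python is what rejects the input.

```python
import sys

# Which encoding does Python think the console uses?
# (May be None on old versions when output is redirected to a pipe.)
print(sys.stdout.encoding)

# The interpreter-wide default used for implicit conversions:
# 'ascii' on Python 2, always 'utf-8' on Python 3.
print(sys.getdefaultencoding())
```

If the first line prints something like cp1252 or cp720, changing the console configuration, not the Python code, is the fix.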
Re: Encoding problem in python
2013/3/4 yomnasala...@gmail.com:
> I have a problem with encoding in the Python 2.7 shell. When I write this in the Python shell:
>
>     w=u'العربى'
>
> It gives me the following error: "Unsupported characters in input". Any help?

Hi, I guess you are using the built-in IDLE shell with Python 2.7, and this is a specific limitation of its handling of some unicode characters (in some builds and OSes - narrow-unicode builds on Windows, most likely?) and its specific error message - not the usual Python traceback mentioned in the other posts.

If it is viable, using Python 3.3 instead would solve this problem for IDLE:

    Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> w='العربى'
    >>> w
    'العربى'

(Note the missing u before the starting quotation mark of the literal, which is the usual usage in Python 3; Python 3.3 also silently accepts u'...' for compatibility.)

    >>> w=u'العربى'
    >>> w
    'العربى'

If Python 2.7 is required, another shell is probably needed (unless I am missing some option to make IDLE work for this input); e.g. the following works in PyShell - part of the wxPython GUI library http://www.wxpython.org/

    >>> w=u'العربى'
    >>> w
    u'\u0627\u0644\u0639\u0631\u0628\u0649'
    >>> print w
    العربى

hth, vbr
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:
> Here is the final code for those who are struggling with similar problems:
>
>     ## open and decode file
>     # In this case, the encoding comes from the charset argument in a meta tag
>     # e.g. <meta charset="iso-8859-2">
>     fileObj = open(filePath, "r").read()
>     fileContent = fileObj.decode("iso-8859-2")
>     fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually undesirable; Beautiful Soup should be doing the decoding itself. If you actually know the encoding (e.g. from a Content-Type header), you can specify it via the fromEncoding parameter to the BeautifulSoup constructor, e.g.:

    fileSoup = BeautifulSoup(fileObj, fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if one is present, or a Unicode BOM, or using the chardet library if available, or using built-in heuristics, before finally falling back to Windows-1252 (which seems to be the preferred encoding of people who don't understand what an encoding is or why it needs to be specified).
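The danger of that Windows-1252 fallback is easy to demonstrate with the standard library alone (no BeautifulSoup needed): the same bytes decoded with the declared encoding versus the fallback give different text, and the wrong decoding fails silently rather than raising an error.

```python
# 'zakładników' encoded as ISO-8859-2, as a page declaring
# <meta charset="iso-8859-2"> would serve it.
raw = u'zak\u0142adnik\u00f3w'.encode('iso-8859-2')

right = raw.decode('iso-8859-2')    # the declared encoding
wrong = raw.decode('windows-1252')  # the fallback: no error, wrong text

print(right)  # zakładników
print(wrong)  # zak³adników  (mojibake)
```

Every byte happens to be valid Windows-1252, so nothing raises; only a human looking at the output notices the damage. That is why guessing encodings should be a last resort.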
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On 06.10.2011 05:40, Steven D'Aprano wrote:
> (4) Do all your processing in Unicode, not bytes.
> (5) Encode the text into bytes using UTF-8 encoding.
> (6) Write the bytes to a file.

Just wondering, why do you split the latter two parts? I would have used codecs.open() to open the file and define the encoding in a single step. Is there a downside to this approach?

Otherwise, I can only confirm that your overall approach is the easiest way to get correct results.

Uli
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thu, Oct 6, 2011 at 8:29 PM, Ulrich Eckhardt ulrich.eckha...@dominalaser.com wrote:
> Just wondering, why do you split the latter two parts? I would have used codecs.open() to open the file and define the encoding in a single step. Is there a downside to this approach?

Those two steps still happen, even if you achieve them in a single function call. What Steven described is language- and library-independent.

ChrisA
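Chris's point — that the single call still performs both steps — can be verified by comparing the two approaches byte for byte. A sketch (the file names are hypothetical; io.open is the Python 2/3 spelling, and is simply the built-in open on Python 3):

```python
import io
import os
import tempfile

text = u'Branie zak\u0142adnik\u00f3w'
d = tempfile.mkdtemp()
one_step = os.path.join(d, 'one.txt')
two_step = os.path.join(d, 'two.txt')

# One step: the file object encodes on write.
with io.open(one_step, 'w', encoding='utf-8') as f:
    f.write(text)

# Two steps: encode explicitly, then write the bytes.
with open(two_step, 'wb') as f:
    f.write(text.encode('utf-8'))

# The same bytes land on disk either way.
with open(one_step, 'rb') as f1, open(two_step, 'rb') as f2:
    assert f1.read() == f2.read()
```

The single-step form is less to type and harder to get wrong, which is a fine reason to prefer it; the conceptual pipeline is identical.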
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On 6 Oct, 06:39, Greg gregor.hochsch...@googlemail.com wrote:
> Brilliant! It worked. Thanks! Here is the final code for those who are struggling with similar problems:
>
>     ## open and decode file
>     # In this case, the encoding comes from the charset argument in a meta tag
>     # e.g. <meta charset="iso-8859-2">
>     fileObj = open(filePath, "r").read()
>     fileContent = fileObj.decode("iso-8859-2")
>     fileSoup = BeautifulSoup(fileContent)
>
>     ## Do some BeautifulSoup magic and preserve unicode; presume the result is saved in 'text' ##
>
>     ## write extracted text to file
>     f = open(outFilePath, 'w')
>     f.write(text.encode('utf-8'))
>     f.close()

or (Python 2/Python 3):

    >>> import io
    >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
    ...     r = f.read()
    ...
    >>> repr(r)
    u'a\nb\nc\n'
    >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
    ...     t = f.write(r)
    ...
    >>> f.closed
    True

jmf
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thursday 2011 October 06 10:41, jmfauth wrote:
> or (Python 2/Python 3):
>
>     >>> import io
>     >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
>     ...     r = f.read()
>     [...]

What is this io of which you speak?

-- I have seen the future and I am not in it.
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
In mailman.1785.1317928997.27778.python-l...@python.org, xDog Walker thud...@gmail.com writes:
> What is this io of which you speak?

It was introduced in Python 2.6.

-- John Gordon        A is for Amy, who fell down the stairs
   gor...@panix.com   B is for Basil, assaulted by bears
                      -- Edward Gorey, "The Gashlycrumb Tinies"
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:
> Hi, I am having some encoding problems when I first parse stuff from a non-English website using BeautifulSoup and then write the results to a txt file.

If you haven't already read this, you should do so: http://www.joelonsoftware.com/articles/Unicode.html

> I have the text both as a normal (text) and as a unicode string (utext):
>
>     >>> print repr(text)
>     'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the text and what it actually is. You're showing us a bunch of bytes, with no clue as to whether they are the right bytes or not. Considering that your Unicode text is also incorrect, I would say it is *not* right, and your description of the problem is 100% backwards: the problem is not *writing* the text, but *reading* the bytes and decoding them.

You should do something like this:

(1) Inspect the web page to find out what encoding is actually used.

(2) If the web page doesn't know what encoding it uses, or if it uses bits and pieces of different encodings, then the source is broken and you shouldn't expect much better results. You could try guessing, but you should expect mojibake in your results. http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.

[...]

> Now I am trying to save this to a file but I never get the encoding right. Here is what I tried (+ lots of different things with encode, decode...):
>
>     outFile = codecs.open(filePath, "w", "UTF8")
>     outFile.write(utext)
>     outFile.close()

That's the correct approach, but it won't help you if utext contains the wrong characters in the first place. The critical step is taking the bytes in the web page and turning them into text. How are you generating utext?

-- Steven
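Steps (3) through (6) above can be sketched in a few lines, with a byte string standing in for the fetched page (hypothetical data; the real encoding must come out of step (1)):

```python
# Steps (1)/(2) would have established the page encoding; here we
# assume ISO-8859-2, which matches the Polish text in the thread.
page_bytes = b'Branie zak\xb3adnik\xf3w'

# (3) decode the page into Unicode text with the correct encoding
text = page_bytes.decode('iso-8859-2')

# (4) process as text, never as bytes
text = text.title()

# (5) encode to UTF-8 ... (6) and these are the bytes you write out
out_bytes = text.encode('utf-8')
print(text)
```

The symmetry is the whole trick: exactly one decode on the way in, exactly one encode on the way out, and nothing but real text in between.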
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
Brilliant! It worked. Thanks! Here is the final code for those who are struggling with similar problems:

    ## open and decode file
    # In this case, the encoding comes from the charset argument in a meta tag
    # e.g. <meta charset="iso-8859-2">
    fileObj = open(filePath, "r").read()
    fileContent = fileObj.decode("iso-8859-2")
    fileSoup = BeautifulSoup(fileContent)

    ## Do some BeautifulSoup magic and preserve unicode; presume the result is saved in 'text' ##

    ## write extracted text to file
    f = open(outFilePath, 'w')
    f.write(text.encode('utf-8'))
    f.close()

On Oct 5, 11:40 pm, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
> [...]
> That's the correct approach, but it won't help you if utext contains the wrong characters in the first place. The critical step is taking the bytes in the web page and turning them into text. How are you generating utext?
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thu, Oct 6, 2011 at 3:39 PM, Greg gregor.hochsch...@googlemail.com wrote:
> Brilliant! It worked. Thanks! Here is the final code for those who are struggling with similar problems:
>
>     ## open and decode file
>     # In this case, the encoding comes from the charset argument in a meta tag
>     # e.g. <meta charset="iso-8859-2">
>     fileContent = fileObj.decode("iso-8859-2")
>     f.write(text.encode('utf-8'))

In other words, when you decode correctly into Unicode and encode correctly onto the disk, it works! This is why encodings are so important :)

ChrisA
Re: Encoding problem when launching Python27 via DOS
Thanks a lot for this quick answer! It is very clear! To better understand what the difference between encoding and decoding is, I found the following website: http://www.evanjones.ca/python-utf8.html

I changed the program to the following (no more cp1252; the source file is saved directly as UTF-8):

    # -*- coding: utf8 -*-
    #!/usr/bin/python
    '''
    Created on 27 déc. 2010

    @author: jpmena
    '''
    from datetime import datetime
    import locale
    import codecs
    import os, sys

    class Log(object):
        log = None

        def __init__(self, log_path):
            self.log_path = log_path
            if os.path.exists(self.log_path):
                os.remove(self.log_path)
            #self.log = open(self.log_path, 'a')
            self.log = codecs.open(self.log_path, 'a', 'utf-8')

        def getInstance(log_path=None):
            print "encodage systeme:" + sys.getdefaultencoding()
            if Log.log is None:
                if log_path is None:
                    log_path = os.path.join(os.getcwd(), 'logParDefaut.log')
                Log.log = Log(log_path)
            return Log.log

        getInstance = staticmethod(getInstance)

        def p(self, msg):
            aujour_dhui = datetime.now()
            date_stamp = aujour_dhui.strftime("%d/%m/%y-%H:%M:%S")
            print sys.getdefaultencoding()
            unicode_str = '%s : %s \n' % (date_stamp, unicode(msg, 'utf-8'))
            #unicode_str = msg
            self.log.write(unicode_str)
            return unicode_str

        def close(self):
            self.log.flush()
            self.log.close()
            return self.log_path

    if __name__ == '__main__':
        l = Log.getInstance()
        l.p("premier message de Log à accents")
        Log.getInstance().p("second message de Log")
        l.close()

The DOS console output is now:

    C:\Documents and Settings\jpmena\Mes documents\VelocityRIF\VelocityTransforms>generationProgrammeSitePublicActuel.cmd
    Page de codes active : 1252
    encodage systeme:ascii
    ascii
    encodage systeme:ascii
    ascii

And the generated log file now shows the expected result:

    11/04/11-10:53:44 : premier message de Log à accents
    11/04/11-10:53:44 : second message de Log

Thanks.
If you have other links of interest about unicode encoding and decoding in Python, they are welcome.

2011/4/10 MRAB pyt...@mrabarnett.plus.com:
> [...] It's not possible to encode a bytestring, only a Unicode string, so Python tries to decode the bytestring using the fallback encoding (ASCII) and then encode the result. Unfortunately, the bytestring isn't ASCII (it contains accented characters), so it can't be decoded as ASCII, hence the exception.
>
> BTW, it's probably better to forget about cp1252, etc, and use UTF-8 instead, and also to use Unicode wherever possible.
Re: Encoding problem when launching Python27 via DOS
On 10/04/2011 13:22, Jean-Pierre M wrote:
> I created a simple program which writes some French text with accents into a unicode file!
[snip]

This line:

    l.p("premier message de Log à accents")

passes a bytestring to the method, and inside the method, this line:

    unicode_str = u'%s : %s \n' % (date_stamp, msg.encode(self.charset_log, 'replace'))

tries to encode the bytestring to Unicode. It's not possible to encode a bytestring, only a Unicode string, so Python tries to decode the bytestring using the fallback encoding (ASCII) and then encode the result. Unfortunately, the bytestring isn't ASCII (it contains accented characters), so it can't be decoded as ASCII, hence the exception.

BTW, it's probably better to forget about cp1252, etc, and use UTF-8 instead, and also to use Unicode wherever possible.
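MRAB's diagnosis — an implicit ASCII *decode* hiding inside what looks like an *encode* — is the classic Python 2 trap. In Python 3 the same mistake cannot be written, because bytes and text are distinct types; the underlying decode error can still be reproduced explicitly (a sketch, not the original program):

```python
# The accented text from the thread, as UTF-8 bytes.
raw = u'premier message de Log \u00e0 accents'.encode('utf-8')

# What Python 2 attempted behind the scenes: decoding those bytes as ASCII.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # the first byte of the UTF-8 'à' is not ASCII

# Decoding explicitly with the real encoding works.
assert raw.decode('utf-8') == u'premier message de Log \u00e0 accents'
```

Making the decode explicit, with an explicit encoding, is exactly the fix Jean-Pierre applied with codecs.open and unicode(msg, 'utf-8').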
Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??
Ian Hobson i...@ianhobson.co.uk writes:
> Hi all, I have hit a problem and I don't know enough about Python to diagnose things further. Trying to use CouchDB from Python. This script:
>
>     # coding=utf8
>     import couchdb
>     from couchdb.client import Server
>
>     server = Server()
>     dbName = 'python-tests'
>     try:
>         db = server.create(dbName)
>     except couchdb.PreconditionFailed:
>         del server[dbName]
>         db = server.create(dbName)
>     doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
>
> gives this traceback:
>
>     D:\work\C-U-B>python tes1.py
>     Traceback (most recent call last):
>       File "tes1.py", line 11, in <module>
>         doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
>       File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py", line 407, in save
>         _, _, data = func(body=doc, **options)
>       File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 399, in post_json
>         status, headers, data = self.post(*a, **k)
>       File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 381, in post
>         **params)
>       File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 419, in _request
>         credentials=self.credentials)
>       File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 310, in request
>         raise ServerError((status, error))
>     couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))
>
> Why? I've tried adding u to the strings, and removing the # coding line, and I still get the same error.

Sounds cargo-cultish. I suggest you read the Python introduction on unicode: http://docs.python.org/howto/unicode.html

For your actual problem, I have difficulty seeing how it can happen with the above data - frankly, because there is nothing outside the ASCII range in the data, so there is no reason why anything could be wrongly encoded. But googling the error message reveals that there seem to be totally unrelated reasons for this: http://sindro.me/2010/4/3/couchdb-invalid-utf8-json

Maybe using something like tcpmon or ethereal to capture the actual HTTP request helps to see where the issue comes from.

Diez
Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??
Thanks Diez,

Removing, rebooting and installing the latest version solved the problem. :)

Your google-foo is better than mine. Google had not turned that up for me.

Thanks again

Regards

Ian

On 20/09/2010 17:00, Diez B. Roggisch wrote:
> For your actual problem, I have difficulty seeing how it can happen with the above data - frankly, because there is nothing outside the ASCII range in the data, so there is no reason why anything could be wrongly encoded.

I came to the same conclusion.

> But googling the error message reveals that there seem to be totally unrelated reasons for this: http://sindro.me/2010/4/3/couchdb-invalid-utf8-json
> [...]
Re: encoding problem
netpork todorovic.de...@gmail.com (n) wrote:
n> Hello, I have an ssl socket with server and client; on my development machine everything works pretty well.
n> The database which I have to use is MSSQL on MS Server 2003, so I decided to install the same Python config there and run my Python server script.
n> Now here is the problem: the server is returning strange characters although the default encoding is the same on both the development and server machines.
n> Any hints?

Yes, read http://catb.org/esr/faqs/smart-questions.html

-- Piet van Oostrum p...@cs.uu.nl
   URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
   Private email: p...@vanoostrum.org
Re: encoding problem
It was a problem with pymssql, which does not support unicode; switched to pyodbc, and everything is fine. Thanks for your swift reply. ;)

On Jun 27, 7:44 pm, Piet van Oostrum p...@cs.uu.nl wrote:
> netpork todorovic.de...@gmail.com (n) wrote:
> n> Now here is the problem: the server is returning strange characters although the default encoding is the same on both machines.
> [...]
> Yes, read http://catb.org/esr/faqs/smart-questions.html
Re: encoding problem
On Fri, 19 Dec 2008 16:50:39 -0700, Joe Strout wrote:
> Marc 'BlackJack' Rintsch wrote:
>> And does REALbasic really use byte strings plus an encoding!?
>
> You betcha! Works like a dream.

IMHO a strange design decision.

> I get that you don't grok it, but I think that's because you haven't worked with it. RB added encoding data to its strings years ago, and changed the default string encoding to UTF-8 at about the same time, and life has been delightful since then. The only time you ever have to think about it is when you're importing a string from some unknown source (e.g. a socket), at which point you need to tell RB what encoding it is. From that point on, you can pass that string around, extract substrings, split it into words, concatenate it with other strings, etc., and it all Just Works (tm).

Except that you don't know for sure what the output encoding will be, as it depends on the operations on the strings in the program flow. So to be sure, you have to encode or recode at output too. And then it is the same as in Python -- decode when bytes enter the program and encode when (unicode) strings leave the program.

> In comparison, Python requires a lot more thought on the part of the programmer to keep track of what's what (unless, as you point out, you convert everything into unicode strings as soon as you get them, but that can be a very expensive operation to do on, say, a 500MB UTF-8 text file).

So it doesn't require more thought. Unless you complicate it yourself, but that is language independent. I would not do operations on 500 MiB of text in any language if there is any way to break that down into smaller chunks. Slurping in large files doesn't scale very well. On my Eee-PC even a 500 MiB byte `str` is (too) expensive.

> But saying that having only one string type that knows it's Unicode, and another string type that hasn't the foggiest clue how to interpret its data as text, is somehow easier than every string knowing what it is and doing the right thing -- well, that's just silly.

Sorry, I meant the implementation, not the POV of the programmer, which seems to be quite the same.

Ciao, Marc 'BlackJack' Rintsch
Re: encoding problem
digisat...@gmail.com wrote:
> The below snippet of code generates a UnicodeDecodeError:
>
>     #!/usr/bin/env python
>     # -*- coding: utf-8 -*-
>     s = 'äöü'
>     u = unicode(s)
>
> It seems that the system uses the default encoding - ASCII - to decode the UTF-8 encoded string literal, and thus generates the error.

Indeed. You want:

    u = unicode(s, 'utf-8')
    # or:
    u = s.decode('utf-8')

> The question is why the Python interpreter uses the default encoding instead of utf-8, which I explicitly declared in the source.

Because there's no reliable way for the interpreter to guess how what's passed to unicode() has been encoded:

    s = s.decode("utf-8").encode("latin1")
    # should unicode() try to use utf-8 here?
    try:
        u = unicode(s)
    except UnicodeDecodeError:
        print "would have worked better with u = unicode(s, 'latin1')"

NB: IIRC, the ascii subset is safe whatever the encoding, so I'd say it's a sensible default...
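Bruno's answer in runnable form (Python 3 spelling, where bytes.decode replaces unicode()): the bytes themselves carry no record of their encoding, so two different decodings can both "succeed" with different results, and only the programmer knows which one is right.

```python
s = u'\u00e4\u00f6\u00fc'.encode('utf-8')  # 'äöü' as six UTF-8 bytes

u1 = s.decode('utf-8')   # the correct three characters
u2 = s.decode('latin1')  # no error raised, but six wrong characters

print(u1)  # äöü
print(u2)  # Ã¤Ã¶Ã¼
assert u1 != u2
```

Since the interpreter cannot tell u1 from u2 on its own, refusing to guess and falling back to the lowest common denominator (ASCII) is the conservative choice Bruno describes.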
Re: encoding problem
On Fri, 19 Dec 2008 04:05:12 -0800, digisat...@gmail.com wrote:
> The below snippet of code generates a UnicodeDecodeError:
>
>     #!/usr/bin/env python
>     # -*- coding: utf-8 -*-
>     s = 'äöü'
>     u = unicode(s)
>
> It seems that the system uses the default encoding - ASCII - to decode the UTF-8 encoded string literal, and thus generates the error. The question is why the Python interpreter uses the default encoding instead of utf-8, which I explicitly declared in the source.

Because the declaration is only for decoding unicode literals in that very source file.

Ciao, Marc 'BlackJack' Rintsch
Re: encoding problem
Marc 'BlackJack' Rintsch wrote:
>> The question is why the Python interpreter uses the default encoding instead of utf-8, which I explicitly declared in the source.
>
> Because the declaration is only for decoding unicode literals in that very source file.

And because strings in Python, unlike in (say) REALbasic, do not know their encoding -- they're just a string of bytes. If they were a string of bytes PLUS an encoding, then every string would know what it is, and things like conversion to another encoding, or concatenation of two strings that may differ in encoding, could be handled automatically.

I consider this one of the great shortcomings of Python, but it's mostly just a temporary inconvenience -- the world is moving to Unicode, and with Python 3, we won't have to worry about it so much.

Best, - Joe
Re: encoding problem
On Dec 19, 9:34 pm, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:
> On Fri, 19 Dec 2008 04:05:12 -0800, digisat...@gmail.com wrote:
>> [...] The question is why the Python interpreter uses the default encoding instead of utf-8, which I explicitly declared in the source.
>
> Because the declaration is only for decoding unicode literals in that very source file.

Thanks for the answer. I believe the declaration is not only for unicode literals; it is for all literals in the source, even including comments. We can try running a source file without an encoding declaration that has only one line of comments with non-ASCII characters. That will raise a SyntaxError and point to the PEP 263 URL.

I read PEP 263, and quote:

    Python's tokenizer/compiler combo will need to be updated to work as follows:

    1. read the file
    2. decode it into Unicode assuming a fixed per-file encoding
    3. convert it into a UTF-8 byte string
    4. tokenize the UTF-8 content
    5. compile it, creating Unicode objects from the given Unicode data
       and creating string objects from the Unicode literal data by first
       reencoding the UTF-8 data into 8-bit string data using the given
       file encoding

The internal process described above indicates that step 2 uses the specified encoding to decode all literals in the source, while step 5 involves re-encoding with the specified encoding. That is the reason why we have to explicitly declare an encoding whenever we have non-ASCII in the source.

Bruno answered why we need to specify an encoding when decoding a byte string with a perfect explanation. Thank you very much.
Re: encoding problem
On Fri, 19 Dec 2008 08:20:07 -0700, Joe Strout wrote:
> Marc 'BlackJack' Rintsch wrote:
>>> The question is why the Python interpreter uses the default encoding instead of utf-8, which I explicitly declared in the source.
>>
>> Because the declaration is only for decoding unicode literals in that very source file.
>
> And because strings in Python, unlike in (say) REALbasic, do not know their encoding -- they're just a string of bytes. If they were a string of bytes PLUS an encoding, then every string would know what it is, and things like conversion to another encoding, or concatenation of two strings that may differ in encoding, could be handled automatically. I consider this one of the great shortcomings of Python, but it's mostly just a temporary inconvenience -- the world is moving to Unicode, and with Python 3, we won't have to worry about it so much.

I don't see the shortcoming in Python 3.0. If you want real strings with characters instead of just a bunch of bytes, simply use `unicode` objects instead of `str`.

And does REALbasic really use byte strings plus an encoding!? Sounds strange. When concatenating, which encoding wins?

Ciao, Marc 'BlackJack' Rintsch
Re: encoding problem
Marc 'BlackJack' Rintsch wrote:
>> And because strings in Python, unlike in (say) REALbasic, do not know their encoding [...] I consider this one of the great shortcomings of Python, but it's mostly just a temporary inconvenience -- the world is moving to Unicode, and with Python 3, we won't have to worry about it so much.
>
> I don't see the shortcoming in Python 3.0. If you want real strings with characters instead of just a bunch of bytes, simply use `unicode` objects instead of `str`.

Fair enough -- that certainly is the best policy. But when working with any other encoding (sometimes necessary when interfacing with other software), it's still a bit of a PITA.

> And does REALbasic really use byte strings plus an encoding!?

You betcha! Works like a dream.

> Sounds strange. When concatenating, which encoding wins?

The one that is a superset of the other, or if neither is, then both are converted to UTF-8 (which is the standard encoding in RB, though it works comfily with any other too).

Cheers, - Joe
Re: encoding problem
On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote: Marc 'BlackJack' Rintsch wrote: And because strings in Python, unlike in (say) REALbasic, do not know their encoding -- they're just a string of bytes. If they were a string of bytes PLUS an encoding, then every string would know what it is, and things like conversion to another encoding, or concatenation of two strings that may differ in encoding, could be handled automatically. I consider this one of the great shortcomings of Python, but it's mostly just a temporary inconvenience -- the world is moving to Unicode, and with Python 3, we won't have to worry about it so much. I don't see the shortcoming in Python 3.0. If you want real strings with characters instead of just a bunch of bytes simply use `unicode` objects instead of `str`. Fair enough -- that certainly is the best policy. But working with any other encoding (sometimes necessary when interfacing with any other software), it's still a bit of a PITA. But it has to be. There is no automagic guessing possible. And does REALbasic really use byte strings plus an encoding!? You betcha! Works like a dream. IMHO a strange design decision. A lot more hassle compared to an opaque unicode string type which uses some internal encoding that makes operations like getting a character at a given index easy or concatenating without the need to reencode. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problem
On Dec 20, 10:02 am, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote: On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote: Marc 'BlackJack' Rintsch wrote: And because strings in Python, unlike in (say) REALbasic, do not know their encoding -- they're just a string of bytes. If they were a string of bytes PLUS an encoding, then every string would know what it is, and things like conversion to another encoding, or concatenation of two strings that may differ in encoding, could be handled automatically. I consider this one of the great shortcomings of Python, but it's mostly just a temporary inconvenience -- the world is moving to Unicode, and with Python 3, we won't have to worry about it so much. I don't see the shortcoming in Python 3.0. If you want real strings with characters instead of just a bunch of bytes simply use `unicode` objects instead of `str`. Fair enough -- that certainly is the best policy. But working with any other encoding (sometimes necessary when interfacing with any other software), it's still a bit of a PITA. But it has to be. There is no automagic guessing possible. And does REALbasic really use byte strings plus an encoding!? You betcha! Works like a dream. IMHO a strange design decision. A lot more hassle compared to an opaque unicode string type which uses some internal encoding that makes operations like getting a character at a given index easy or concatenating without the need to reencode. In general I quite agree with you ... however with Unicode getting a character at a given index is fine unless and until you stray (or are dragged!) outside the BMP and you have only a 16-bit Unicode implementation. -- http://mail.python.org/mailman/listinfo/python-list
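CPython's own history later bore this caveat out: narrow (16-bit) builds reported astral characters as two code units, until PEP 393 (Python 3.3) made indexing exact again. A small sketch of the BMP boundary:

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
clef = u"\U0001D11E"
# On Python 3.3+ (PEP 393) it is a single character, so indexing is exact...
print(len(clef))                      # 1
# ...but any 16-bit representation must spend a surrogate pair on it:
print(len(clef.encode("utf-16-le")))  # 4 bytes = two 16-bit code units
```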
Re: encoding problem
Marc 'BlackJack' Rintsch wrote: I don't see the shortcoming in Python 3.0. If you want real strings with characters instead of just a bunch of bytes simply use `unicode` objects instead of `str`. Fair enough -- that certainly is the best policy. But working with any other encoding (sometimes necessary when interfacing with any other software), it's still a bit of a PITA. But it has to be. There is no automagic guessing possible. Automagic guessing isn't possible if strings keep track of what encoding their data is. And why shouldn't they? We're a long way from the day when a string was nothing more than an array of bytes. Adding a teeny bit of metadata makes life much easier. And does REALbasic really use byte strings plus an encoding!? You betcha! Works like a dream. IMHO a strange design decision. I get that you don't grok it, but I think that's because you haven't worked with it. RB added encoding data to its strings years ago, and changed the default string encoding to UTF-8 at about the same time, and life has been delightful since then. The only time you ever have to think about it is when you're importing a string from some unknown source (e.g. a socket), at which point you need to tell RB what encoding it is. From that point on, you can pass that string around, extract substrings, split it into words, concatenate it with other strings, etc., and it all Just Works (tm). In comparison, Python requires a lot more thought on the part of the programmer to keep track of what's what (unless, as you point out, you convert everything into unicode strings as soon as you get them, but that can be a very expensive operation to do on, say, a 500MB UTF-8 text file). A lot more hassle compared to an opaque unicode string type which uses some internal encoding that makes operations like getting a character at a given index easy or concatenating without the need to reencode. No. 
RB supports UCS-2 encoding, too, and is smart enough to take advantage of the fixed character width of any encoding when that's what a string happens to be. And no reencoding is used when it's not necessary (e.g., concatenating two strings of the same encoding, or adding an ASCII string to a string using any ASCII superset, such as UTF-8). There's nothing stopping you from converting all your strings to UCS-2 when you get them, if that's your preference. But saying that having only one string type that knows it's Unicode, and another string type that hasn't the foggiest clue how to interpret its data as text, is somehow easier than every string knowing what it is and doing the right thing -- well, that's just silly. Best, - Joe -- http://mail.python.org/mailman/listinfo/python-list
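The 500MB objection has a stdlib answer, though: `io.open` decodes incrementally, so a large UTF-8 file never needs one expensive up-front conversion pass. A hedged sketch (the file and its contents are invented for illustration):

```python
import io
import os
import tempfile

# io.open decodes incrementally: each line of a big UTF-8 file is turned
# into unicode as it is read, never the whole 500MB at once.
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\xe9l\xe8ve\n" * 3)   # tiny stand-in for a huge file

lines = []
with io.open(path, "r", encoding="utf-8") as f:
    for line in f:                   # decoded lazily, line by line
        lines.append(line.strip())
print(lines)
```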
Re: encoding problem
On May 16, 3:31 pm, Luis Zarrabeitia [EMAIL PROTECTED] wrote: Hi, guys. I'm trying to read an xml file and output some of the nodes. For that, I'm doing a print node.toprettyxml() However, I get this exception: === out.write(tag.toxml()) UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position 190: ordinal not in range(128) === That happens if I print it, or send it to stdout, or send it to a file. How can I fix it? cat file works perfectly, and I'm using a UTF-8 terminal. I'm particularly puzzled that it won't work even if I write to a file opened in b mode. Worst thing is... I don't really need that character, just a general idea of what the document looks like. -- Luis Zarrabeitia (aka Kyrie) Fac. de Matemática y Computación, UH. http://profesores.matcom.uh.cu/~kyrie I recommend studying up on Python's Unicode methods and the codecs module. This site actually talks about your specific issue though and gives pointers: http://evanjones.ca/python-utf8.html HTH Mike -- http://mail.python.org/mailman/listinfo/python-list
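The fix the linked article describes boils down to encoding explicitly at the output boundary; opening the file with an explicit encoding (`io.open` in Python 2.6+, plain `open` in Python 3) does that automatically. A sketch, using the U+00BA character from the traceback and a throwaway temp path:

```python
import io
import os
import tempfile

# The character from the traceback: U+00BA (masculine ordinal indicator).
node_text = u"n\xbao 190"
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# Opening the file with an explicit encoding makes write() encode for us,
# instead of falling back to the ascii codec and failing.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(node_text)

with io.open(path, "r", encoding="utf-8") as f:
    roundtrip = f.read()
print(roundtrip)
```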
Re: Encoding problem with web application (Paste+Mako)
[EMAIL PROTECTED] wrote: Hi I have a problem with encoding non-ascii characters in a web application. The application uses Paste and Mako. The code is here: http://www.webudkast.dk/demo.txt The main points are: After getting some user generated input using paste.request.parse_formvars, how should this be correctly saved to a file? How should this afterward be read from the file, and fed correctly into a Mako template? You have to know the encoding of user input and then you can use ``input_encoding`` and ``output_encoding`` parameters of ``Template``. Mako internally handles everything as Python unicode objects. For example: t = Template(filename="templ.mako", input_encoding="iso-8859-2", output_encoding="iso-8859-2") content = t.render(**context) -- HTH, Rob -- http://mail.python.org/mailman/listinfo/python-list
Re: Encoding problem with web application (Paste+Mako)
Rob Wolfe wrote: You have to know the encoding of user input and then you can use ``input_encoding`` and ``output_encoding`` parameters of ``Template``. Mako internally handles everything as Python unicode objects. For example: t = Template(filename="templ.mako", input_encoding="iso-8859-2", output_encoding="iso-8859-2") content = t.render(**context) -- HTH, Rob Thanks Rob Using: t = Template(content, input_encoding="utf-8", output_encoding="utf-8") did the trick. Thanks for the help. /Martin -- http://mail.python.org/mailman/listinfo/python-list
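The pattern Mako automates here (decode at input, work in unicode, encode at output) can be sketched with the stdlib alone; `string.Template` stands in for Mako below, and the iso-8859-2 bytes are invented for illustration:

```python
import string

# Decode at the boundary, work in unicode, encode on the way out -- the
# pattern Mako's input_encoding/output_encoding parameters automate.
# string.Template stands in for Mako; the bytes are invented iso-8859-2.
raw = u"Cze\u015b\u0107, ${name}!".encode("iso-8859-2")  # "file on disk"
template = string.Template(raw.decode("iso-8859-2"))     # unicode inside
rendered = template.substitute(name=u"Marta")
output = rendered.encode("iso-8859-2")                   # bytes going out
print(output.decode("iso-8859-2"))
```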
Re: encoding problem
Yves Glodt wrote: It seems in general I have trouble with special characters... What is the python way to deal with éàè öäü etc... print 'é' fails here, print u'é' as well :-( How am I supposed to print non-ascii characters the correct way? The second form should be used, but not in interactive mode. In a Python script, make sure you properly declare the encoding of your script, e.g. # -*- coding: iso-8859-1 -*- print u'é' That should work. If not, give us your Python version, operating system name, and mode of operation. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
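Martin's advice can be checked end to end: write a script whose bytes genuinely are iso-8859-1, declare the coding, and run it in a subprocess. A sketch (PYTHONIOENCODING is pinned so the child's stdout does not depend on the terminal, which is exactly the variable under discussion; the temp path is throwaway):

```python
import os
import subprocess
import sys
import tempfile

# A script whose source bytes are genuinely iso-8859-1 (0xE9 = é),
# with the coding declaration Martin describes.
src = b"# -*- coding: iso-8859-1 -*-\nprint(u'\xe9')\n"
path = os.path.join(tempfile.mkdtemp(), "test_enc.py")
with open(path, "wb") as f:
    f.write(src)

# Pin the child's stdout encoding so the result is terminal-independent.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
out = subprocess.check_output([sys.executable, path], env=env)
print(out.decode("utf-8"))
```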
Re: encoding problem
Yves Glodt wrote: It seems in general I have trouble with special characters... What is the python way to deal with éàè öäü etc... print 'é' fails here, This should probably stay true. print u'é' as well :-( This is an issue with how your output is connected. What OS, what code page, what application? I'm using Win2K, Python 2.4.2 Using Idle, I can do: print u'élève' And get what I expect I can also do: print repr(u'élève') which gives me: u'\xe9l\xe8ve' and: print u'\xe9l\xe8ve' Also shows me élève. With cmd.exe (the command line): c:\> python print u'\xe9l\xe8ve' shows me élève, but I can't type the accented characters in directly: print u'lve' is what I get when I paste in print u'élève' (it beeps during the paste). What do you get if you put in: print repr('élève') --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problem
Sebastjan Trepca wrote: I think you are trying to concatenate a unicode string with a regular one, so when it tries to convert the regular string to unicode with the ASCII (default) encoding it fails. First find out which of these strings is regular and how it was encoded, then you can decode it like this (if the regular string is diff): mailbody += diff.decode('correct encoding') Thanks I'll look into that... It seems in general I have trouble with special characters... What is the python way to deal with éàè öäü etc... print 'é' fails here, print u'é' as well :-( How am I supposed to print non-ascii characters the correct way? best regards, Yves Sebastjan On 3/3/06, Yves Glodt [EMAIL PROTECTED] wrote: Hi list, Playing with the great pysvn I get this problem: Traceback (most recent call last): File D:\avn\mail.py, line 80, in ? mailbody += diff UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 10710: ordinal not in range(128) It seems the pysvn.client.diff function returns bytes (as I read in the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml) How can I convert this string so that I can concatenate it to my regular string? Best regards, Yves -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problem
I think you are trying to concatenate a unicode string with a regular one, so when it tries to convert the regular string to unicode with the ASCII (default) encoding it fails. First find out which of these strings is regular and how it was encoded, then you can decode it like this (if the regular string is diff): mailbody += diff.decode('correct encoding') Sebastjan On 3/3/06, Yves Glodt [EMAIL PROTECTED] wrote: Hi list, Playing with the great pysvn I get this problem: Traceback (most recent call last): File D:\avn\mail.py, line 80, in ? mailbody += diff UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 10710: ordinal not in range(128) It seems the pysvn.client.diff function returns bytes (as I read in the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml) How can I convert this string so that I can concatenate it to my regular string? Best regards, Yves -- http://mail.python.org/mailman/listinfo/python-list
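Sebastjan's fix, spelled out as a sketch (the byte string stands in for pysvn's diff output, and latin-1 is an assumed encoding for illustration; Python 3 raises TypeError where Python 2 raised the UnicodeDecodeError above, but the explicit decode is the cure in both):

```python
# Stand-in for pysvn's byte output; 0xE9 is é in latin-1 (assumed here).
diff = b"Index: caf\xe9.txt"
mailbody = u"Diff follows:\n"
# Decode explicitly with the real encoding instead of letting an implicit
# ascii decode (Python 2) or a str/bytes TypeError (Python 3) bite.
mailbody += diff.decode("latin-1")
print(mailbody)
```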