Re: Encoding problem in python

2013-08-21 Thread electron
If you use Arabic frequently on your system, I suggest changing your
Windows system locale: open Region and Language in Control Panel
(Administrative tab) and set it to Arabic.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem in python

2013-03-04 Thread Laszlo Nagy

On 2013-03-04 10:37, yomnasala...@gmail.com wrote:

I have a problem with encoding in the Python 2.7 shell.

when i write this in the python shell:

w=u'العربى'

It gives me the following error:

Unsupported characters in input

any help?
Maybe it is not Python related. Did you get an exception? Can you send a 
full traceback? I suspect that the error comes from your terminal, and 
not Python. Please make sure that your terminal supports UTF-8 encoding. 
Alternatively, try creating a file with this content:



# -*- encoding: UTF-8 -*-
w=u'العربى'

Save it as a UTF-8 encoded file test.py (with a UTF-8 compatible 
editor, for example Geany) and run it as a command:



python test.py

If it works, then the problem is certainly with your terminal. It 
would be an OS limitation, not Python's limitation.
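Before editing files, a quick diagnostic sketch (works under both Python 2 and 3) shows which encodings are actually in play for your terminal and interpreter:

```python
import sys

# The encoding Python detected for the attached terminal
# (may be None when output is redirected to a file or pipe).
print(sys.stdout.encoding)

# The interpreter-wide default used for implicit str/unicode
# conversions ('ascii' on Python 2, 'utf-8' on Python 3).
print(sys.getdefaultencoding())
```

If `sys.stdout.encoding` is not a UTF-8 variant, the terminal, not Python, is the likely culprit.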


Best,

   Laszlo
--
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem in python

2013-03-04 Thread Steven D'Aprano
On Mon, 04 Mar 2013 01:37:42 -0800, yomnasalah91 wrote:

 I have a problem with encoding in the Python 2.7 shell.
 
 when i write this in the python shell:
 
 w=u'العربى'
 
 It gives me the following error:
 
 Unsupported characters in input
 
 any help?

Firstly, please show the COMPLETE error, including the full traceback. 
Python errors look like (for example):

py> x = ord(100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found


Copy and paste the complete traceback.


Secondly, please describe your environment:

- What operating system and version are you using? Linux, Windows, Mac 
OS, something else? Which version or distro?

- Which console or terminal application? E.g. cmd.exe (Windows), konsole, 
xterm, something else?

- Which shell? E.g. the standard Python interpreter, IDLE, bpython, 
something else?


My guess is that this is not a Python problem, but an issue with your 
console. You should always have your console set to use UTF-8, if 
possible. I expect that your console is set to use a different encoding. 
In that case, see if you can change it to UTF-8. For example, using Gnome 
Terminal on Linux, I can do this:


py> w = u'العربى'
py> print w
العربى

and it works fine, but if I change the encoding to WINDOWS-1252 using the 
"Set character encoding" menu command, the terminal will not allow me to 
paste the string into the terminal. 
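The same failure can be reproduced directly: Windows-1252 simply has no code points for Arabic letters, so any attempt to represent the string in that charset fails. A sketch (Python 3 syntax):

```python
w = u'العربى'

# UTF-8 can represent any Unicode text, so this succeeds.
utf8_bytes = w.encode('utf-8')

# Windows-1252 has no Arabic letters, so this raises
# UnicodeEncodeError -- which is why a cp1252 console rejects the input.
try:
    w.encode('windows-1252')
    failed = False
except UnicodeEncodeError:
    failed = True
```

This is why switching the console (or its encoding) fixes the symptom without touching Python at all.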



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem in python

2013-03-04 Thread Vlastimil Brom
2013/3/4  yomnasala...@gmail.com:
 I have a problem with encoding in the Python 2.7 shell.

 when i write this in the python shell:

 w=u'العربى'

 It gives me the following error:

 Unsupported characters in input

 any help?
 --
 http://mail.python.org/mailman/listinfo/python-list


Hi,
I guess you are using the built-in IDLE shell with Python 2.7, and
this is a specific limitation of its handling of some Unicode
characters (in some builds and OSes; narrow-Unicode builds on
Windows, most likely), with its own specific error message rather
than the usual Python traceback mentioned in the other posts.
If it is viable, using Python 3.3 instead would solve this problem for IDLE:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> w='العربى'
>>> w
'العربى'

(Note the missing u before the opening quotation mark, which is the
usual form in Python 3; Python 3.3 also silently accepts the u'...'
prefix again for compatibility.)

>>> w=u'العربى'
>>> w
'العربى'


If python 2.7 is required, another shell is probably needed (unless I
am missing some option to make IDLE work for this input);
e.g. the following works in pyshell - part of the wxpython GUI library
http://www.wxpython.org/

>>> w=u'العربى'
>>> w
u'\u0627\u0644\u0639\u0631\u0628\u0649'
>>> print w
العربى


hth,
   vbr
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-08 Thread Nobody
On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

 Here is the final code for those who are struggling with similar
 problems:
 
 ## open and decode file
 # In this case, the encoding comes from the charset argument in a meta tag
 # e.g. <meta charset="iso-8859-2">
 fileObj = open(filePath, "r").read()
 fileContent = fileObj.decode("iso-8859-2")
 fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you
can specify it via the fromEncoding parameter to the BeautifulSoup
constructor, e.g.:

fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if
one is present, or a Unicode BOM, or using the chardet library if
available, or using built-in heuristics, before finally falling back to
Windows-1252 (which seems to be the preferred encoding of people who don't
understand what an encoding is or why it needs to be specified).
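That detection chain can be imitated in plain Python. The sketch below is a simplified, hypothetical stand-in for what the library does (the `sniff_charset` helper is illustrative, not part of Beautiful Soup): look for a declared charset in the raw bytes, and fall back to Windows-1252 otherwise.

```python
import re

def sniff_charset(raw, default='windows-1252'):
    # Look for a <meta ... charset=...> declaration in the first 1024
    # bytes; fall back to the caller-supplied default if none is found.
    m = re.search(br'<meta[^>]+charset=["\']?([\w-]+)', raw[:1024], re.I)
    return m.group(1).decode('ascii') if m else default

# A page whose meta tag declares iso-8859-2 (Polish text).
raw = u'<meta charset="iso-8859-2"><p>zak\u0142adnik\u00f3w</p>'.encode('iso-8859-2')
text = raw.decode(sniff_charset(raw))
```

Passing `fromEncoding` explicitly short-circuits all of this guessing, which is why it is the safest option when a Content-Type header is available.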

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread Ulrich Eckhardt

Am 06.10.2011 05:40, schrieb Steven D'Aprano:

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.


Just wondering, why do you split the latter two parts? I would have used 
codecs.open() to open the file and define the encoding in a single step. 
Is there a downside to this approach?


Otherwise, I can only confirm that your overall approach is the easiest 
way to get correct results.


Uli
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread Chris Angelico
On Thu, Oct 6, 2011 at 8:29 PM, Ulrich Eckhardt
ulrich.eckha...@dominalaser.com wrote:
 Just wondering, why do you split the latter two parts? I would have used
 codecs.open() to open the file and define the encoding in a single step. Is
 there a downside to this approach?


Those two steps still happen, even if you achieve them in a single
function call. What Steven described is language- and library-
independent.
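Concretely, the two spellings produce identical bytes on disk; codecs.open() merely folds the encode step into the write call. A small sketch (file names are illustrative):

```python
import codecs
import os
import tempfile

text = u'Branie zak\u0142adnik\u00f3w\n'  # u'Branie zakładników\n'
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, 'a.txt')
path_b = os.path.join(tmpdir, 'b.txt')

# Two explicit steps: encode text to bytes, then write the bytes.
f = open(path_a, 'wb')
f.write(text.encode('utf-8'))
f.close()

# One step: codecs.open() encodes transparently on each write.
f = codecs.open(path_b, 'w', 'utf-8')
f.write(text)
f.close()

same = open(path_a, 'rb').read() == open(path_b, 'rb').read()
```

So the choice is purely one of convenience; the conceptual encode-then-write steps happen either way.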

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread jmfauth
On 6 oct, 06:39, Greg gregor.hochsch...@googlemail.com wrote:
 Brilliant! It worked. Thanks!

 Here is the final code for those who are struggling with similar
 problems:

  ## open and decode file
  # In this case, the encoding comes from the charset argument in a meta tag
  # e.g. <meta charset="iso-8859-2">
  fileObj = open(filePath, "r").read()
  fileContent = fileObj.decode("iso-8859-2")
  fileSoup = BeautifulSoup(fileContent)

 ## Do some BeautifulSoup magic and preserve unicode, presume result is
 saved in 'text' ##

 ## write extracted text to file
 f = open(outFilePath, 'w')
 f.write(text.encode('utf-8'))
 f.close()




or  (Python2/Python3)

>>> import io
>>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
...     r = f.read()
...
>>> repr(r)
u'a\nb\nc\n'
>>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
...     t = f.write(r)
...
>>> f.closed
True

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread xDog Walker
On Thursday 2011 October 06 10:41, jmfauth wrote:
 or  (Python2/Python3)

  >>> import io
  >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
  ...     r = f.read()
  ...
  >>> repr(r)
  u'a\nb\nc\n'
  >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
  ...     t = f.write(r)
  ...
  >>> f.closed
  True

 jmf

What is this  io  of which you speak?

-- 
I have seen the future and I am not in it.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread John Gordon
In mailman.1785.1317928997.27778.python-l...@python.org xDog Walker 
thud...@gmail.com writes:

 What is this  io  of which you speak?

It was introduced in Python 2.6.

-- 
John Gordon   A is for Amy, who fell down the stairs
gor...@panix.com  B is for Basil, assaulted by bears
-- Edward Gorey, The Gashlycrumb Tinies

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Steven D'Aprano
On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:

 Hi, I am having some encoding problems when I first parse stuff from a
 non-english website using BeautifulSoup and then write the results to a
 txt file.

If you haven't already read this, you should do so:

http://www.joelonsoftware.com/articles/Unicode.html



 I have the text both as a normal (text) and as a unicode string (utext):
 print repr(text)
 'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the 
text and what it actually is. You're showing us a bunch of bytes, with no 
clue as to whether they are the right bytes or not. Considering that your 
Unicode text is also incorrect, I would say it is *not* right and your 
description of the problem is 100% backwards: the problem is not 
*writing* the text, but *reading* the bytes and decoding it.


You should do something like this:

(1) Inspect the web page to find out what encoding is actually used.

(2) If the web page doesn't know what encoding it uses, or if it uses 
bits and pieces of different encodings, then the source is broken and you 
shouldn't expect much better results. You could try guessing, but you 
should expect mojibake in your results.

http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.
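The steps above can be sketched end to end (Python 3 syntax; the page bytes and the iso-8859-2 charset are stand-ins for whatever the real page declares):

```python
import os
import tempfile

# (1)-(2) Suppose inspecting the page showed it is iso-8859-2.
page_bytes = u'Branie zak\u0142adnik\u00f3w'.encode('iso-8859-2')

# (3) Decode the page into Unicode text using the correct encoding.
text = page_bytes.decode('iso-8859-2')

# (4) Do all processing in Unicode, never on raw bytes.
text = text.upper()

# (5)-(6) Encode the result to UTF-8 bytes and write them to a file.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with open(out_path, 'wb') as f:
    f.write(text.encode('utf-8'))

result = open(out_path, 'rb').read().decode('utf-8')
```

Decode once at the input boundary, encode once at the output boundary; everything in between stays Unicode.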


[...]
 Now I am trying to save this to a file but I never get the encoding
 right. Here is what I tried (+ lot's of different things with encode,
 decode...):

 outFile = codecs.open(filePath, "w", "UTF8")
 outFile.write(utext)
 outFile.close()

That's the correct approach, but it won't help you if utext contains the 
wrong characters in the first place. The critical step is taking the 
bytes in the web page and turning them into text.

How are you generating utext?
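In fact the exact bytes shown above can be reproduced by decoding with the wrong charset: if the page is Polish iso-8859-2 but gets decoded as latin-1 before being re-encoded to UTF-8, 'Branie zakładników' turns into precisely the byte string in the question. A sketch, assuming that is what happened:

```python
# The page's real content, encoded as the site declares (iso-8859-2).
original = u'Branie zak\u0142adnik\u00f3w'.encode('iso-8859-2')

# Decode with the WRONG charset (latin-1), then re-encode to UTF-8:
# byte 0xb3 (latin-2 'ł') is misread as U+00B3, and 0xf3 as U+00F3.
mojibake = original.decode('latin-1').encode('utf-8')

# mojibake now matches the repr() shown in the post byte for byte.
```

That pinpoints the bug as the *reading* step, before BeautifulSoup or the output file is ever involved.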



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Greg
Brilliant! It worked. Thanks!

Here is the final code for those who are struggling with similar
problems:

## open and decode file
# In this case, the encoding comes from the charset argument in a meta tag
# e.g. <meta charset="iso-8859-2">
fileObj = open(filePath, "r").read()
fileContent = fileObj.decode("iso-8859-2")
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode, presume result is
## saved in 'text' ##

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))
f.close()



On Oct 5, 11:40 pm, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:
  Hi, I am having some encoding problems when I first parse stuff from a
  non-english website using BeautifulSoup and then write the results to a
  txt file.

 If you haven't already read this, you should do so:

 http://www.joelonsoftware.com/articles/Unicode.html

  I have the text both as a normal (text) and as a unicode string (utext):
  print repr(text)
  'Branie zak\xc2\xb3adnik\xc3\xb3w'

 This is pretty much meaningless, because we don't know how you got the
 text and what it actually is. You're showing us a bunch of bytes, with no
 clue as to whether they are the right bytes or not. Considering that your
 Unicode text is also incorrect, I would say it is *not* right and your
 description of the problem is 100% backwards: the problem is not
 *writing* the text, but *reading* the bytes and decoding it.

 You should do something like this:

 (1) Inspect the web page to find out what encoding is actually used.

 (2) If the web page doesn't know what encoding it uses, or if it uses
 bits and pieces of different encodings, then the source is broken and you
 shouldn't expect much better results. You could try guessing, but you
 should expect mojibake in your results.

 http://en.wikipedia.org/wiki/Mojibake

 (3) Decode the web page into Unicode text, using the correct encoding.

 (4) Do all your processing in Unicode, not bytes.

 (5) Encode the text into bytes using UTF-8 encoding.

 (6) Write the bytes to a file.

 [...]

  Now I am trying to save this to a file but I never get the encoding
  right. Here is what I tried (+ lot's of different things with encode,
  decode...):
  outFile = codecs.open(filePath, "w", "UTF8")
  outFile.write(utext)
  outFile.close()

 That's the correct approach, but it won't help you if utext contains the
 wrong characters in the first place. The critical step is taking the
 bytes in the web page and turning them into text.

 How are you generating utext?

 --
 Steven

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Chris Angelico
On Thu, Oct 6, 2011 at 3:39 PM, Greg gregor.hochsch...@googlemail.com wrote:
 Brilliant! It worked. Thanks!

 Here is the final code for those who are struggling with similar
 problems:

 ## open and decode file
 # In this case, the encoding comes from the charset argument in a meta tag
 # e.g. <meta charset="iso-8859-2">
 fileContent = fileObj.decode("iso-8859-2")
 f.write(text.encode('utf-8'))

In other words, when you decode correctly into Unicode and encode
correctly onto the disk, it works!

This is why encodings are so important :)

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem when launching Python27 via DOS

2011-04-11 Thread Jean-Pierre M
Thanks a lot for this quick answer! It is very clear!

To better understand the difference between encoding and decoding, I found
the following website: http://www.evanjones.ca/python-utf8.html

I changed the program to (changes were marked in bold in the original mail):
# -*- coding: utf8 -*-  (no more cp1252; the source file is directly in unicode)
#!/usr/bin/python
'''
Created on 27 déc. 2010

@author: jpmena
'''
from datetime import datetime
import locale
import codecs
import os, sys


class Log(object):
    log = None

    def __init__(self, log_path):
        self.log_path = log_path
        if os.path.exists(self.log_path):
            os.remove(self.log_path)
        #self.log = open(self.log_path, 'a')
        self.log = codecs.open(self.log_path, "a", 'utf-8')

    def getInstance(log_path=None):
        print "encodage systeme:" + sys.getdefaultencoding()
        if Log.log is None:
            if log_path is None:
                log_path = os.path.join(os.getcwd(), 'logParDefaut.log')
            Log.log = Log(log_path)
        return Log.log

    getInstance = staticmethod(getInstance)

    def p(self, msg):
        aujour_dhui = datetime.now()
        date_stamp = aujour_dhui.strftime("%d/%m/%y-%H:%M:%S")
        print sys.getdefaultencoding()
        unicode_str = '%s : %s \n' % (date_stamp, unicode(msg, 'utf-8'))
        #unicode_str = msg
        self.log.write(unicode_str)
        return unicode_str

    def close(self):
        self.log.flush()
        self.log.close()
        return self.log_path


if __name__ == '__main__':
    l = Log.getInstance()
    l.p("premier message de Log à accents")
    Log.getInstance().p("second message de Log")
    l.close()

The DOS console output is now:

C:\Documents and Settings\jpmena\Mes documents\VelocityRIF\VelocityTransforms>generationProgrammeSitePublicActuel.cmd
Page de codes active : 1252
encodage systeme:ascii
ascii
encodage systeme:ascii
ascii

And the generated log file now shows the expected result:

11/04/11-10:53:44 : premier message de Log à accents
11/04/11-10:53:44 : second message de Log

Thanks.

If you have other links of interest about Unicode encoding and decoding in
Python, they are welcome.

2011/4/10 MRAB pyt...@mrabarnett.plus.com

 On 10/04/2011 13:22, Jean-Pierre M wrote:
  I created a simple program which writes in a unicode files some french
 text with accents!
 [snip]
 This line:


l.p("premier message de Log à accents")

 passes a bytestring to the method, and inside the method, this line:


unicode_str = u'%s : %s \n' % (date_stamp, msg.encode(self.charset_log, 'replace'))

 it tries to encode the bytestring to Unicode.

 It's not possible to encode a bytestring, only a Unicode string, so
 Python tries to decode the bytestring using the fallback encoding
 (ASCII) and then encode the result.

 Unfortunately, the bytestring isn't ASCII (it contains accented
 characters), so it can't be decoded as ASCII, hence the exception.

 BTW, it's probably better to forget about cp1252, etc, and use UTF-8
 instead, and also to use Unicode wherever possible.
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem when launching Python27 via DOS

2011-04-10 Thread MRAB

On 10/04/2011 13:22, Jean-Pierre M wrote:
 I created a simple program which writes in a unicode files some 
french text with accents!

[snip]
This line:

l.p("premier message de Log à accents")

passes a bytestring to the method, and inside the method, this line:

unicode_str = u'%s : %s \n' % (date_stamp, msg.encode(self.charset_log, 'replace'))


it tries to encode the bytestring to Unicode.

It's not possible to encode a bytestring, only a Unicode string, so
Python tries to decode the bytestring using the fallback encoding
(ASCII) and then encode the result.

Unfortunately, the bytestring isn't ASCII (it contains accented
characters), so it can't be decoded as ASCII, hence the exception.

BTW, it's probably better to forget about cp1252, etc, and use UTF-8
instead, and also to use Unicode wherever possible.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??

2010-09-20 Thread Diez B. Roggisch
Ian Hobson i...@ianhobson.co.uk writes:

 Hi all,

 I have hit a problem and I don't know enough about python to diagnose
 things further. Trying to use couchDB from Python. This script:-

 # coding=utf8
 import couchdb
 from couchdb.client import Server
 server = Server()
 dbName = 'python-tests'
 try:
 db = server.create(dbName)
 except couchdb.PreconditionFailed:
 del server[dbName]
 db = server.create(dbName)
 doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})

 Gives this traceback:-

 D:\work\C-U-B>python tes1.py
 Traceback (most recent call last):
   File "tes1.py", line 11, in <module>
     doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
   File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py", line 407, in save
     _, _, data = func(body=doc, **options)
   File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 399, in post_json
     status, headers, data = self.post(*a, **k)
   File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 381, in post
     **params)
   File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 419, in _request
     credentials=self.credentials)
   File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 310, in request
     raise ServerError((status, error))
 couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))

 D:\work\C-U-B>

 Why? I've tried adding u to the strings, and removing the # coding
 line, and I still get the same error.

Sounds cargo-cultish. I suggest you read the python introduction on
unicode.

 http://docs.python.org/howto/unicode.html

For your actual problem, I have difficulties seeing how it can happen
with the above data - frankly because there is nothing outside the
ascii-range of data, so there is no reason why anything could be wrong
encoded.

But googling the error-message reveals that there seem to be totally
unrelated reasons for this:

  http://sindro.me/2010/4/3/couchdb-invalid-utf8-json

Maybe using something like tcpmon or ethereal to capture the actual
HTTP-request helps to see where the issue comes from.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??

2010-09-20 Thread Ian

 Thanks Diez,

Removing, rebooting and installing the latest version solved the 
problem.  :)


Your google-foo is better than mine.  Google had not turned that up for me.

Thanks again

Regards

Ian



On 20/09/2010 17:00, Diez B. Roggisch wrote:

Ian Hobsoni...@ianhobson.co.uk  writes:


Hi all,

I have hit a problem and I don't know enough about python to diagnose
things further. Trying to use couchDB from Python. This script:-

# coding=utf8
import couchdb
from couchdb.client import Server
server = Server()
dbName = 'python-tests'
try:
 db = server.create(dbName)
except couchdb.PreconditionFailed:
 del server[dbName]
 db = server.create(dbName)
doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})

Gives this traceback:-

D:\work\C-U-B>python tes1.py
Traceback (most recent call last):
  File "tes1.py", line 11, in <module>
    doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py", line 407, in save
    _, _, data = func(body=doc, **options)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 399, in post_json
    status, headers, data = self.post(*a, **k)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 381, in post
    **params)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 419, in _request
    credentials=self.credentials)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 310, in request
    raise ServerError((status, error))
couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))

D:\work\C-U-B>

Why? I've tried adding u to the strings, and removing the # coding
line, and I still get the same error.

Sounds cargo-cultish. I suggest you read the python introduction on
unicode.

  http://docs.python.org/howto/unicode.html

For your actual problem, I have difficulties seeing how it can happen
with the above data - frankly because there is nothing outside the
ascii-range of data, so there is no reason why anything could be wrong
encoded.

I came to the same conclusion.

But googling the error-message reveals that there seem to be totally
unrelated reasons for this:

   http://sindro.me/2010/4/3/couchdb-invalid-utf8-json

Maybe using something like tcpmon or ethereal to capture the actual
HTTP-request helps to see where the issue comes from.

Diez


--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2009-06-27 Thread Piet van Oostrum
 netpork todorovic.de...@gmail.com (n) wrote:

n Hello,
n I have ssl socket with server and client, on my development machine
n everything works pretty well.
n Database which I have to use is mssql on ms server 2003, so I decided
n to install the same python config there and run my python server
n script.

n Now here is the problem, server is returning strange characters
n although default encoding is the same on both development and server
n machines.


n Any hints?

Yes, read http://catb.org/esr/faqs/smart-questions.html
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2009-06-27 Thread dejan todorović
It was a problem with pymssql, which does not support unicode; I switched
to pyodbc and everything is fine.

Thanks for your swift reply. ;)



On Jun 27, 7:44 pm, Piet van Oostrum p...@cs.uu.nl wrote:
  netpork todorovic.de...@gmail.com (n) wrote:
 n Hello,
 n I have ssl socket with server and client, on my development machine
 n everything works pretty well.
 n Database which I have to use is mssql on ms server 2003, so I decided
 n to install the same python config there and run my python server
 n script.
 n Now here is the problem, server is returning strange characters
 n although default encoding is the same on both development and server
 n machines.
 n Any hints?

 Yes, read http://catb.org/esr/faqs/smart-questions.html
 --
 Piet van Oostrum p...@cs.uu.nl
 URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
 Private email: p...@vanoostrum.org

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-20 Thread Marc 'BlackJack' Rintsch
On Fri, 19 Dec 2008 16:50:39 -0700, Joe Strout wrote:

 Marc 'BlackJack' Rintsch wrote:
 
 And does REALbasic really use byte strings plus an encoding!?
 You betcha!  Works like a dream.
 
 IMHO a strange design decision.
 
 I get that you don't grok it, but I think that's because you haven't
 worked with it.  RB added encoding data to its strings years ago, and
 changed the default string encoding to UTF-8 at about the same time, and
 life has been delightful since then.  The only time you ever have to
 think about it is when you're importing a string from some unknown
 source (e.g. a socket), at which point you need to tell RB what encoding
 it is.  From that point on, you can pass that string around, extract
 substrings, split it into words, concatenate it with other strings,
 etc., and it all Just Works (tm).

Except that you don't know for sure what the output encoding will be, as 
it depends on the operations on the strings in the program flow.  So to 
be sure you have to en- or recode at output too.  And then it is the same 
as in Python -- decode when bytes enter the program and encode when 
(unicode) strings leave the program.

 In comparison, Python requires a lot more thought on the part of the
 programmer to keep track of what's what (unless, as you point out, you
 convert everything into unicode strings as soon as you get them, but
 that can be a very expensive operation to do on, say, a 500MB UTF-8 text
 file).

So it doesn't require more thought.  Unless you complicate it yourself, 
but that is language independent.

I would not do operations on 500 MiB text in any language if there is any 
way to break that down into smaller chunks.  Slurping in large files 
doesn't scale very well.  On my Eee-PC even a 500 MiB byte `str` is (too) 
expensive.

 But saying that having only one string type that knows it's Unicode, and
 another string type that hasn't the foggiest clue how to interpret its
 data as text, is somehow easier than every string knowing what it is and
 doing the right thing -- well, that's just silly.

Sorry, I meant the implementation not the POV of the programmer, which 
seems to be quite the same.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Bruno Desthuilliers

digisat...@gmail.com a écrit :

The below snippet code generates UnicodeDecodeError.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = 'äöü'
u = unicode(s)


It seems that the system use the default encoding- ASCII to decode the
utf8 encoded string literal, and thus generates the error.


Indeed. You want:

u = unicode(s, 'utf-8') # or : u = s.decode('utf-8')


The question is why the Python interpreter use the default encoding
instead of utf-8, which I explicitly declared in the source.


Because there's no reliable way for the interpreter to guess how what's 
passed to unicode has been encoded ?


s = s.decode("utf-8").encode("latin1")
# should unicode try to use utf-8 here ?
try:
    u = unicode(s)
except UnicodeDecodeError:
    print "would have worked better with u = unicode(s, 'latin1')"


NB : IIRC, the ascii subset is safe whatever the encoding, so I'd say 
it's a sensible default...

--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Marc 'BlackJack' Rintsch
On Fri, 19 Dec 2008 04:05:12 -0800, digisat...@gmail.com wrote:

 The below snippet code generates UnicodeDecodeError.
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 s = 'äöü'
 u = unicode(s)
 
 
 It seems that the system use the default encoding- ASCII to decode the
 utf8 encoded string literal, and thus generates the error.
 
 The question is why the Python interpreter use the default encoding
 instead of utf-8, which I explicitly declared in the source.

Because the declaration is only for decoding unicode literals in that 
very source file.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Joe Strout

Marc 'BlackJack' Rintsch wrote:


The question is why the Python interpreter use the default encoding
instead of utf-8, which I explicitly declared in the source.


Because the declaration is only for decoding unicode literals in that 
very source file.


And because strings in Python, unlike in (say) REALbasic, do not know 
their encoding -- they're just a string of bytes.  If they were a string 
of bytes PLUS an encoding, then every string would know what it is, and 
things like conversion to another encoding, or concatenation of two 
strings that may differ in encoding, could be handled automatically.


I consider this one of the great shortcomings of Python, but it's mostly 
just a temporary inconvenience -- the world is moving to Unicode, and 
with Python 3, we won't have to worry about it so much.


Best,
- Joe




--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread digisat...@gmail.com
On Dec 19, 9:34 pm, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:
 On Fri, 19 Dec 2008 04:05:12 -0800, digisat...@gmail.com wrote:
  The below snippet code generates UnicodeDecodeError.
  #!/usr/bin/env python
  # -*- coding: utf-8 -*-
  s = 'äöü'
  u = unicode(s)

  It seems that the system use the default encoding- ASCII to decode the
  utf8 encoded string literal, and thus generates the error.

  The question is why the Python interpreter use the default encoding
  instead of utf-8, which I explicitly declared in the source.

 Because the declaration is only for decoding unicode literals in that
 very source file.

 Ciao,
         Marc 'BlackJack' Rintsch

Thanks for the answer.
I believe the declaration is not only for unicode literals; it applies to
all literals in the source, even including comments. We can try running
a source file without an encoding declaration that has only one line of
comments with non-ASCII characters. That will raise a SyntaxError and
point to the PEP 263 URL.

I read the pep263 and quoted below:

 Python's tokenizer/compiler combo will need to be updated to work as
follows:
   1. read the file
   2. decode it into Unicode assuming a fixed per-file encoding
   3. convert it into a UTF-8 byte string
   4. tokenize the UTF-8 content
   5. compile it, creating Unicode objects from the given Unicode
data
  and creating string objects from the Unicode literal data
  by first reencoding the UTF-8 data into 8-bit string data
  using the given file encoding

The Python internal process described above indicates that step 2
uses the specified encoding to decode all literals in the source,
while step 5 involves re-encoding the UTF-8 data with the specified
encoding.

That is the reason why we have to explicitly declare an encoding
whenever we have non-ASCII in the source.

Bruno answered why we need to specify an encoding when decoding a byte
string with a perfect explanation. Thank you very much.
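As an aside, the failure mode is easy to reproduce today. This is a sketch in
Python 3 byte-string syntax so it runs on a current interpreter (the Python 2
spellings would be `unicode(s)` and `unicode(s, 'utf-8')`); the bytes are
exactly what Python 2 stored for the literal s = 'äöü' in a UTF-8 source file:

```python
# The bytes Python 2 stored for the literal s = 'äöü' in a UTF-8 source file:
s = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decoding with ASCII -- Python 2's default for unicode(s) -- fails,
# because 0xc3 is outside the ASCII range:
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    print('decode failed:', exc)

# Passing the real encoding explicitly succeeds:
u = s.decode('utf-8')
assert u == u'\xe4\xf6\xfc'  # the code points for ä, ö, ü
```

which is just step 2 of the PEP 263 pipeline done by hand, with the encoding
supplied instead of defaulted.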
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Marc 'BlackJack' Rintsch
On Fri, 19 Dec 2008 08:20:07 -0700, Joe Strout wrote:

 Marc 'BlackJack' Rintsch wrote:
 
 The question is why the Python interpreter use the default encoding
 instead of utf-8, which I explicitly declared in the source.
 
 Because the declaration is only for decoding unicode literals in that
 very source file.
 
 And because strings in Python, unlike in (say) REALbasic, do not know
 their encoding -- they're just a string of bytes.  If they were a string
 of bytes PLUS an encoding, then every string would know what it is, and
 things like conversion to another encoding, or concatenation of two
 strings that may differ in encoding, could be handled automatically.
 
 I consider this one of the great shortcomings of Python, but it's mostly
 just a temporary inconvenience -- the world is moving to Unicode, and
 with Python 3, we won't have to worry about it so much.

I don't see the shortcoming in Python 3.0.  If you want real strings 
with characters instead of just a bunch of bytes simply use `unicode` 
objects instead of `str`.

And does REALbasic really use byte strings plus an encoding!?  Sounds 
strange.  When concatenating which encoding wins?

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Joe Strout

Marc 'BlackJack' Rintsch wrote:


And because strings in Python, unlike in (say) REALbasic, do not know
their encoding -- they're just a string of bytes.  If they were a string
of bytes PLUS an encoding, then every string would know what it is, and
things like conversion to another encoding, or concatenation of two
strings that may differ in encoding, could be handled automatically.

I consider this one of the great shortcomings of Python, but it's mostly
just a temporary inconvenience -- the world is moving to Unicode, and
with Python 3, we won't have to worry about it so much.


I don't see the shortcoming in Python 3.0.  If you want real strings 
with characters instead of just a bunch of bytes simply use `unicode` 
objects instead of `str`.


Fair enough -- that certainly is the best policy.  But working with any 
other encoding (sometimes necessary when interfacing with any other 
software), it's still a bit of a PITA.



And does REALbasic really use byte strings plus an encoding!?


You betcha!  Works like a dream.


Sounds strange.  When concatenating which encoding wins?


The one that is a superset of the other, or if neither is, then both are 
converted to UTF-8 (which is the standard encoding in RB, though it 
works comfily with any other too).


Cheers,
- Joe

--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Marc 'BlackJack' Rintsch
On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote:

 Marc 'BlackJack' Rintsch wrote:
 
 And because strings in Python, unlike in (say) REALbasic, do not know
 their encoding -- they're just a string of bytes.  If they were a
 string of bytes PLUS an encoding, then every string would know what it
 is, and things like conversion to another encoding, or concatenation
 of two strings that may differ in encoding, could be handled
 automatically.

 I consider this one of the great shortcomings of Python, but it's
 mostly just a temporary inconvenience -- the world is moving to
 Unicode, and with Python 3, we won't have to worry about it so much.
 
 I don't see the shortcoming in Python 3.0.  If you want real strings
 with characters instead of just a bunch of bytes simply use `unicode`
 objects instead of `str`.
 
 Fair enough -- that certainly is the best policy.  But working with any
 other encoding (sometimes necessary when interfacing with any other
 software), it's still a bit of a PITA.

But it has to be.  There is no automagic guessing possible.

 And does REALbasic really use byte strings plus an encoding!?
 
 You betcha!  Works like a dream.

IMHO a strange design decision.  A lot more hassle compared to an opaque 
unicode string type which uses some internal encoding that makes 
operations like getting a character at a given index easy or 
concatenating without the need to reencode.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread John Machin
On Dec 20, 10:02 am, Marc 'BlackJack' Rintsch bj_...@gmx.net wrote:
 On Fri, 19 Dec 2008 15:20:08 -0700, Joe Strout wrote:
  Marc 'BlackJack' Rintsch wrote:

  And because strings in Python, unlike in (say) REALbasic, do not know
  their encoding -- they're just a string of bytes.  If they were a
  string of bytes PLUS an encoding, then every string would know what it
  is, and things like conversion to another encoding, or concatenation
  of two strings that may differ in encoding, could be handled
  automatically.

  I consider this one of the great shortcomings of Python, but it's
  mostly just a temporary inconvenience -- the world is moving to
  Unicode, and with Python 3, we won't have to worry about it so much.

  I don't see the shortcoming in Python 3.0.  If you want real strings
  with characters instead of just a bunch of bytes simply use `unicode`
  objects instead of `str`.

  Fair enough -- that certainly is the best policy.  But working with any
  other encoding (sometimes necessary when interfacing with any other
  software), it's still a bit of a PITA.

 But it has to be.  There is no automagic guessing possible.

  And does REALbasic really use byte strings plus an encoding!?

  You betcha!  Works like a dream.

 IMHO a strange design decision.  A lot more hassle compared to an opaque
 unicode string type which uses some internal encoding that makes
 operations like getting a character at a given index easy or
 concatenating without the need to reencode.

In general I quite agree with you ... however with Unicode getting a
character at a given index is fine unless and until you stray (or are
dragged!) outside the BMP and you have only a 16-bit Unicode
implementation.
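The 16-bit caveat can be illustrated on a current Python (a sketch; MUSICAL
SYMBOL G CLEF is just an arbitrary character outside the BMP):

```python
ch = '\U0001d11e'  # MUSICAL SYMBOL G CLEF, outside the BMP

# In UTF-16 this character needs a surrogate pair, i.e. two 16-bit
# code units -- which is what a narrow Unicode build indexed by:
units = ch.encode('utf-16-be')
assert len(units) // 2 == 2  # 2 code units for 1 character

# A wide (or flexible-width) implementation counts real characters:
assert len(ch) == 1
```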
--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-12-19 Thread Joe Strout

Marc 'BlackJack' Rintsch wrote:


I don't see the shortcoming in Python 3.0.  If you want real strings
with characters instead of just a bunch of bytes simply use `unicode`
objects instead of `str`.

Fair enough -- that certainly is the best policy.  But working with any
other encoding (sometimes necessary when interfacing with any other
software), it's still a bit of a PITA.


But it has to be.  There is no automagic guessing possible.


Automagic guessing isn't possible if strings keep track of what encoding 
their data is.  And why shouldn't they?  We're a long way from the day 
when a string was nothing more than an array of bytes.  Adding a teeny 
bit of metadata makes life much easier.



And does REALbasic really use byte strings plus an encoding!?

You betcha!  Works like a dream.


IMHO a strange design decision.


I get that you don't grok it, but I think that's because you haven't 
worked with it.  RB added encoding data to its strings years ago, and 
changed the default string encoding to UTF-8 at about the same time, and 
life has been delightful since then.  The only time you ever have to 
think about it is when you're importing a string from some unknown 
source (e.g. a socket), at which point you need to tell RB what encoding 
it is.  From that point on, you can pass that string around, extract 
substrings, split it into words, concatenate it with other strings, 
etc., and it all Just Works (tm).


In comparison, Python requires a lot more thought on the part of the 
programmer to keep track of what's what (unless, as you point out, you 
convert everything into unicode strings as soon as you get them, but 
that can be a very expensive operation to do on, say, a 500MB UTF-8 text 
file).


A lot more hassle compared to an opaque 
unicode string type which uses some internal encoding that makes 
operations like getting a character at a given index easy or 
concatenating without the need to reencode.


No.  RB supports UCS-2 encoding, too, and is smart enough to take 
advantage of the fixed character width of any encoding when that's what 
a string happens to be.  And no reencoding is used when it's not 
necessary (e.g., concatenating two strings of the same encoding, or 
adding an ASCII string to a string using any ASCII superset, such as 
UTF-8).  There's nothing stopping you from converting all your strings 
to UCS-2 when you get them, if that's your preference.


But saying that having only one string type that knows it's Unicode, and 
another string type that hasn't the foggiest clue how to interpret its 
data as text, is somehow easier than every string knowing what it is and 
doing the right thing -- well, that's just silly.


Best,
- Joe

--
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2008-05-16 Thread Mike Driscoll
On May 16, 3:31 pm, Luis Zarrabeitia [EMAIL PROTECTED] wrote:
 Hi, guys.
 I'm trying to read an xml file and output some of the nodes. For that, I'm
 doing a
 print node.toprettyxml()

 However, I get this exception:

 ===
     out.write(tag.toxml())
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position
 190: ordinal not in range(128)
 ===

 That happens if I print it, or send it to stdout, or send it to a file.

 How can I fix it?
 cat file works perfectly, and I'm using a UTF-8 terminal.

 I'm particularly puzzled that it won't work even if I write to a file opened
 in b mode. Worst thing is... I don't really need that character, just a
 general idea of what the document looks like.

 --
 Luis Zarrabeitia (aka Kyrie)
 Fac. de Matemática y Computación, UH. http://profesores.matcom.uh.cu/~kyrie


I recommend studying up on Python's Unicode methods and the codecs
module. This site actually talks about your specific issue though and
gives pointers:

http://evanjones.ca/python-utf8.html
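The usual fix (a sketch, not specific to pysvn or minidom) is to encode
explicitly before writing, or to open the file through the codecs module so it
encodes for you; u'\xba' is the character from the traceback above:

```python
import codecs
import os
import tempfile

u = u'10\xba'  # u'\xba' is MASCULINE ORDINAL INDICATOR, from the traceback

# Encoding by hand turns the unicode string into UTF-8 bytes that any
# byte-oriented file (or socket) will accept:
assert u.encode('utf-8') == b'10\xc2\xba'

# codecs.open (present in both Python 2 and 3) returns a file object
# whose write() accepts unicode and encodes it on the way out:
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u)
with open(path, 'rb') as f:
    assert f.read() == b'10\xc2\xba'
```

Opening in 'b' mode alone doesn't help, because the encode from unicode to
bytes still has to happen somewhere, and without an explicit codec it falls
back to ASCII.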

HTH

Mike
--
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem with web application (Paste+Mako)

2007-06-06 Thread Rob Wolfe

[EMAIL PROTECTED] wrote:
 Hi

 I have a problem with encoding non-ascii characters in a web
 application. The application uses Paste and Mako.

 The code is here: http://www.webudkast.dk/demo.txt

 The main points are:

 After getting some user generated input using
 paste.request.parse_formvars, how should this be correctly saved to
 file?

 How should this afterward be read from the file, and fed correctly
 into a Mako template?

You have to know the encoding of user input and then you
can use ``input_encoding`` and ``output_encoding`` parameters
of ``Template``. Mako internally handles everything as Python unicode
objects.
For example:

t = Template(filename="templ.mako", input_encoding="iso-8859-2",
             output_encoding="iso-8859-2")
content = t.render(**context)

--
HTH,
Rob

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem with web application (Paste+Mako)

2007-06-06 Thread Martin Skou
Rob Wolfe wrote:
 
 You have to know the encoding of user input and then you
 can use ``input_encoding`` and ``output_encoding`` parameters
 of ``Template``. Mako internally handles everything as Python unicode
 objects.
 For example:
 
 t = Template(filename="templ.mako", input_encoding="iso-8859-2",
              output_encoding="iso-8859-2")
 content = t.render(**context)
 
 --
 HTH,
 Rob
 

Thanks Rob

Using:

t = Template(content, input_encoding="utf-8", output_encoding="utf-8")

did the trick. Thanks for the help.

/Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-09 Thread Martin v. Löwis
Yves Glodt wrote:
 It seems in general I have trouble with special characters...
 What is the python way to deal with éàè öäü etc...
 
 print 'é' fails here,
 print u'é' as well :-(
 
 How am I supposed to print non-ascii characters the correct way?

The second form should be used, but not in interactive mode.
In a Python script, make sure you properly declare the encoding
of your script, e.g.

# -*- coding: iso-8859-1 -*-
print u'é'

That should work. If not, give us your Python version, operating
system name, and mode of operation.

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-09 Thread Scott David Daniels
Yves Glodt wrote:
 It seems in general I have trouble with special characters...
 What is the python way to deal with éàè öäü etc...
 
 print 'é' fails here,
This should probably stay true.

 print u'é' as well :-(
This is an issue with how your output is connected.
What OS, what code page, what application?
I'm using Win2K, Python 2.4.2
Using Idle, I can do:
print u'élève'
And get what I expect
I can also do:
print repr(u'élève')
which gives me:
 u'\xe9l\xe8ve'
and:
 print u'\xe9l\xe8ve'
Also shows me élève.


With cmd.exe (the command line):
 c:\ python
  print u'\xe9l\xe8ve'
shows me élève, but I can't type it in:
  print u'lve'
is what I get when I paste in print u'élève' (it beeps during the paste
and drops the accented characters).
What do you get if you put in:
  print repr('élève')

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-08 Thread Yves Glodt
Sebastjan Trepca wrote:
 I think you are trying to concatenate a unicode string with a regular
 one, so when it tries to convert the regular string to unicode with the
 ASCII (default) encoding it fails. First find out which of these
 strings is regular and how it was encoded, then you can decode it like
 this (if the regular string is diff):
 
 mailbody += diff.decode('correct encoding')

Thanks I'll look into that...

It seems in general I have trouble with special characters...
What is the python way to deal with éàè öäü etc...

print 'é' fails here,
print u'é' as well :-(

How am I supposed to print non-ascii characters the correct way?


best regards,
Yves

 Sebastjan
 
 On 3/3/06, Yves Glodt [EMAIL PROTECTED] wrote:
 Hi list,


 Playing with the great pysvn I get this problem:


 Traceback (most recent call last):
File D:\avn\mail.py, line 80, in ?
  mailbody += diff
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position
 10710: ordinal not in range(128)



 It seems the pysvn.client.diff function returns bytes (as I read in
 the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml)

 How can I convert this string so that I can concatenate it to my
 regular string?


 Best regards,
 Yves
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem

2006-03-03 Thread Sebastjan Trepca
I think you are trying to concatenate a unicode string with a regular
one, so when it tries to convert the regular string to unicode with the
ASCII (default) encoding it fails. First find out which of these
strings is regular and how it was encoded, then you can decode it like
this (if the regular string is diff):

mailbody += diff.decode('correct encoding')
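For instance, if the diff came back as Latin-1 bytes (an assumption here, but
the 0xe9 in the traceback is 'é' in Latin-1), the decode-then-concatenate
pattern looks like this (Python 3 byte-literal syntax so it runs today):

```python
mailbody = u'Diff follows:\n'
diff = b'r\xe9sum\xe9.txt modified'  # Latin-1 bytes, as pysvn might return

# Under Python 2, `mailbody += diff` implicitly decoded diff with ASCII
# and raised the UnicodeDecodeError from the traceback; decoding
# explicitly with the right codec avoids that:
mailbody += diff.decode('latin-1')
assert mailbody == u'Diff follows:\nr\xe9sum\xe9.txt modified'
```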

Sebastjan

On 3/3/06, Yves Glodt [EMAIL PROTECTED] wrote:
 Hi list,


 Playing with the great pysvn I get this problem:


 Traceback (most recent call last):
File D:\avn\mail.py, line 80, in ?
  mailbody += diff
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position
 10710: ordinal not in range(128)



 It seems the pysvn.client.diff function returns bytes (as I read in
 the changelog of pysvn: http://svn.haxx.se/dev/archive-2005-10/0466.shtml)

 How can I convert this string so that I can concatenate it to my
 regular string?


 Best regards,
 Yves
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list