Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-02-26 Thread Anjanesh Lekshminarayanan
 (1) what is produced on Anjanesh's machine
>>> sys.getdefaultencoding()
'utf-8'

 (2) it looks like a small snippet from a Python source file!
It's a file containing just JSON data, but it has some Unicode characters
as well, since it has data from the web.

 Anjanesh, Is it a .py file
It's a .json file. I have a bunch of these JSON files which I'm parsing
using the json library.
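
For illustration, a minimal sketch of reading and parsing one such file with an
explicit encoding in Python 3 (the filename is hypothetical, and the utf-8
assumption reflects where this thread ends up):

import json

# Pass the encoding explicitly instead of relying on the platform default
# (the platform default is what triggered the cp1252 error in the first place).
with open('results.json', encoding='utf-8') as f:
    data = json.load(f)   # parse the JSON text into Python objects

print(type(data))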

 Instead of "something like", please report exactly what is there:

 print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>>> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
b':42,query:0 1\xc2\xbb\xc3\x9d \\u2021 0\\u201a0 \\u2'

 Trouble with cases like this is as soon as they become interesting, the OP often
 snatches somebody's one-liner that works (i.e. doesn't raise an exception),
 makes a quick break for the county line, and they're not seen again :-)

Actually, I moved the files to my Ubuntu PC, which has Python 2.5.2, and it
didn't give the encoding issue. I just couldn't spend that much time on
why a couple of these files had encoding issues in Py3, since I had to
parse a whole lot of files.
--
http://mail.python.org/mailman/listinfo/python-list


UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Anjanesh Lekshminarayanan
I'm reading a file, but there seems to be some encoding error.

>>> f = open(filename)
>>> data = f.read()
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    data = f.read()
  File "C:\Python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\Python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>

The string at position 10442 is something like this:
query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

So what encoding value am I supposed to give? I tried f =
open(filename, encoding="cp1252") but still got the same error. I guess
Python 3 auto-detects it as cp1252.
-- 
Anjanesh Lekshmnarayanan
--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Benjamin Kaplan
On Thu, Jan 29, 2009 at 11:24 AM, Anjanesh Lekshminarayanan 
m...@anjanesh.net wrote:

 I'm reading a file, but there seems to be some encoding error.

 >>> f = open(filename)
 >>> data = f.read()
 Traceback (most recent call last):
   File "<pyshell#2>", line 1, in <module>
     data = f.read()
   File "C:\Python30\lib\io.py", line 1724, in read
     decoder.decode(self.buffer.read(), final=True))
   File "C:\Python30\lib\io.py", line 1295, in decode
     output = self.decoder.decode(input, final=final)
   File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
 10442: character maps to <undefined>

 The string at position 10442 is something like this:
 query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

 So what encoding value am I supposed to give? I tried f =
 open(filename, encoding="cp1252") but still got the same error. I guess
 Python 3 auto-detects it as cp1252.


It does auto-detect it as cp1252 -- look at the files in the traceback and
you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
encoding, try opening it as utf-8 or latin1 and see if that fixes it.
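
For illustration, a minimal sketch of that suggestion, trying the candidates in
order (the filename is hypothetical; utf-8 is tried first because it can
actually fail, whereas latin1 never does):

for enc in ('utf-8', 'latin1'):
    try:
        with open('the_file', encoding=enc) as f:
            data = f.read()
        print('decoded without error as', enc)
        break
    except UnicodeDecodeError as exc:
        print(enc, 'failed:', exc)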


 --
 Anjanesh Lekshmnarayanan
 --
 http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Anjanesh Lekshminarayanan
 It does auto-detect it as cp1252- look at the files in the traceback and
 you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
 encoding, try opening it as utf-8 or latin1 and see if that fixes it.

Thanks a lot! utf-8 and latin1 were accepted!
--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Benjamin Kaplan
On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan 
m...@anjanesh.net wrote:

  It does auto-detect it as cp1252- look at the files in the traceback and
  you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
  encoding, try opening it as utf-8 or latin1 and see if that fixes it.

 Thanks a lot ! utf-8 and latin1 were accepted !
 --


If you want to read the file as text, find out which encoding it actually
is; in the wrong one of those encodings, you'll probably see some nonsense
characters. If you are just looking at the file as a sequence of bytes, open
the file in binary mode rather than text mode. That way, you'll avoid this
issue altogether (just make sure you use byte strings instead of unicode strings).
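
For illustration, a minimal sketch of the binary-mode approach (hypothetical
filename; note that in Python 3 indexing a bytes object gives an int):

with open('the_file', 'rb') as f:   # 'rb': read raw bytes, no decoding attempted
    raw = f.read()

print(type(raw))           # <class 'bytes'>
print(hex(raw[10442]))     # the offending byte from the traceback, e.g. 0x9d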
--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Benjamin Peterson
Anjanesh Lekshminarayanan mail at anjanesh.net writes:

 
  It does auto-detect it as cp1252- look at the files in the traceback and
  you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
  encoding, try opening it as utf-8 or latin1 and see if that fixes it.
 
 Thanks a lot ! utf-8 and latin1 were accepted !

Just so you know, latin-1 can decode any sequence of bytes, so it will always
work even if that's not the real encoding.
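
A quick demonstration of that point: every one of the 256 possible byte values
decodes under latin-1, so a successful decode says nothing about the real
encoding.

all_bytes = bytes(range(256))        # every possible byte value, 0x00-0xff
text = all_bytes.decode('latin-1')   # never raises UnicodeDecodeError
print(len(text))                     # 256 -- each byte maps to exactly one character
print(repr(text[0x9d]))              # '\x9d', a control character, not an error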




--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread John Machin
Benjamin Kaplan bsk16 at case.edu writes:

 
 
 On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan mail at
anjanesh.net wrote:
  It does auto-detect it as cp1252- look at the files in the traceback and
  you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
  encoding, try opening it as utf-8 or latin1 and see if that fixes it.

Benjamin, "auto-detect" has strong connotations of the open() call (with mode
including text and encoding not specified) reading some/all of the file and
trying to guess what the encoding might be -- a futile pursuit, and not what the
docs say:

"encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent,
but any encoding supported by Python can be passed. See the codecs module for
the list of supported encodings."

On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It
would be interesting to know
(1) what is produced on Anjanesh's machine
(2) how the default encoding is derived (I would have thought I was a prime
candidate for 'cp1252')
(3) whether the 'default encoding' of open() is actually the same as the
'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs
don't say so.

 Thanks a lot ! utf-8 and latin1 were accepted !

Benjamin and Anjanesh, please understand that
any_random_rubbish.decode('latin1') will be accepted. This is *not* useful
information to be greeted with thanks and exclamation marks. It is merely a
by-product of the fact that *any* single-byte character set like latin1 that
uses all 256 possible bytes cannot fail, by definition; no character maps to
<undefined>.

 If you want to read the file as text, find out which encoding it actually is.
In one of those encodings, you'll probably see some nonsense characters. If you
are just looking at the file as a sequence of bytes, open the file in binary
mode rather than text. That way, you'll avoid this issue all together (just make
sure you use byte strings instead of unicode strings).

In fact, inspection of Anjanesh's report:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
10442: character maps to <undefined>
The string at position 10442 is something like this:
query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

draws two observations:
(1) there is nothing in the reported string that can be unambiguously identified
as corresponding to 0x9d
(2) it looks like a small snippet from a Python source file!

Anjanesh, is it a .py file? If so, is there something like "# encoding: cp1252"
or "# encoding: utf-8" near the start of the file? *Please* tell us what
sys.getdefaultencoding() returns on your machine.

Instead of "something like", please report exactly what is there:

print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))

Cheers,
John

--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread Benjamin Kaplan
On Thu, Jan 29, 2009 at 4:19 PM, John Machin sjmac...@lexicon.net wrote:

 Benjamin Kaplan bsk16 at case.edu writes:

 
 
  On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan mail at
 anjanesh.net wrote:
   It does auto-detect it as cp1252- look at the files in the traceback
 and
   you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
   encoding, try opening it as utf-8 or latin1 and see if that fixes it.

 Benjamin, auto-detect has strong connotations of the open() call (with
 mode
 including text and encoding not specified) reading some/all of the file and
 trying to guess what the encoding might be -- a futile pursuit and not what
 the
 docs say:

 encoding is the name of the encoding used to decode or encode the file.
 This
 should only be used in text mode. The default encoding is platform
 dependent,
 but any encoding supported by Python can be passed. See the codecs module
 for
 the list of supported encodings

 On my machine [Windows XL SP3] sys.getdefaultencoding() returns 'utf-8'. It
 would be interesting to know
 (1) what is produced on Anjanesh's machine
 (2) how the default encoding is derived (I would have thought I was a prime
 candidate for 'cp1252')
 (3) whether the 'default encoding' of open() is actually the same as the
 'default encoding' of sys.getdefaultencoding() -- one would hope so but the
 docs
 don't say so.


First of all, you're right, that might be confusing. I was thinking of
"auto-detect" as in "check the platform and locale and guess what they usually
use". I wasn't thinking of it the way web browsers use it.

I think it uses locale.getpreferredencoding(). On my machine, I get
sys.getpreferredencoding() == 'utf-8' and locale.getdefaultencoding()==
'cp1252'. When I open a file without specifying the encoding, it's cp1252.
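
A quick way to inspect both values on a given machine (the actual function
names are sys.getdefaultencoding() and locale.getpreferredencoding(); the two
got swapped above, as John points out in the next message):

import locale
import sys

print(sys.getdefaultencoding())       # codec used by str.encode()/bytes.decode() with no argument
print(locale.getpreferredencoding())  # what open() typically falls back to when no encoding is given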



  Thanks a lot ! utf-8 and latin1 were accepted !

 Benjamin and Anjanesh, Please understand that
 any_random_rubbish.decode('latin1') will be accepted. This is *not*
 useful
 information to be greeted with thanks and exclamation marks. It is merely a
 by-product of the fact that *any* single-byte character set like latin1
 that
 uses all 256 possible bytes can not fail, by definition; no character maps
 to
 undefined.


If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the
file in binary mode. I suggested utf-8 and latin1 because those are the most
likely candidates for his file since cp1252 was already excluded. Looking at
a character map, 0x9d is a control character in latin1, so the page is
probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.
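
For illustration, a rough way to check those candidates directly against the
raw bytes (hypothetical filename; the encodings listed follow the discussion
above):

with open('the_file', 'rb') as f:
    raw = f.read()

for enc in ('cp1252', 'utf-8', 'latin-1', 'mac_roman'):
    try:
        raw.decode(enc)
        print(enc, 'decodes without error')
    except UnicodeDecodeError as exc:
        print(enc, 'fails:', exc)

Note that latin-1 and mac_roman assign all 256 byte values, so they can never
fail here; only the cp1252 and utf-8 results are actually informative.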


  If you want to read the file as text, find out which encoding it actually
 is.
 In one of those encodings, you'll probably see some nonsense characters. If
 you
 are just looking at the file as a sequence of bytes, open the file in
 binary
 mode rather than text. That way, you'll avoid this issue all together (just
 make
 sure you use byte strings instead of unicode strings).

 In fact, inspection of Anjanesh's report:
 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
 10442: character maps to undefined
 The string at position 10442 is something like this :
 query:0 1»Ý \u2021 0\u201a0 \u2021»Ý, 

 draws two observations:
 (1) there is nothing in the reported string that can be unambiguously
 identified
 as corresponding to 0x9d
 (2) it looks like a small snippet from a Python source file!

 Anjanesh, Is it a .py file? If so, is there something like # encoding:
 cp1252
 or # encoding: utf-8 near the start of the file? *Please* tell us what
 sys.getdefaultencoding() returns on your machine.

 Instead of something like, please report exactly what is there:

 print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))

 Cheers,
 John

 --
 http://mail.python.org/mailman/listinfo/python-list

--
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined

2009-01-29 Thread John Machin
Benjamin Kaplan benjamin.kaplan at case.edu writes:

 First of all, you're right that might be confusing. I was thinking of
auto-detect as in check the platform and locale and guess what they usually
use. I wasn't thinking of it like the web browsers use it. I think it uses
locale.getpreferredencoding().

You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll
raise a request on the bug tracker to get some more precise wording in the
open() docs.

 On my machine, I get sys.getpreferredencoding() == 'utf-8' and
locale.getdefaultencoding()== 'cp1252'. 

sys <-> locale ... +1 long-range transposition typo of the year :-)

 If you check my response to Anjanesh's comment, I mentioned that he should
either find out which encoding it is in particular or he should open the file in
binary mode. I suggested utf-8 and latin1 because those are the most likely
candidates for his file since cp1252 was already excluded.

The OP is on a Windows machine. His file looks like a source code file. He is
unlikely to be creating latin1 files himself on a Windows box. Under the
hypothesis that he is accidentally or otherwise reading somebody else's source
files as data, it could be any encoding. In one package with which I'm familiar,
the encoding is declared as cp1251 in every .py file; AFAICT the only file with
non-ASCII characters is an example script containing his wife's name!

The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 --
admittedly all as implausible as the latin1 control character.

 Looking at a character map, 0x9d is a control character in latin1, so the page
is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but
that isn't as common as UTF-8.

Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL
LETTER Y WITH ACUTE) in the OP's report 
query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just
went up a notch.

The preceding character U+00BB (which looks like ») doesn't cause an exception
because 0xBB, unlike 0x9D, is defined in cp1252.
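
Both observations can be checked in a few lines (a quick sketch):

# 'Ý' (U+00DD) encodes to the byte pair C3 9D in UTF-8,
# so the reported 0x9d fits a UTF-8 reading of '»Ý'.
print('\xdd'.encode('utf-8'))                 # b'\xc3\x9d'
print(b'\xc2\xbb\xc3\x9d'.decode('utf-8'))    # '»Ý'

# Under cp1252, 0xBB is defined ('»') but 0x9D is not, hence the original error.
print(b'\xbb'.decode('cp1252'))               # '»'
try:
    b'\x9d'.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)    # 'charmap' codec can't decode byte 0x9d ... character maps to <undefined>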

Curiously, looking at the \u escape sequences: \u2021 is double dagger and
\u201a is single low-9 quotation mark ... what appears to be the value part
of an item in a hard-coded dictionary is about as comprehensible as the
Voynich manuscript.

Trouble with cases like this is as soon as they become interesting, the OP often
snatches somebody's one-liner that works (i.e. doesn't raise an exception),
makes a quick break for the county line, and they're not seen again :-)

Cheers,
John


--
http://mail.python.org/mailman/listinfo/python-list