Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
> (1) what is produced on Anjanesh's machine

    sys.getdefaultencoding()
    'utf-8'

> (2) it looks like a small snippet from a Python source file!

It's a file containing just JSON data, but it has some Unicode characters as well, since it has data from the web.

> Anjanesh, Is it a .py file

It's a .json file. I have a bunch of these JSON files which I'm parsing using the json library.

> Instead of "something like", please report exactly what is there:
> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))

    print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
    b':42,query:0 1\xc2\xbb\xc3\x9d \\u2021 0\\u201a0 \\u2'

> Trouble with cases like this is as soon as they become interesting, the OP often snatches somebody's one-liner that works (i.e. doesn't raise an exception), makes a quick break for the county line, and they're not seen again :-)

Actually, I moved the files to my Ubuntu PC, which has Python 2.5.2, and it didn't give the encoding issue. I just couldn't spend that much time on why a couple of these files had encoding issues in Py3, since I had to parse a whole lot of files.

--
http://mail.python.org/mailman/listinfo/python-list
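For the JSON-parsing case described above, a minimal sketch of naming the encoding explicitly when opening the file. It assumes the files are UTF-8, which the reported byte dump (\xc2\xbb\xc3\x9d) suggests but does not prove; the file contents and key names below are hypothetical stand-ins, not the OP's data.

```python
import json
import os
import tempfile

# Hypothetical stand-in for one of the .json files: UTF-8 bytes containing
# U+00BB and U+00DD, the two characters seen in the reported byte dump.
raw = b'{"id": 42, "query": "0 1\xc2\xbb\xc3\x9d"}'

with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Name the encoding explicitly instead of relying on the platform default.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
finally:
    os.remove(path)

print(ascii(data["query"]))
```

If the files turn out to be in some other encoding, the same call with a different `encoding=` value applies; the point is only that the choice is made in the code rather than by the platform.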
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
I'm reading a file, but there seems to be some encoding error.

    f = open(filename)
    data = f.read()
    Traceback (most recent call last):
      File "<pyshell#2>", line 1, in <module>
        data = f.read()
      File "C:\Python30\lib\io.py", line 1724, in read
        decoder.decode(self.buffer.read(), final=True))
      File "C:\Python30\lib\io.py", line 1295, in decode
        output = self.decoder.decode(input, final=final)
      File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

The string at position 10442 is something like this:

    query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

So what encoding value am I supposed to give? I tried

    f = open(filename, encoding="cp1252")

but still got the same error. I guess Python 3 auto-detects it as cp1252.

--
Anjanesh Lekshmnarayanan
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
On Thu, Jan 29, 2009 at 11:24 AM, Anjanesh Lekshminarayanan m...@anjanesh.net wrote:

> I'm reading a file, but there seems to be some encoding error.
>
>     f = open(filename)
>     data = f.read()
>     Traceback (most recent call last):
>       File "<pyshell#2>", line 1, in <module>
>         data = f.read()
>       File "C:\Python30\lib\io.py", line 1724, in read
>         decoder.decode(self.buffer.read(), final=True))
>       File "C:\Python30\lib\io.py", line 1295, in decode
>         output = self.decoder.decode(input, final=final)
>       File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
>         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
>     UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
>
> The string at position 10442 is something like this:
>
>     query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,
>
> So what encoding value am I supposed to give? I tried
> f = open(filename, encoding="cp1252")
> but still got the same error. I guess Python 3 auto-detects it as cp1252.

It does auto-detect it as cp1252: look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.

> --
> Anjanesh Lekshmnarayanan
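The suggestion above can be tried on a two-byte sample before touching the real file. This is only a sketch: it assumes, without confirmation at this point in the thread, that the offending 0x9d is the second byte of a UTF-8 sequence (0xC3 0x9D).

```python
# Assumed sample: the UTF-8 encoding of U+00DD is the byte pair C3 9D.
raw = b"\xc3\x9d"

try:
    raw.decode("cp1252")
    failed_at = None
except UnicodeDecodeError as exc:
    # cp1252 maps 0xC3 but has no character assigned to 0x9D.
    failed_at = exc.start
    print(exc)

# The same bytes decode cleanly as UTF-8, giving a single character.
utf8_text = raw.decode("utf-8")
print(ascii(utf8_text))
```

A clean decode does not by itself prove the encoding is right; it only rules out the ones that fail.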
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
> It does auto-detect it as cp1252: look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.

Thanks a lot! utf-8 and latin1 were accepted!
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan m...@anjanesh.net wrote:

>> It does auto-detect it as cp1252: look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.
>
> Thanks a lot! utf-8 and latin1 were accepted!

If you want to read the file as text, find out which encoding it actually is. In one of those encodings, you'll probably see some nonsense characters. If you are just looking at the file as a sequence of bytes, open the file in binary mode rather than text. That way, you'll avoid this issue altogether (just make sure you use byte strings instead of unicode strings).
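A small sketch of the binary-mode route mentioned above. The payload is a made-up sample, not the OP's file; the point is that opening with "rb" returns bytes with no decoding step, so searches must use byte strings as well.

```python
import os
import tempfile

# Hypothetical sample content containing the byte 0x9d.
payload = b"query:0 1\xc2\xbb\xc3\x9d"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: no codec is involved, so no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        blob = f.read()
finally:
    os.remove(path)

# Byte strings (b"...") are needed when searching or comparing bytes.
offset = blob.find(b"\x9d")
print(type(blob).__name__, offset)
```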
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
Anjanesh Lekshminarayanan mail at anjanesh.net writes:

>> It does auto-detect it as cp1252: look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.
>
> Thanks a lot! utf-8 and latin1 were accepted!

Just so you know, latin-1 can decode any sequence of bytes, so it will always work even if that's not the real encoding.
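The point about latin-1 can be checked directly. The sketch below decodes all 256 byte values, then shows what happens when UTF-8 bytes are read as latin-1; the two-character sample string is an assumption chosen for illustration.

```python
# Every possible byte value decodes under latin-1, one character per byte.
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")
round_trip = text.encode("latin-1")

# Assumed sample: two characters encoded as UTF-8 (four bytes).
original = "\u00bb\u00dd"
utf8_bytes = original.encode("utf-8")

# latin-1 "accepts" these bytes, but yields four characters instead of two.
misread = utf8_bytes.decode("latin-1")
print(len(text), ascii(misread))
```

No exception is raised in either case, which is why success with latin-1 says nothing about the file's real encoding.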
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
Benjamin Kaplan bsk16 at case.edu writes:

> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan mail at anjanesh.net wrote:
>
>>> It does auto-detect it as cp1252- look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.

Benjamin, "auto-detect" has strong connotations of the open() call (with mode including text and encoding not specified) reading some/all of the file and trying to guess what the encoding might be -- a futile pursuit and not what the docs say:

    encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings

On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It would be interesting to know
(1) what is produced on Anjanesh's machine
(2) how the default encoding is derived (I would have thought I was a prime candidate for 'cp1252')
(3) whether the 'default encoding' of open() is actually the same as the 'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs don't say so.

>> Thanks a lot ! utf-8 and latin1 were accepted !

Benjamin and Anjanesh, please understand that any_random_rubbish.decode('latin1') will be accepted. This is *not* useful information to be greeted with thanks and exclamation marks. It is merely a by-product of the fact that *any* single-byte character set like latin1 that uses all 256 possible bytes cannot fail, by definition; no character maps to undefined.

> If you want to read the file as text, find out which encoding it actually is. In one of those encodings, you'll probably see some nonsense characters. If you are just looking at the file as a sequence of bytes, open the file in binary mode rather than text. That way, you'll avoid this issue all together (just make sure you use byte strings instead of unicode strings).

In fact, inspection of Anjanesh's report:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
    The string at position 10442 is something like this :
    query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

draws two observations:
(1) there is nothing in the reported string that can be unambiguously identified as corresponding to 0x9d
(2) it looks like a small snippet from a Python source file!

Anjanesh, is it a .py file? If so, is there something like # encoding: cp1252 or # encoding: utf-8 near the start of the file? *Please* tell us what sys.getdefaultencoding() returns on your machine.

Instead of "something like", please report exactly what is there:

    print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))

Cheers,
John
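The one-liner requested above reads the whole file just to slice 41 bytes out of it. A sketch of the same inspection as a small helper that seeks to the region instead; the function name and its parameters are hypothetical, and the sample file is made up for the demonstration.

```python
import os
import tempfile


def bytes_around(path, pos, before=20, after=21):
    """Return the raw bytes in [pos - before, pos + after), clamped at 0.

    Hypothetical helper; mirrors the slice [pos-20:pos+21] from the thread.
    """
    start = max(pos - before, 0)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(pos + after - start)


# Made-up sample: the byte 0x9d sits at offset 30.
sample = b"a" * 30 + b"\x9d" + b"b" * 30
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    window = bytes_around(path, 30)
finally:
    os.remove(path)

# ascii() keeps non-ASCII bytes visible as \x escapes in the report.
print(ascii(window))
```

Reporting the `ascii()` form matters because it survives mail clients and terminals that would otherwise re-encode or drop the interesting bytes.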
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
On Thu, Jan 29, 2009 at 4:19 PM, John Machin sjmac...@lexicon.net wrote:

> Benjamin Kaplan bsk16 at case.edu writes:
>
>> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan mail at anjanesh.net wrote:
>>
>>>> It does auto-detect it as cp1252- look at the files in the traceback and you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong encoding, try opening it as utf-8 or latin1 and see if that fixes it.
>
> Benjamin, "auto-detect" has strong connotations of the open() call (with mode including text and encoding not specified) reading some/all of the file and trying to guess what the encoding might be -- a futile pursuit and not what the docs say:
>
>     encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings
>
> On my machine [Windows XP SP3] sys.getdefaultencoding() returns 'utf-8'. It would be interesting to know
> (1) what is produced on Anjanesh's machine
> (2) how the default encoding is derived (I would have thought I was a prime candidate for 'cp1252')
> (3) whether the 'default encoding' of open() is actually the same as the 'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs don't say so.

First of all, you're right that might be confusing. I was thinking of auto-detect as in check the platform and locale and guess what they usually use. I wasn't thinking of it like the web browsers use it. I think it uses locale.getpreferredencoding(). On my machine, I get sys.getpreferredencoding() == 'utf-8' and locale.getdefaultencoding()== 'cp1252'. When I open a file without specifying the encoding, it's cp1252.

>>> Thanks a lot ! utf-8 and latin1 were accepted !
>
> Benjamin and Anjanesh, please understand that any_random_rubbish.decode('latin1') will be accepted. This is *not* useful information to be greeted with thanks and exclamation marks. It is merely a by-product of the fact that *any* single-byte character set like latin1 that uses all 256 possible bytes cannot fail, by definition; no character maps to undefined.

If you check my response to Anjanesh's comment, I mentioned that he should either find out which encoding it is in particular or he should open the file in binary mode. I suggested utf-8 and latin1 because those are the most likely candidates for his file since cp1252 was already excluded. Looking at a character map, 0x9d is a control character in latin1, so the page is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but that isn't as common as UTF-8.

>> If you want to read the file as text, find out which encoding it actually is. In one of those encodings, you'll probably see some nonsense characters. If you are just looking at the file as a sequence of bytes, open the file in binary mode rather than text. That way, you'll avoid this issue all together (just make sure you use byte strings instead of unicode strings).
>
> In fact, inspection of Anjanesh's report:
>
>     UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
>     The string at position 10442 is something like this :
>     query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,
>
> draws two observations:
> (1) there is nothing in the reported string that can be unambiguously identified as corresponding to 0x9d
> (2) it looks like a small snippet from a Python source file!
>
> Anjanesh, is it a .py file? If so, is there something like # encoding: cp1252 or # encoding: utf-8 near the start of the file? *Please* tell us what sys.getdefaultencoding() returns on your machine.
>
> Instead of "something like", please report exactly what is there:
>
>     print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>
> Cheers,
> John
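The two "default encodings" discussed above can be compared on any machine. A minimal sketch: `sys.getdefaultencoding()` is the default used by `str.encode()` and `bytes.decode()`, `locale.getpreferredencoding()` reports the locale's encoding, and the `.encoding` attribute of a text file opened without an `encoding=` argument shows what `open()` actually picked. The values for the last two are platform dependent, so none are shown here.

```python
import locale
import sys
import tempfile

# Default for str.encode() / bytes.decode(); 'utf-8' on Python 3.
str_default = sys.getdefaultencoding()

# What the locale settings suggest (False: do not call setlocale()).
locale_default = locale.getpreferredencoding(False)

# What open() actually chose for a text file when no encoding was given.
with tempfile.TemporaryFile("w+") as f:
    open_default = f.encoding

print("sys.getdefaultencoding():     ", str_default)
print("locale.getpreferredencoding():", locale_default)
print("text file .encoding:          ", open_default)
```

On a Windows machine like the ones in this thread, the first line would be expected to differ from the other two, which is the distinction the question in point (3) was after.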
Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to undefined
Benjamin Kaplan benjamin.kaplan at case.edu writes:

> First of all, you're right that might be confusing. I was thinking of auto-detect as in check the platform and locale and guess what they usually use. I wasn't thinking of it like the web browsers use it. I think it uses locale.getpreferredencoding().

You're probably right. I'd forgotten about locale.getpreferredencoding(). I'll raise a request on the bug tracker to get some more precise wording in the open() docs.

> On my machine, I get sys.getpreferredencoding() == 'utf-8' and locale.getdefaultencoding()== 'cp1252'.

sys <-> locale ... +1 long-range transposition typo of the year :-)

> If you check my response to Anjanesh's comment, I mentioned that he should either find out which encoding it is in particular or he should open the file in binary mode. I suggested utf-8 and latin1 because those are the most likely candidates for his file since cp1252 was already excluded.

The OP is on a Windows machine. His file looks like a source code file. He is unlikely to be creating latin1 files himself on a Windows box. Under the hypothesis that he is accidentally or otherwise reading somebody else's source files as data, it could be any encoding. In one package with which I'm familiar, the encoding is declared as cp1251 in every .py file; AFAICT the only file with non-ASCII characters is an example script containing his wife's name! The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and 1257 -- admittedly all as implausible as the latin1 control character.

> Looking at a character map, 0x9d is a control character in latin1, so the page is probably UTF-8 encoded. Thinking about it now, it could also be MacRoman but that isn't as common as UTF-8.

Late breaking news: I presume you can see two instances of U+00DD (LATIN CAPITAL LETTER Y WITH ACUTE) in the OP's report

    query:0 1»Ý \u2021 0\u201a0 \u2021»Ý,

Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for utf8 just went up a notch. The preceding character U+00BB (looks like >>) doesn't cause an exception because 0xBB, unlike 0x9D, is defined in cp1252.

Curiously, looking at the \u escape sequences: \u2021 is double dagger, \u201a is single low-9 quotation mark ... what appears to be the value part of an item in a hard-coded dictionary is about as comprehensible as the Voynich manuscript.

Trouble with cases like this is as soon as they become interesting, the OP often snatches somebody's one-liner that works (i.e. doesn't raise an exception), makes a quick break for the county line, and they're not seen again :-)

Cheers,
John
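The code-page claims above are easy to check with the codecs that ship with Python. A sketch in Python 3 syntax: it tries the single byte 0x9d against each candidate encoding named in the thread and records which ones define it, then repeats the U+00DD encoding check. The codec names are the standard-library aliases.

```python
byte = b"\x9d"
candidates = ["cp1250", "cp1251", "cp1252", "cp1256", "cp1257",
              "latin-1", "mac_roman"]

# Map each codec name to the decoded character, or None if 0x9d is undefined.
defined = {}
for name in candidates:
    try:
        defined[name] = byte.decode(name)
    except UnicodeDecodeError:
        defined[name] = None

for name, char in defined.items():
    print(name, "undefined" if char is None else ascii(char))

# Python 3 spelling of u'\xdd'.encode('utf8') from the message above.
y_acute_utf8 = "\xdd".encode("utf-8")
print(y_acute_utf8)
```

Of the code pages tried, only cp1252 rejects the byte, which is consistent with the original traceback.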