[issue28246] Unable to read simple text file

2016-09-22 Thread Eryk Sun

Eryk Sun added the comment:

Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 
0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the 
encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 
produces nonsense, otherwise known as mojibake. It happens that codepage 1251 
maps every one of the 256 possible byte values, except for 0x98 (152). The 
exception can't be made any clearer.
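
As a minimal sketch of the point above (not part of the original report), the three letters mentioned in the thread can be encoded as UTF-8 and decoded as codepage 1251. Each letter becomes the byte 0xD0 followed by 0x97, 0x98 or 0x9A; only 0x98 has no mapping, which is why И raises while З and К silently turn into two unrelated characters. The variable names are illustrative only.

```python
# Encode each letter as UTF-8, then (wrongly) decode the bytes as cp1251.
for letter in 'ЗИК':
    raw = letter.encode('utf-8')
    try:
        # ascii() shows the mojibake as escapes, independent of the console encoding.
        print(raw, '->', ascii(raw.decode('cp1251')))
    except UnicodeDecodeError as exc:
        print(raw, '->', exc)

# Collect every byte value that cp1251 cannot decode at all.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode('cp1251')
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])  # ['0x98']
```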

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28246] Unable to read simple text file

2016-09-22 Thread AndreyTomsk

AndreyTomsk added the comment:

Thanks for the quick reply. I'm new to Python; I just used the tutorial docs and 
didn't read carefully enough to notice the encoding info.

Still, IMHO the behaviour is not consistent. Of three sequential letters of the 
Russian alphabet (З, И, К), it crashes on И and displays the others in two-byte form.

--




[issue28246] Unable to read simple text file

2016-09-22 Thread SilentGhost

SilentGhost added the comment:

It would be good to add a FAQ / HowTo entry for this question.

--
nosy: +SilentGhost




[issue28246] Unable to read simple text file

2016-09-22 Thread Eryk Sun

Eryk Sun added the comment:

The default encoding on your system is Windows codepage 1251. However, your 
file is encoded using UTF-8:

>>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
>>> print(*lines, sep='\n')
b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
b'\xd0\x98 (cyrillic I) <<< line read fails'
b'\xd0\x91 (cyrillic B)'

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding 
to built-in open():

>>> print(open('ResourceStrings.rc', encoding='utf-8').read())
А (cyrillic A)
И (cyrillic I) <<< line read fails
Б (cyrillic B)
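
One detail the call above leaves open: with encoding='utf-8' the BOM is decoded as a leading U+FEFF character in the text. The following is a minimal sketch, assuming a small temporary file (a hypothetical stand-in for ResourceStrings.rc), that compares 'utf-8' with the standard 'utf-8-sig' codec, which drops a leading BOM if one is present.

```python
import os
import tempfile

# Hypothetical stand-in for ResourceStrings.rc: UTF-8 text with a leading BOM.
data = b'\xef\xbb\xbf\xd0\x90 (cyrillic A)\n\xd0\x98 (cyrillic I)\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.rc')
    with open(path, 'wb') as f:
        f.write(data)

    # Plain UTF-8 keeps the BOM as the first character of the text.
    with open(path, encoding='utf-8') as f:
        with_bom = f.read()

    # 'utf-8-sig' strips a leading BOM while decoding.
    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.read()

print(ascii(with_bom[:1]))     # '\ufeff'
print(ascii(without_bom[:1]))  # '\u0410'
```

Whether the leading U+FEFF matters depends on what the script does with the first line; for plain printing it is usually invisible.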

--
nosy: +eryksun
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue28246] Unable to read simple text file

2016-09-22 Thread AndreyTomsk

New submission from AndreyTomsk:

The file read operation fails when it encounters a specific Cyrillic symbol. 
Tested with this script:

testFile = open('ResourceStrings.rc', 'r')
for line in testFile:
    print(line)


Exception message:
Traceback (most recent call last):
  File "min_test.py", line 6, in <module>
    for line in testFile:
  File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
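
The traceback passes through encodings\cp1251.py because no encoding was given to open(). As a hedged sketch, assuming the behaviour documented for the Python versions in this report (text-mode open() falls back to locale.getpreferredencoding(False)), the following prints the encoding a given machine would pick. The scratch file name is hypothetical, and newer Python releases or UTF-8 mode may choose differently.

```python
import locale
import os
import tempfile

# Encoding reported by the locale; on the reporter's machine this would be cp1251.
print(locale.getpreferredencoding(False))

# Hypothetical scratch file, opened as in the report: no encoding argument.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scratch.txt')
    with open(path, 'w') as f:
        chosen = f.encoding  # the encoding open() actually selected

print(chosen)
```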

--
components: IO, Unicode, Windows
files: ResourceStrings.rc
messages: 277206
nosy: AndreyTomsk, ezio.melotti, haypo, paul.moore, steve.dower, tim.golden, 
zach.ware
priority: normal
severity: normal
status: open
title: Unable to read simple text file
type: behavior
versions: Python 3.5, Python 3.6
Added file: http://bugs.python.org/file44783/ResourceStrings.rc
