[issue24117] Wrong range checking in GB18030 decoder.

2017-04-07 Thread Ma Lin

Ma Lin added the comment:

I closed this issue because it covered too many separate problems.

1. For the GB18030 decoder bug, see issue29990.
2. For the hz encoder bug, see issue30003.
3. For the problems in the Traditional Chinese codecs, please create a new issue.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24117] Wrong range checking in GB18030 decoder.

2017-04-04 Thread Ma Lin

Changes by Ma Lin:


--
stage: patch review -> resolved
status: open -> closed




[issue24117] Wrong range checking in GB18030 decoder.

2016-11-14 Thread Mingye Wang

Mingye Wang added the comment:

Just FYI, cp950 0xC6A1 (\uf6b1) is found in the current WindowsBestFit table:
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt

--
nosy: +Artoria2e5




[issue24117] Wrong range checking in GB18030 decoder.

2016-01-02 Thread Ezio Melotti

Ezio Melotti added the comment:

Did you hear anything back from them?

--
versions: +Python 3.6 -Python 3.4




[issue24117] Wrong range checking in GB18030 decoder.

2016-01-02 Thread Ma Lin

Ma Lin added the comment:

I posted in a Taiwanese forum: https://groups.google.com/forum/#!forum/pythontw
No reply yet.




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-19 Thread Ma Lin

Ma Lin added the comment:

> If you could provide links to the relevant pages/section we can verify that
> the codecs are indeed incorrect.

Here is the CP950 table; 0xC6A1 is not in it.
https://msdn.microsoft.com/zh-cn/goglobal/cc305155

I can provide one link, but there are many variants of the BIG5 conversion
table on the Internet, so a single link is not persuasive.

This page lists many variants of the BIG5 table: https://moztw.org/docs/big5/
I found the mapping 0xC6A1 -> U+30FE in the table labeled "Unicode 1.1". The
page describes it as a terrible table with many errors that, sadly, many
foreigners are using. IIRC, though, Python's BIG5 codec is not fully identical
to that table.

IMO, the most reliable way is to read a lot of material and verify the key
points and conflicts against authoritative sources, but this is very hard for
foreigners.

Anyway, let's wait for the Taiwanese community's opinion on whether this should
be fixed.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24117
___



[issue24117] Wrong range checking in GB18030 decoder.

2015-05-18 Thread Ma Lin

Ma Lin added the comment:

This is not a de facto standard; it should be fixed.
I have already posted this information to a Taiwanese Python community; let's
wait for their review.




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-18 Thread Ma Lin

Ma Lin added the comment:

> I examined all Chinese codecs
I said that above, but I forgot that Taiwan and Hong Kong use Chinese as well.

BIG5 and CP950 use a wrong conversion table. Test this:
>>> u = b'\xC6\xA1'.decode('big5')
>>> hex(ord(u))
'0x30fe'

This should not happen: 0xC6A1 is in neither BIG5 nor CP950.
In BIG5-2003 and HKSCS-2008, 0xC6A1 is mapped to U+2460.

I only took a rough look, so please check further.
I won't check the Hong Kong codec any further; I suggest that someone check it
as well.
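
A minimal sketch for repeating the check above across the related codecs. The
helper name `decode_with` is made up for illustration; `big5`, `cp950` and
`big5hkscs` are the codec names shipped with CPython. The result depends on the
Python version in use, so no expected output is shown.

```python
def decode_with(codecs_to_try, raw):
    """Decode `raw` with each codec; map codec name -> 'U+XXXX' or None."""
    results = {}
    for name in codecs_to_try:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # The byte sequence is not mapped by this codec.
            results[name] = None
        else:
            results[name] = " ".join(f"U+{ord(ch):04X}" for ch in text)
    return results


# 0xC6A1 is the byte pair discussed in this message.
results = decode_with(["big5", "cp950", "big5hkscs"], b"\xC6\xA1")
for name, value in results.items():
    print(f"{name:10} {value}")
```

A codec that reports None rejects the byte pair; any other value is the code
point it maps to, which can then be compared against the tables cited in this
thread.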




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-18 Thread Ezio Melotti

Ezio Melotti added the comment:

> The data come from ICU, Unicode.org, IBM,

If you could provide links to the relevant pages/sections, we can verify that
the codecs are indeed incorrect.  Also keep in mind that there might be people
relying on this incorrect behavior, so we have to be careful when changing it.
This is especially true if there are de facto standards that diverge from the
actual standards.




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Changes by Ma Lin wjss...@sohu.com:


Added file: http://bugs.python.org/file39319/forpy34.patch




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Changes by Ma Lin wjss...@sohu.com:


Added file: http://bugs.python.org/file39320/forpy35.patch




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Changes by Ma Lin wjss...@sohu.com:


Removed file: http://bugs.python.org/file39278/forpy3.patch




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Changes by Ma Lin wjss...@sohu.com:


Removed file: http://bugs.python.org/file39277/forpy27.patch




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Ma Lin added the comment:

I examined all the Chinese codecs. Here are the patches; please review them and
feel free to ask me any questions.

Thanks to Hye-Shik; your framework is very easy to understand :)

--
Added file: http://bugs.python.org/file39318/forpy27.patch




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ezio Melotti

Ezio Melotti added the comment:

Do you have authoritative links that describe these standards?




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-07 Thread Ma Lin

Ma Lin added the comment:

Good question.

GB2312:
I tested those programming languages one by one.

GBK/CP936/GB18030-2000:
I gathered as much data from the Internet as I could, then compared it to
Python 3's codecs. I checked the key points against authoritative sources and
verified every conflict that appeared.
The data come from ICU, Unicode.org, IBM, Chinese researchers, and data found
via Google.

I spent about half a month on this; I did not start just a few days ago. I hope
these descriptions will help latecomers.
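
A rough sketch of the kind of comparison described above, not the actual
procedure used for the patches. It assumes the reference data has already been
loaded into a dict mapping byte sequences to the expected characters; the name
`find_conflicts` and the tiny sample table are made up for illustration.

```python
def find_conflicts(reference, codec):
    """Return (raw, expected, actual) for every entry the codec disagrees with.

    `reference` maps a byte sequence to the character the table says it
    should decode to. `actual` is None when the codec rejects the bytes.
    """
    conflicts = []
    for raw, expected in sorted(reference.items()):
        try:
            actual = raw.decode(codec)
        except UnicodeDecodeError:
            actual = None
        if actual != expected:
            conflicts.append((raw, expected, actual))
    return conflicts


# Hypothetical two-entry table; a real one would come from a published
# mapping file and contain thousands of entries.
sample = {b"A": "A", b"B": "B"}
print(find_conflicts(sample, "gbk"))
```

Each reported conflict still has to be checked by hand against an authoritative
source, since the reference table itself may be the one that is wrong.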




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-04 Thread Ma Lin

Ma Lin added the comment:

I found another bug, this time in the hz codec.
The hz encoding uses 7-bit ASCII to represent Chinese characters; it was
popular on USENET in the late 1980s and early 1990s.

I will do more checks and fix the bugs together, then invite you to review
the patch.


u = 'hi~python'
b = u.encode('hz')   # bug in this step; the correct sequence is b'hi~~python'
print(b)             # the output is b'hi~python'

u = b.decode('hz')   # so it cannot be decoded; UnicodeDecodeError is raised
print(u)
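
A small sketch of a round-trip check for this case. In HZ (RFC 1843) a literal
'~' in ASCII mode is written as '~~'. The helper names `hz_escape_ascii` and
`roundtrips` are made up for illustration, and the result of `roundtrips`
depends on whether the Python in use already has the encoder fix.

```python
def hz_escape_ascii(text):
    """Escape an ASCII-only string as HZ expects: each '~' becomes '~~'."""
    return text.encode("ascii").replace(b"~", b"~~")


def roundtrips(text, codec="hz"):
    """Return True if encoding and then decoding gives back the same text."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return False


print(hz_escape_ascii("hi~python"))   # the sequence the encoder should emit
print(roundtrips("hi~python"))        # False on an affected Python
```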




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-03 Thread Ma Lin

Changes by Ma Lin wjss...@sohu.com:


--
title: A small bug in GB18030 decoder. -> Wrong range checking in GB18030 decoder.




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-03 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +lemburg, loewis, serhiy.storchaka
stage:  -> patch review




[issue24117] Wrong range checking in GB18030 decoder.

2015-05-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Adding Hye-Shik, who wrote the codec.

--
nosy: +hyeshik.chang
