[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2016-02-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 
encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts that the source code really is 
in the declared encoding when you use utf-8 or iso-8859-1, and therefore skips 
the usual "decode into unicode, then encode to utf-8" step.

From a purist point of view, you are right, Python should always pass through 
those steps to detect encoding errors, but from a practical point of view, I 
think the optimization is fine.
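
The shortcut described above can be sketched in Python. This is only a model 
of the behaviour discussed in this thread, not the tokenizer's C code: 
prepare_source and SHORTCUT_NAMES are hypothetical names, and the set of 
shortcut spellings is assumed from the messages here.

```python
# Hypothetical model of the tokenizer shortcut discussed in this thread.
SHORTCUT_NAMES = {"utf-8", "iso-8859-1"}  # assumed set, taken from the thread


def prepare_source(raw, declared):
    """Return the bytes handed to the parser for a given coding declaration."""
    if declared in SHORTCUT_NAMES:
        # Shortcut: trust the declaration and pass the bytes through untouched.
        return raw
    # General path: decode with the declared codec, then re-encode as UTF-8.
    # Input that is invalid for the codec raises UnicodeDecodeError here.
    return raw.decode(declared).encode("utf-8")


gbk_bytes = u"\u4e2d\u6587".encode("gbk")  # the two characters from the report

# Declared as "utf-8": nothing is validated, the GBK bytes pass unchanged.
passed_through = prepare_source(gbk_bytes, "utf-8")

# Declared as "utf8": the codec machinery runs and rejects the bytes.
try:
    prepare_source(gbk_bytes, "utf8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

In this model the only difference between the two spellings is whether the 
codec is ever asked to look at the bytes, which is the trade-off between 
speed and error detection weighed above.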

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2016-02-10 Thread Jim Jewett

Jim Jewett added the comment:

Does (did?) the utf8 special case allow for a much faster startup time, by not 
requiring all of the codecs machinery?

--
nosy: +Jim.Jewett




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2016-02-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I think the correct fix is not to add "utf8" to the special cases, but to 
remove "utf-8" from them. Here is a patch.

--
components: +Interpreter Core
stage:  -> patch review
type:  -> behavior
Added file: http://bugs.python.org/file41879/bad_utf8.patch




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-27 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 27.12.2015 02:05, Serhiy Storchaka wrote:
> 
>> I wonder why this does not trigger the exception.
> 
> Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps 
> are omitted.
>
> In the general case the input is decoded from the specified encoding and then 
> encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings 
> the parser gets the raw data.

Right, but since the tokenizer doesn't know about "utf8" it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).

There are a few other aliases for UTF-8 which would likely trigger
the same problem:

    # utf_8 codec
    'u8'         : 'utf_8',
    'utf'        : 'utf_8',
    'utf8'       : 'utf_8',
    'utf8_ucs2'  : 'utf_8',
    'utf8_ucs4'  : 'utf_8',
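
As an illustration, the codec registry can be asked how it resolves each of 
these spellings. The sketch below uses only codecs.lookup(); the list name 
UTF8_ALIASES is made up for the example, and it says nothing about how the 
tokenizer itself treats these names.

```python
import codecs

# The aliases quoted above (from the utf_8 section of the alias table).
UTF8_ALIASES = ["u8", "utf", "utf8", "utf8_ucs2", "utf8_ucs4"]

# codecs.lookup() normalizes the name and consults the alias table,
# so every spelling should end up at the same codec.
resolved = {alias: codecs.lookup(alias).name for alias in UTF8_ALIASES}

for alias in UTF8_ALIASES:
    print("%-10s -> %s" % (alias, resolved[alias]))
```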

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Please fold these cases into one:

    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0)
        return "utf-8";
    else if (strcmp(buf, "utf8") == 0 ||
             strncmp(buf, "utf8-", 5) == 0)
        return "utf-8";

->

    /* Prefix lengths: "utf-8-" is 6 characters, "utf8-" is 5. */
    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0 ||
        strcmp(buf, "utf8") == 0 ||
        strncmp(buf, "utf8-", 5) == 0)
        return "utf-8";
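
The folded check can also be modelled in Python. This is a sketch only: 
normal_name is a hypothetical helper, the case folding and the "_" to "-" 
replacement are assumptions about how the name is prepared before the 
comparison, and any other special cases in the real code are left out.

```python
def normal_name(declared):
    """Map spellings of UTF-8 to the canonical "utf-8" (hypothetical helper)."""
    # Assumption: compare case-insensitively and treat "_" like "-".
    enc = declared.lower().replace("_", "-")
    # One folded condition covering both the "utf-8" and "utf8" spellings,
    # with or without a trailing "-suffix".
    if enc in ("utf-8", "utf8") or enc.startswith(("utf-8-", "utf8-")):
        return "utf-8"
    # Anything else is returned unchanged and left to the codec registry.
    return declared


for name in ("utf-8", "utf8", "UTF_8", "utf8-unix", "gbk"):
    print("%-10s -> %s" % (name, normal_name(name)))
```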

Also: I wonder why the regular utf_8.py codec doesn't complain about this case, 
since the above are only shortcuts for frequently used source code encodings.

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread 王杰

王杰 added the comment:

Python 2.7

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread STINNER Victor

STINNER Victor added the comment:

> Here is a fix with a patch.

Oops, I meant 'with a unit test', sorry ;-)

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread 王杰

王杰 added the comment:

I'm learning about Python's encoding rules, and I wrote it as a test case.

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread STINNER Victor

STINNER Victor added the comment:

> I have a file "gbk-utf-8.py" and its encoding is GBK.

I don't understand why you use "# coding: utf-8" if the file is encoded in GBK. 
Why not use "# coding: gbk"?

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 26.12.2015 22:46, STINNER Victor wrote:
> 
> In Python, there are multiple implementations of the utf-8 codec with many
> shortcuts. I'm not surprised to see bugs depending on the exact syntax of
> the utf-8 codec name. Maybe we need to share even more code to normalize
> and compare codec names. (I think that py3 is better than py2 on this part.)

There's only one implementation (the one in unicodeobject.c), which is used
directly or via the wrapper in the encodings package, but there
are a few shortcuts to bypass the codec registry scattered around
the code since UTF-8 is such a commonly used codec.

In the case in question, the codec registry should trigger decoding
via the encodings package (rather than going directly to C APIs),
so will eventually end up using the same code. I wonder why this does not
trigger the exception.

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> I wonder why this does not trigger the exception.

Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps 
are omitted.

In the general case the input is decoded from the specified encoding and then 
encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the 
parser gets the raw data.
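
For a file whose declaration matches its contents, the general path amounts to 
the round trip below. It is a small illustration under the assumption that the 
file is GBK-encoded and declared as gbk; the variable names are made up for the 
example.

```python
text = u"\u4e2d\u6587"           # the string literal from the original report
gbk_source = text.encode("gbk")  # bytes as stored in a GBK-encoded file

# General case: decode from the declared encoding, then encode to UTF-8
# so that the parser always sees UTF-8.
for_parser = gbk_source.decode("gbk").encode("utf-8")

print(repr(gbk_source))
print(repr(for_parser))
```

The bytes differ before and after the round trip, which is why skipping it is 
only safe when the input already is valid UTF-8.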

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-26 Thread STINNER Victor

STINNER Victor added the comment:

In Python, there are multiple implementations of the utf-8 codec with many
shortcuts. I'm not surprised to see bugs depending on the exact syntax of
the utf-8 codec name. Maybe we need to share even more code to normalize
and compare codec names. (I think that py3 is better than py2 on this part.)

--




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-25 Thread Terry J. Reedy

Terry J. Reedy added the comment:

What Python version?

--
nosy: +terry.reedy




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-25 Thread Terry J. Reedy

Changes by Terry J. Reedy :


--
nosy: +haypo




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-25 Thread Terry J. Reedy

Changes by Terry J. Reedy :


--
nosy: +doerwalter, lemburg




[issue25937] DIfference between utf8 and utf-8 when i define python source code encoding.

2015-12-23 Thread 王杰

New submission from 王杰:

I use CentOS 7.0 and have changed LANG to gbk.

I have a file "gbk-utf-8.py" and its encoding is GBK.

# -*- coding:utf-8 -*-
import chardet
if __name__ == '__main__':
    s = '中文'
    print s, chardet.detect(s)

I execute it and everything is OK. However, it raises "SyntaxError" (as I 
expected) after I change "coding:utf-8" to "coding:utf8".

  File "gbk-utf8.py", line 2
SyntaxError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

Is this expected? Or where am I wrong?
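
The codec error in the message above can be reproduced outside the tokenizer. 
The sketch below assumes the failure comes from decoding the GBK bytes of the 
string literal as UTF-8; it inspects the exception rather than running the 
script from the report, and the variable names are made up for the example.

```python
gbk_source = u"\u4e2d\u6587".encode("gbk")  # bytes of the literal on disk

err = None
try:
    gbk_source.decode("utf8")
except UnicodeDecodeError as exc:
    err = exc

if err is not None:
    # Offending byte, its offset in the input, and the codec's reason.
    bad_byte = bytearray(err.object)[err.start]
    print("byte 0x%02x at position %d: %s" % (bad_byte, err.start, err.reason))
```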

--
messages: 256952
nosy: 王杰
priority: normal
severity: normal
status: open
title: DIfference between utf8 and utf-8 when i define python source code 
encoding.
