[Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-07 Thread Yoann Padioleau
Hi,

I would like to use astgen.py to generate Python classes corresponding to the
AST of something I have defined in a .asdl file, along the lines of what is
apparently done for the Python AST itself. I thought astgen.py would
take a .asdl file as an argument, but apparently it instead processes a file
called ast.txt. Where does this file come from? Is it generated from
Python.asdl?

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Johan Gill

Hi devs,
the company where I work has done some work on Python, and the question 
is how this work, owned by the company, can be contributed to the 
community properly. Are there any license issues or other pitfalls we 
need to think about? I imagine that other companies have contributed 
before, so this is probably an already solved problem.


Regards
Johan Gill
Agama Technologies



Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Lennart Regebro
On Thu, Jan 7, 2010 at 10:46, Johan Gill johan.g...@agama.tv wrote:
 Hi devs,
 the company where I work has done some work on Python, and the question is
 how this work, owned by the company, can be contributed to the community
 properly. Are there any license issues or other pitfalls we need to think
 about? I imagine that other companies have contributed before, so this is
 probably an already solved problem.

I'm not a license lawyer, but typically your company needs to give the
code to the community. Yes, it means it stops owning it.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Simon Cross
On Thu, Jan 7, 2010 at 1:12 PM, Lennart Regebro rege...@gmail.com wrote:
 I'm not a license lawyer, but typically your company needs to give the
 code to the community. Yes, it means it stops owning it.

This is incorrect.

The correct information is at http://www.python.org/psf/contrib/.

Schiavo
Simon


[Python-Dev] test_ctypes failure on AIX 5.3 using python-2.6.2 and libffi-3.0.9

2010-01-07 Thread swamy sangamesh
Hi All,

I built python-2.6.2 with the latest libffi-3.0.9 on AIX 5.3 using the xlc
compiler.
When I try to run the ctypes test cases, two failures are seen in
test_bitfields.

test_ints (ctypes.test.test_bitfields.C_Test) ... FAIL
test_shorts (ctypes.test.test_bitfields.C_Test) ... FAIL

I have attached the full test case result.

If I change the types c_int and c_short to c_uint and c_ushort in class
BITS(Structure) in the file
test_bitfields.py, then no failures are seen.

Has anyone faced a similar issue? Any help is appreciated.
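
The signed/unsigned difference described above can be illustrated with a small sketch. This is not the code from test_bitfields.py; the class names, the field name and the 3-bit width are made up for illustration. On a platform where ctypes bit fields behave as intended, a signed 3-bit field sign-extends its top bit, so 7 is expected to read back as -1, while the unsigned field returns 7 unchanged.

```python
from ctypes import Structure, c_int, c_uint


class SignedBits(Structure):
    # Hypothetical example type: a 3-bit signed field holds -4..3.
    _fields_ = [("value", c_int, 3)]


class UnsignedBits(Structure):
    # Hypothetical example type: a 3-bit unsigned field holds 0..7.
    _fields_ = [("value", c_uint, 3)]


def store_and_read(cls, number):
    """Store number in the bit field and return what ctypes reads back."""
    bits = cls()
    bits.value = number
    return bits.value


signed_result = store_and_read(SignedBits, 7)
unsigned_result = store_and_read(UnsignedBits, 7)
```

Comparing signed_result and unsigned_result on the failing machine might help show whether only the signed code path is affected.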


Thanks,
Sangamesh


ctype-testcases
Description: Binary data


Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Nick Coghlan
Lennart Regebro wrote:
 On Thu, Jan 7, 2010 at 10:46, Johan Gill johan.g...@agama.tv wrote:
 Hi devs,
 the company where I work has done some work on Python, and the question is
 how this work, owned by the company, can be contributed to the community
 properly. Are there any license issues or other pitfalls we need to think
 about? I imagine that other companies have contributed before, so this is
 probably an already solved problem.
 
 I'm not a license lawyer, but typically your company needs to give the
 code to the community. Yes, it means it stops owning it.

As Simon pointed out, while some organisations do work that way, the PSF
isn't one of them.

The PSF only requires that the code be contributed under a license that
then allows us to turn around and redistribute it under a different open
source license without requesting additional permission from the
copyright holder. For corporate contributions, I believe the contributor
agreement needs to be signed by an authorised agent of the company - the
place to check that would probably be p...@python.org (that's the email
address for the PSF board).

Assuming the subject line relates to the code that you would like to
contribute though, that particular change is unlikely to happen - 2.6 is
in maintenance mode and changing RLock from a Python implementation to
the faster C one is solidly in new feature territory. Although a
backport of the 3.2 C RLock implementation to 2.7 could be useful, I
doubt that backporting code provided by an existing committer would be
the subject of this query :)
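
For readers unfamiliar with what RLock provides, here is a minimal sketch of its reentrant behaviour. It says nothing about the C implementation or its speed; the function name countdown is invented for the example.

```python
import threading

lock = threading.RLock()


def countdown(n):
    """Re-acquire the same lock at every recursion level.

    With a plain threading.Lock the second acquisition by the same
    thread would block forever; an RLock lets the owning thread re-enter.
    """
    with lock:
        if n == 0:
            return 0
        return 1 + countdown(n - 1)


depth = countdown(5)
```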

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Lennart Regebro
On Thu, Jan 7, 2010 at 13:23, Nick Coghlan ncogh...@gmail.com wrote:
 As Simon pointed out, while some organisations do work that way, the PSF
 isn't one of them.

 The PSF only requires that the code be contributed under a license that
 then allows us to turn around and redistribute it under a different open
 source license without requesting additional permission from the
 copyright holder.

Even if the contributed code, as in this case, is a method in an
existing file? How does that work? How do they keep ownership of one
method in the threading.py module? :-)

 Assuming the subject line relates to the code that you would like to
 contribute though, that particular change is unlikely to happen - 2.6 is
 in maintenance mode and changing RLock from a Python implementation to
 the faster C one is solidly in new feature territory. Although a
 backport of the 3.2 C RLock implementation to 2.7 could be useful, I
 doubt that backporting code provided by an existing committer would be
 the subject of this query :)

Ah. I probably misunderstood what the suggested contribution was.
Maybe it was a separate file, which I didn't get.

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Michael Foord

On 07/01/2010 13:11, Lennart Regebro wrote:
 On Thu, Jan 7, 2010 at 13:23, Nick Coghlan ncogh...@gmail.com wrote:
 As Simon pointed out, while some organisations do work that way, the PSF
 isn't one of them.

 The PSF only requires that the code be contributed under a license that
 then allows us to turn around and redistribute it under a different open
 source license without requesting additional permission from the
 copyright holder.

 Even if the contributed code as in this case is a method in an
 existing file? How does that work, how do they keep ownership of one
 method in the threading.py method? :-)

When contributing code to Python all work remains copyright the original 
author. The combined work is copyright *everyone*. The PSF has a license 
to distribute it, which is all that is important.


How you meaningfully exercise your ownership over chunks of code is left 
for the reader to determine...


(i.e. copyright and ownership are legal terms that don't necessarily 
mean anything *practical* in these situations.)


Michael


--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Stefan Behnel

Guido van Rossum, 07.01.2010 05:29:
 A better rule would be you may access the memory buffer in a PyString
 or PyUnicode object with the GIL released as long as you own a
 reference to the string object. Everything else is out of bounds (or
 not worth the bother).


Is that a yes regarding the OP's original question about releasing the 
GIL during regexp searches?


Stefan



Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Lennart Regebro
On Thu, Jan 7, 2010 at 14:15, Michael Foord fuzzy...@voidspace.org.uk wrote:
 (i.e. copyright and ownership are legal terms that don't necessarily mean
 anything *practical* in these situations.)

OK, fair enough. :-)
-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Antoine Pitrou
MRAB python at mrabarnett.plus.com writes:
 
 I know that it needs to have the GIL during memory-management calls, but
 does it for calls like Py_UNICODE_TOLOWER or PyErr_SetString? Is there
 an easy way to find out?

There is no easy way to do so. The only safe way is to examine all the
functions or macros you want to call with the GIL released, and assess whether
it is safe to call them. As already pointed out, no reference count should be
changed, and generally no mutable container should be accessed, except if that
container is known not to be referenced anywhere else (that would be the case
for e.g. a list that your function has created and is busy populating).

I agree that releasing the GIL when doing non-trivial regex searches is
worthwhile research, so please don't give up immediately :-)

Regards

Antoine Pitrou.




Re: [Python-Dev] Question over splitting unittest into a package

2010-01-07 Thread Olemis Lang
On Mon, Jan 4, 2010 at 9:24 AM, Olemis Lang ole...@gmail.com wrote:
 On Thu, Dec 31, 2009 at 10:30 AM, Martin (gzlist) gzl...@googlemail.com 
 wrote:
 Thanks for the quick response.

 On 30/12/2009, Benjamin Peterson benja...@python.org wrote:

 but maybe a
 discussion could start about a new, less hacky, way of doing the same


 I am strongly -1 for modifying the classes in «traditional» unittest
 module [2]_ (except that I am strongly +1 for the package structure
 WITHOUT TOUCHING anything else ...) , and the more I think about it I
 am more convinced ... but anyway, this not a big deal (and in the end
 what I think is not that relevant either ... :o). So ...


IOW, if I had all the freedom to implement it, after the pkg structure
I'd do something like :

{{{
#!python

class TestResult:
    # Everything just the same
    def _is_relevant_tb_level(self, tb):
        return '__unittest' in tb.tb_frame.f_globals

class BetterTestResult(TestResult):
    # Further code ... maybe ;o)
    #
    def _is_relevant_tb_level(self, tb):
        # This or anything else you might want to do ;o)
        #
        globs = tb.tb_frame.f_globals
        is_relevant = '__name__' in globs and \
            globs['__name__'].startswith('unittest')
        del globs
        return is_relevant
}}}

that's what inheritance is for ;o) ... but quite probably that's not
gonna happen, just a comment .

-- 
Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/

Featured article:
Ubuntu sustituye GIMP por F-Spot  -
http://feedproxy.google.com/~r/simelo-es/~3/-g48D6T6Ojs/ubuntu-sustituye-gimp-por-f-spot.html


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Martin v. Löwis
 A better rule would be you may access the memory buffer in a PyString
 or PyUnicode object with the GIL released as long as you own a
 reference to the string object. Everything else is out of bounds (or
 not worth the bother).
 
 Is that a yes regarding the OP's original question about releasing the
 GIL during regexp searches?

No, because the regex engine may also operate on buffers that start
moving around when you release the GIL.

Regards,
Martin


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Martin v. Löwis
 I've been wondering whether it's possible to release the GIL in the
 regex engine during matching.

I don't think that's possible. The regex engine can also operate on
objects whose representation may move in memory when you don't hold
the GIL (e.g. buffers that get mutated). Even if they stay in place -
if their contents change, regex results may be confusing.
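
As a Python-level illustration of the kind of object meant here (a sketch only; it shows nothing about what happens inside the C engine): the re module accepts a mutable bytearray as its subject, and other code can rewrite that buffer at any time.

```python
import re

# A bytearray is mutable, unlike bytes or str, yet re accepts it as a subject.
subject = bytearray(b"hello world")
match = re.search(rb"wor", subject)
span = match.span()

# Nothing prevents other code from rewriting the same buffer afterwards.
# With the GIL released during matching, another thread could do this
# while the engine is still scanning.
subject[6:9] = b"WOR"
```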

Regards,
Martin


Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-07 Thread Martin v. Löwis
 I would like to use astgen.py to generate python classes corresponding to the 
 AST of something I have defined in a .asdl file, along the line of what is
 apparently done for the python AST itself. I thought astgen.py would
 take as an argument a .asdl file, but apparently it instead process a file
 called ast.txt. Where does this file come from ? Is it generated from
 Python.asdl ?

astgen.py is not used to process asdl files; ast.txt lives right next to
astgen.py. Instead, the asdl file is processed by Parser/asdl_c.py.
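
For comparison, the node classes that CPython builds from Python.asdl are the ones exposed through the standard ast module. A small sketch, assuming a current CPython (the expression "-x" is just an arbitrary input):

```python
import ast

# Parse a tiny expression and look at the node class the parser produced.
tree = ast.parse("-x", mode="eval")
node = tree.body

node_type = type(node).__name__   # class name of the unary-minus node
fields = node._fields             # field names declared for that node type
operator = type(node.op).__name__ # the operator is itself a node class
```

Here the unary minus shows up as ast.UnaryOp with a USub operator, rather than the UnarySub class of the old compiler package.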

HTH,
Martin


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread James Y Knight

On Jan 7, 2010, at 3:27 PM, Martin v. Löwis wrote:
 I've been wondering whether it's possible to release the GIL in the
 regex engine during matching.

 I don't think that's possible. The regex engine can also operate on
 objects whose representation may move in memory when you don't hold
 the GIL (e.g. buffers that get mutated). Even if they stay in place -
 if their contents changes, regex results may be confusing.


It seems probably worthwhile to optimize for the common case of using  
the regexp engine on an immutable object of type str or bytes, and  
allow releasing the GIL in *that* case, even if you have to keep it  
for the general case.


James


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Martin v. Löwis
 I've been wondering whether it's possible to release the GIL in the
 regex engine during matching.

 I don't think that's possible. The regex engine can also operate on
 objects whose representation may move in memory when you don't hold
 the GIL (e.g. buffers that get mutated). Even if they stay in place -
 if their contents changes, regex results may be confusing.
 
 It seems probably worthwhile to optimize for the common case of using
 the regexp engine on an immutable object of type str or bytes, and
 allow releasing the GIL in *that* case, even if you have to keep it for
 the general case.

Right. This problem was the one that I thought of first.

Thinking about these things is fairly difficult (to me, at least), so
I think I could only tell whether I would consider a patch thread-safe
that released the GIL around matching under selected circumstances -
if I had the patch available. I don't see any obvious reason (assuming
Guido's list of conditions holds - i.e. you are holding references to
everything you access).

Regards,
Martin


Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Nick Coghlan
Johan Gill wrote:
 Yes, it is the new RLock implementation.
 If I understood this correctly, we should make a patch against trunk if
 anything should be contributed.

Yep.

 Do you mean that we wouldn't need the paperwork for backporting the
 original patch committed to py3k?

Whether or not a contributor agreement was essential for this particular
contribution would depend on how much new code was needed for the
backport, but the bulk of the copyright on the C RLock code would remain
with Antoine regardless.

However, sorting through the legalities of the contributor agreement
really is the best way to make sure everything is squared away nice and
neatly from a legal point of view.

After all, even if I was a lawyer (which I'm not, I'm just a developer
with an interest in licensing issues), I still wouldn't be *your* lawyer :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Antoine Pitrou
Martin v. Löwis martin at v.loewis.de writes:
 
 I don't think that's possible. The regex engine can also operate on
 objects whose representation may move in memory when you don't hold
 the GIL (e.g. buffers that get mutated).

Why is it a problem? If we get a buffer through the new buffer API, the object
should ensure that the representation isn't moved away until the buffer is 
released.
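
The guarantee described here can be observed from Python with memoryview, which uses the new buffer API. A minimal sketch (the variable names are illustrative): while a view is exported, a bytearray refuses to be resized, so its storage cannot move.

```python
buf = bytearray(b"abc")
view = memoryview(buf)          # takes a buffer export on buf

try:
    buf.extend(b"def")          # would need to reallocate the storage
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

view.release()                  # give the export back
buf.extend(b"def")              # resizing is allowed again
```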

Regards

Antoine.




Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Martin v. Löwis
 I've been wondering whether it's possible to release the GIL in the
 regex engine during matching.

Ok, here is another problem: SRE_OP_REPEAT uses PyObject_MALLOC,
which requires the GIL (it then also may call PyErr_NoMemory,
which also requires the GIL).

Regards,
Martin



Re: [Python-Dev] Backported faster RLock to Python 2.6.

2010-01-07 Thread Johan Gill

On 01/07/2010 01:23 PM, Nick Coghlan wrote:

 As Simon pointed out, while some organisations do work that way, the PSF
 isn't one of them.

 The PSF only requires that the code be contributed under a license that
 then allows us to turn around and redistribute it under a different open
 source license without requesting additional permission from the
 copyright holder. For corporate contributions, I believe the contributor
 agreement needs to be signed by an authorised agent of the company - the
 place to check that would probably be p...@python.org (that's the email
 address for the PSF board).

 Assuming the subject line relates to the code that you would like to
 contribute though, that particular change is unlikely to happen - 2.6 is
 in maintenance mode and changing RLock from a Python implementation to
 the faster C one is solidly in new feature territory. Although a
 backport of the 3.2 C RLock implementation to 2.7 could be useful, I
 doubt that backporting code provided by an existing committer would be
 the subject of this query :)

 Regards,
 Nick.

Yes, it is the new RLock implementation.
If I understood this correctly, we should make a patch against trunk if 
anything should be contributed.
Do you mean that we wouldn't need the paperwork for backporting the 
original patch committed to py3k?


Regards
Johan Gill
Agama Technologies



Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-07 Thread Yoann Padioleau

On Jan 7, 2010, at 12:31 PM, Martin v. Löwis wrote:

 I would like to use astgen.py to generate python classes corresponding to 
 the 
 AST of something I have defined in a .asdl file, along the line of what is
 apparently done for the python AST itself. I thought astgen.py would
 take as an argument a .asdl file, but apparently it instead process a file
 called ast.txt. Where does this file come from ? Is it generated from
 Python.asdl ?
 
 astgen.py is not used to process asdl files; ast.txt lives right next to
 astgen.py. Instead, the asdl file is processed by Parser/asdl_c.py.

Yes, I know that. That's why I asked about the relation between ast.txt and
Python.asdl.
If internally the parser uses the .asdl file, but exposes as a reflection
mechanism things that were generated from ast.txt, then there could be a
mismatch. Where does ast.txt come from? Shouldn't it be generated itself
from Python.asdl?

So we would have

Python.asdl -> ast.txt -> astgen.py -> ast.py containing all
the UnarySub, Expression, and other classes that represent a Python AST.



 
 HTH,
 Martin



Re: [Python-Dev] GIL required for _all_ Python calls?

2010-01-07 Thread Martin v. Löwis
 I don't think that's possible. The regex engine can also operate on
 objects whose representation may move in memory when you don't hold
 the GIL (e.g. buffers that get mutated).
 
 Why is it a problem? If we get a buffer through the new buffer API, the object
 should ensure that the representation isn't moved away until the buffer is 
 released.

In 2.7, we currently get the buffer with bf_getreadbuffer. In 3.x, we have

/* Release the buffer immediately --- possibly dangerous
   but doing something else would require some re-factoring
*/
PyBuffer_Release(view);


Even if we do use the new API, and correctly, it still might be
confusing if the contents of the buffer change underneath.
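
A short sketch of that last point, again using memoryview as a stand-in for any consumer of the buffer API: holding an export pins the storage in place, but it does not freeze the bytes in it.

```python
buf = bytearray(b"spam")
view = memoryview(buf)       # export held, so buf cannot be resized

before = bytes(view)
buf[0] = ord("S")            # same length, so the export does not block this
after = bytes(view)          # the view sees the changed contents

view.release()
```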

Regards,
Martin


Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt

2010-01-07 Thread Martin v. Löwis
 astgen.py is not used to process asdl files; ast.txt lives right
 next to astgen.py. Instead, the asdl file is processed by
 Parser/asdl_c.py.
 
 Yes, I know that. That's why I asked about the relation between
 ast.txt and Python.asdl. If internally the parser uses the .asdl, but
 exposes as a reflection mechanism things that were generated from
 ast.txt, then there could be a mismatch. Where does ast.txt come
 from? Shouldn't it be generated itself from Python.asdl?

What you may not be aware of is that Tools/compiler (and the
compiler package that it builds on) are both unused and unmaintained.

If the package stops working correctly - tough luck.

 So we would have
 
 Python.asdl -> ast.txt -> astgen.py -> ast.py
 containing all the UnarySub, Expression, and other classes that represent a
 Python AST.

No - what actually happens in Python 3.x is this: both the compiler
package and Tools/compiler are removed.

Regards,
Martin


[Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Victor Stinner
Hi,

The builtin open() function is unable to open a UTF-16/32 file starting with a
BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8
file starting with a BOM, read()/readline() also returns the BOM, whereas the
BOM should be ignored.

See recent issues related to reading a UTF-8 text file including a BOM: #7185
(csv) and #7519 (ConfigParser). Such a file can be opened in unicode mode with
the UTF-8-SIG encoding, but it's possible to do better.
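
To make the existing workaround concrete, here is a small sketch that reads the same bytes through TextIOWrapper with and without the utf-8-sig codec. The helper read_all and the sample text are made up for the example; an in-memory BytesIO stands in for a real file.

```python
import codecs
import io

raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")


def read_all(data, encoding):
    """Decode bytes through TextIOWrapper, the layer open() uses in text mode."""
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding).read()


kept = read_all(raw, "utf-8")          # the BOM survives as U+FEFF
stripped = read_all(raw, "utf-8-sig")  # the codec consumes the BOM
```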

I propose to improve open() (TextIOWrapper) by using the BOM to choose the
right encoding. I think that only files opened in read-only mode should
support this new feature. *Reading* the BOM in a *write*-only file would cause
unexpected behaviours.
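
One way the BOM-to-encoding step could look, as a sketch only: this is not the patch attached to the issue, and the function name encoding_from_bom and its default argument are assumptions. Note that the UTF-32 marks have to be tested before the UTF-16 ones, because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM.

```python
import codecs

# Order matters: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head, default=None):
    """Return a codec name that consumes the BOM found at the start of head."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The codec names returned are the BOM-aware ones, so decoding with them drops the mark. A UTF-16 little-endian file whose first character is U+0000 would be mistaken for UTF-32 by this check, which is one reason to treat it as a heuristic.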

Since my proposal changes the result of TextIOWrapper.read()/readline() for
files starting with a BOM, we might introduce an option to open() to enable
the new behaviour. But is it really necessary to keep backward compatibility?

I wrote a proof of concept attached to the issue #7651. My patch only changes 
the behaviour of TextIOWrapper for reading files starting with a BOM. It 
doesn't work yet if a seek() is used before the first read.

-- 
Victor Stinner
http://www.haypocalc.com/


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
talk. And for the other two, perhaps it would make more sense to have
a separate encoding-guessing function that takes a binary stream and
returns a text stream wrapping it with the proper encoding?
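
A rough sketch of what such a separate function might look like, assuming the binary stream is buffered and supports peek(). The name open_text_guessing and the UTF-8 fallback are invented for illustration; no such function exists in the standard library.

```python
import codecs
import io


def open_text_guessing(binary, default="utf-8"):
    """Wrap a buffered binary stream in a text stream, using a BOM if present.

    Hypothetical helper: only BOMs are examined, nothing else is guessed.
    """
    head = binary.peek(4)[:4]  # look ahead without consuming any bytes
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    else:
        encoding = default
    return io.TextIOWrapper(binary, encoding=encoding)


# Usage with an in-memory stream standing in for open(path, "rb").
data = "héllo".encode("utf-16")
text = open_text_guessing(io.BufferedReader(io.BytesIO(data))).read()
```

Keeping this outside open() leaves the behaviour of existing code unchanged, which sidesteps the backward-compatibility question.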

--Guido

On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
victor.stin...@haypocalc.com wrote:
 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 See recent issues related to reading an UTF-8 text file including a BOM: #7185
 (csv) and #7519 (ConfigParser). Such file can be opened in unicode mode with
 the UTF-8-SIG encoding, but it's possible to do better.

 I propose to improve open() (TextIOWrapper) by using the BOM to choose the
 right encoding. I think that only files opened in read only mode should
 support this new feature. *Read* the BOM in a *write* only file would cause
 unexpected behaviours.

 Since my proposition changes the result TextIOWrapper.read()/readline() for
 files starting with a BOM, we might introduce an option to open() to enable
 the new behaviour. But is it really needed to keep the backward compatibility?

 I wrote a proof of concept attached to the issue #7651. My patch only changes
 the behaviour of TextIOWrapper for reading files starting with a BOM. It
 doesn't work yet if a seek() is used before the first read.

 --
 Victor Stinner
 http://www.haypocalc.com/




-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread MRAB

Guido van Rossum wrote:
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that
expects UTF-8, with or without BOM, or UTF-16/32 with BOM.


On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
victor.stin...@haypocalc.com wrote:

Hi,

Builtin open() function is unable to open an UTF-16/32 file starting with a
BOM if the encoding is not specified (raise an unicode error). For an UTF-8
file starting with a BOM, read()/readline() returns also the BOM whereas the
BOM should be ignored.

See recent issues related to reading an UTF-8 text file including a BOM: #7185
(csv) and #7519 (ConfigParser). Such file can be opened in unicode mode with
the UTF-8-SIG encoding, but it's possible to do better.

I propose to improve open() (TextIOWrapper) by using the BOM to choose the
right encoding. I think that only files opened in read only mode should
support this new feature. *Read* the BOM in a *write* only file would cause
unexpected behaviours.

Since my proposition changes the result TextIOWrapper.read()/readline() for
files starting with a BOM, we might introduce an option to open() to enable
the new behaviour. But is it really needed to keep the backward compatibility?

I wrote a proof of concept attached to the issue #7651. My patch only changes
the behaviour of TextIOWrapper for reading files starting with a BOM. It
doesn't work yet if a seek() is used before the first read.




Re: [Python-Dev] --enabled-shared broken on freebsd5?

2010-01-07 Thread Nicholas Bastin
I think this problem probably needs to move over to distutils-sig, as
it doesn't seem to be specific to the way that Python itself uses
distutils.  distutils.command.build_ext tests for Py_ENABLE_SHARED on
linux and solaris and automatically adds '.' to the library_dirs, and
I suspect it just needs to do this on FreeBSD as well (adding bsd to
the list of platforms for which this is performed solves the
problem, but I don't pretend to know enough about either distutils or
freebsd to determine if this is the correct solution).
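
The shape of the check being discussed can be sketched as follows. This is not the distutils source: the function name, the tuple of platform prefixes and the choice of "freebsd" as the added prefix are assumptions made for illustration, and only sysconfig.get_config_var is a real standard-library call.

```python
import sys
import sysconfig

# Assumption: prefixes for the platforms named above, plus the proposed addition.
_LOCAL_LIBDIR_PLATFORMS = ("linux", "sunos", "freebsd")


def extra_library_dirs(platform=None, enable_shared=None):
    """Return ['.'] when a shared build on a listed platform should
    also search the current directory for the Python library."""
    if platform is None:
        platform = sys.platform
    if enable_shared is None:
        enable_shared = bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))
    if enable_shared and platform.startswith(_LOCAL_LIBDIR_PLATFORMS):
        return ["."]
    return []
```

Passing the platform and the shared flag as arguments makes the decision easy to try out for other systems without building on them.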

--
Nick


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz


On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 Hi,
 
 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

It *is* crazy, but unfortunately rather common.  Wikipedia has a good 
description of the issues: 
http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some Windows 
text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, 
so it's become a convention to do that.  That's not good enough, so you need to 
guess the encoding as well to make sure, but if there is a BOM and you can 
otherwise verify that the file is probably UTF-8 encoded, you should discard it.
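
For the UTF-8 case, a short illustration using the standard utf-8-sig codec,
which discards a leading BOM when decoding. The sample bytes are made up for
the example.

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by ASCII text

with_bom = data.decode("utf-8")         # BOM is kept as the character U+FEFF
without_bom = data.decode("utf-8-sig")  # leading BOM is discarded

# utf-8-sig also accepts input that has no BOM at all.
plain = b"hello".decode("utf-8-sig")
```

Here with_bom is "\ufeffhello", while without_bom and plain are both "hello".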



Re: [Python-Dev] --enabled-shared broken on freebsd5?

2010-01-07 Thread Tres Seaver
Nicholas Bastin wrote:
 I think this problem probably needs to move over to distutils-sig, as
 it doesn't seem to be specific to the way that Python itself uses
 distutils.  distutils.command.build_ext tests for Py_ENABLE_SHARED on
 linux and solaris and automatically adds '.' to the library_dirs, and
 I suspect it just needs to do this on FreeBSD as well (adding bsd to
 the list of platforms for which this is performed solves the
 problem, but I don't pretend to know enough about either distutils or
 freebsd to determine if this is the correct solution).

I wouldn't say it needed discussion on the SIG:  just create a bug
report, with the tentative patch you have worked out, and get it
assigned to Tarek.


Tres.
--
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Guido van Rossum
On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:


 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:

 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the
BOM, because the BOM itself is UTF-8-encoded.

(And yes, I know this happens. Doesn't mean we need to auto-guess by
default; there are lots of issues e.g. what should happen after
seeking to offset 0?)
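
A sketch of the separate helper idea: take a binary stream, return a text
stream. The name wrap_guessing_bom and the default encoding are assumptions
for illustration; io.TextIOWrapper and the codecs.BOM_* constants are
standard. It assumes a seekable stream positioned at its start.

```python
import codecs
import io


def wrap_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text stream,
    choosing the encoding from a leading BOM when there is one."""
    head = binary_stream.read(4)
    binary_stream.seek(0)  # rewind; the codec itself will consume the BOM
    # UTF-32 first: its little-endian BOM starts with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        encoding = "utf-32"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    elif head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    else:
        encoding = default
    return io.TextIOWrapper(binary_stream, encoding=encoding)


raw = io.BytesIO(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
text = wrap_guessing_bom(raw)
first = text.read()
text.seek(0)
second = text.read()
```

Because the BOM-aware codecs treat the BOM as a signature rather than as
text, reading again after seek(0) returns the same characters without it.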

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Stephen J. Turnbull
Guido van Rossum writes:

  I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
  talk.

That doesn't stop many applications from doing it.  Python should
perhaps <wink, nudge> not produce UTF-8 + BOM without a disclaimer of
indemnification against all resulting damage, signed in blood, from
the user for each instance.

But it should do something sane when reading such files.  I can't
really see any harm in throwing it away, especially since use of
ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated
IIRC.
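
On the producing side, a small illustration with the standard codecs: the
plain utf-8 codec never writes a BOM, and one is emitted only when utf-8-sig
is requested explicitly. The sample string is made up.

```python
text = "hi"

no_bom = text.encode("utf-8")        # no signature is written
with_bom = text.encode("utf-8-sig")  # BOM written only on explicit request
```

Here no_bom is b"hi" and with_bom is b"\xef\xbb\xbfhi".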






Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Tres Seaver
Guido van Rossum wrote:
 On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com 
 wrote:

 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
 victor.stin...@haypocalc.com wrote:

 Hi,

 Builtin open() function is unable to open an UTF-16/32 file starting with a
 BOM if the encoding is not specified (raise an unicode error). For an UTF-8
 file starting with a BOM, read()/readline() returns also the BOM whereas the
 BOM should be ignored.

 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?

 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.
 
 (And yes, I know this happens. Doesn't mean we need to auto-guess by
 default; there are lots of issues e.g. what should happen after
 seeking to offset 0?)

The BOM should not be seekable if the file is opened with the proposed
"guess encoding from BOM" mode:  it isn't properly part of the stream at
all in that case.

A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild:
Python would do well to make it as easy as possible to consume such
files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs).  In
the best of all possible worlds, I would just try opening the file so:

  f = open('/path/to/file', 'r', encoding=DWIFM)

and any BOM present would set the encoding for the remainder of the stream.



Tres.
--
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com



Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-07 Thread Glyph Lefkowitz

On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote:

 On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com 
 wrote:
 
 On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
 
 I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
 talk. And for the other two, perhaps it would make more sense to have
 a separate encoding-guessing function that takes a binary stream and
 returns a text stream wrapping it with the proper encoding?
 
 It *is* crazy, but unfortunately rather common.  Wikipedia has a good
 description of the issues:
 http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark.  Basically, some
 Windows text APIs will emit a UTF-8 BOM in order to identify the file as
 being UTF-8, so it's become a convention to do that.  That's not good
 enough, so you need to guess the encoding as well to make sure, but if there
 is a BOM and you can otherwise verify that the file is probably UTF-8
 encoded, you should discard it.
 
 That doesn't make sense. If the file isn't UTF-8 you can't see the
 BOM, because the BOM itself is UTF-8-encoded.

I'm saying that the BOM itself isn't enough to detect that the file is actually 
UTF-8.  If (for whatever reason: explicitly specified, guessed in some other 
way) the file's encoding is determined to be something else, the bytes 
comprising the BOM should be decoded as normal.  It's just that the UTF-8 
decoding of the BOM at the start of a file should be discarded.
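
A small illustration of that distinction with standard codecs: the same
three bytes decode to ordinary characters under another encoding, and to
nothing under the BOM-aware UTF-8 codec. Latin-1 is an arbitrary choice for
the example.

```python
bom_bytes = b"\xef\xbb\xbf"

as_latin1 = bom_bytes.decode("latin-1")      # three ordinary characters
as_utf8_sig = bom_bytes.decode("utf-8-sig")  # recognised as a signature and dropped
```

Here as_latin1 is the three-character string "\xef\xbb\xbf" and as_utf8_sig
is the empty string.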

 (And yes, I know this happens. Doesn't mean we need to auto-guess by
 default; there are lots of issues e.g. what should happen after
 seeking to offset 0?)

I think it's pretty clear that the BOM should still be skipped in that case ...
