[Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt
Hi,

I would like to use astgen.py to generate Python classes corresponding to the AST of something I have defined in a .asdl file, along the lines of what is apparently done for the Python AST itself. I thought astgen.py would take a .asdl file as an argument, but apparently it instead processes a file called ast.txt. Where does this file come from? Is it generated from Python.asdl?

___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] Backported faster RLock to Python 2.6.
Hi devs, the company where I work has done some work on Python, and the question is how this work, owned by the company, can be contributed to the community properly. Are there any license issues or other pitfalls we need to think about? I imagine that other companies have contributed before, so this is probably an already solved problem. Regards Johan Gill Agama Technologies
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On Thu, Jan 7, 2010 at 10:46, Johan Gill johan.g...@agama.tv wrote:
> Hi devs, the company where I work has done some work on Python, and the question is how this work, owned by the company, can be contributed to the community properly. Are there any license issues or other pitfalls we need to think about? I imagine that other companies have contributed before, so this is probably an already solved problem.

I'm not a license lawyer, but typically your company needs to give the code to the community. Yes, it means it stops owning it.

-- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On Thu, Jan 7, 2010 at 1:12 PM, Lennart Regebro rege...@gmail.com wrote:
> I'm not a license lawyer, but typically your company needs to give the code to the community. Yes, it means it stops owning it.

This is incorrect. The correct information is at http://www.python.org/psf/contrib/.

Schiavo
Simon
[Python-Dev] test_ctypes failure on AIX 5.3 using python-2.6.2 and libffi-3.0.9
Hi All,

I built python-2.6.2 with the latest libffi-3.0.9 on AIX 5.3 using the xlc compiler. When I try to run the ctypes test cases, two failures are seen in test_bitfields:

test_ints (ctypes.test.test_bitfields.C_Test) ... FAIL
test_shorts (ctypes.test.test_bitfields.C_Test) ... FAIL

I have attached the full test case result. If I change the types c_int and c_short to c_uint and c_ushort in class BITS(Structure) in the file test_bitfields.py, then no failures are seen. Has anyone faced a similar issue? Any help is appreciated.

Thanks,
Sangamesh
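To make the change described above concrete, here is a minimal sketch of a signed versus an unsigned ctypes bit-field. The class and field names are made up for illustration and are not taken from test_bitfields.py, and whether signedness is really what goes wrong on AIX with xlc is not established in this thread.

```python
from ctypes import Structure, c_int, c_uint


class SignedBits(Structure):
    # Hypothetical structure: a 3-bit field declared with a signed type.
    _fields_ = [("a", c_int, 3)]


class UnsignedBits(Structure):
    # Same layout, but declared with an unsigned type.
    _fields_ = [("a", c_uint, 3)]


u = UnsignedBits()
u.a = 7  # all three bits set; an unsigned field reads back as 7

s = SignedBits()
s.a = 7  # same bit pattern; where the field is treated as signed
         # this typically reads back as -1 instead

print("unsigned:", u.a)
print("signed:  ", s.a)
```

Comparing the two printed values on the affected platform against a platform where the tests pass may help narrow down where the results diverge.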
Re: [Python-Dev] Backported faster RLock to Python 2.6.
Lennart Regebro wrote:
> On Thu, Jan 7, 2010 at 10:46, Johan Gill johan.g...@agama.tv wrote:
>> Hi devs, the company where I work has done some work on Python, and the question is how this work, owned by the company, can be contributed to the community properly. Are there any license issues or other pitfalls we need to think about? I imagine that other companies have contributed before, so this is probably an already solved problem.
>
> I'm not a license lawyer, but typically your company needs to give the code to the community. Yes, it means it stops owning it.

As Simon pointed out, while some organisations do work that way, the PSF isn't one of them. The PSF only requires that the code be contributed under a license that then allows us to turn around and redistribute it under a different open source license without requesting additional permission from the copyright holder. For corporate contributions, I believe the contributor agreement needs to be signed by an authorised agent of the company - the place to check that would probably be p...@python.org (that's the email address for the PSF board).

Assuming the subject line relates to the code that you would like to contribute though, that particular change is unlikely to happen - 2.6 is in maintenance mode and changing RLock from a Python implementation to the faster C one is solidly in new feature territory. Although a backport of the 3.2 C RLock implementation to 2.7 could be useful, I doubt that backporting code provided by an existing committer would be the subject of this query :)

Regards,
Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On Thu, Jan 7, 2010 at 13:23, Nick Coghlan ncogh...@gmail.com wrote:
> As Simon pointed out, while some organisations do work that way, the PSF isn't one of them. The PSF only requires that the code be contributed under a license that then allows us to turn around and redistribute it under a different open source license without requesting additional permission from the copyright holder.

Even if the contributed code as in this case is a method in an existing file? How does that work, how do they keep ownership of one method in the threading.py method? :-)

> Assuming the subject line relates to the code that you would like to contribute though, that particular change is unlikely to happen - 2.6 is in maintenance mode and changing RLock from a Python implementation to the faster C one is solidly in new feature territory. Although a backport of the 3.2 C RLock implementation to 2.7 could be useful, I doubt that backporting code provided by an existing committer would be the subject of this query :)

Ah. I probably misunderstood what the suggested contribution was. Maybe it was a separate file, which I didn't get.

-- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On 07/01/2010 13:11, Lennart Regebro wrote:
> On Thu, Jan 7, 2010 at 13:23, Nick Coghlan ncogh...@gmail.com wrote:
>> As Simon pointed out, while some organisations do work that way, the PSF isn't one of them. The PSF only requires that the code be contributed under a license that then allows us to turn around and redistribute it under a different open source license without requesting additional permission from the copyright holder.
>
> Even if the contributed code as in this case is a method in an existing file? How does that work, how do they keep ownership of one method in the threading.py method? :-)

When contributing code to Python all work remains copyright the original author. The combined work is copyright *everyone*. The PSF has a license to distribute it, which is all that is important. How you meaningfully exercise your ownership over chunks of code is left for the reader to determine...

(i.e. copyright and ownership are legal terms that don't necessarily mean anything *practical* in these situations.)

Michael

-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
Re: [Python-Dev] GIL required for _all_ Python calls?
Guido van Rossum, 07.01.2010 05:29:
> A better rule would be "you may access the memory buffer in a PyString or PyUnicode object with the GIL released as long as you own a reference to the string object". Everything else is out of bounds (or not worth the bother).

Is that a yes regarding the OP's original question about releasing the GIL during regexp searches?

Stefan
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On Thu, Jan 7, 2010 at 14:15, Michael Foord fuzzy...@voidspace.org.uk wrote:
> (i.e. copyright and ownership are legal terms that don't necessarily mean anything *practical* in these situations.)

OK, fair enough. :-)

-- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
Re: [Python-Dev] GIL required for _all_ Python calls?
MRAB python at mrabarnett.plus.com writes:
> I know that it needs to have the GIL during memory-management calls, but does it for calls like Py_UNICODE_TOLOWER or PyErr_SetString? Is there an easy way to find out?

There is no easy way to do so. The only safe way is to examine all the functions or macros you want to call with the GIL released, and assess whether it is safe to call them. As already pointed out, no reference count should be changed, and generally no mutable container should be accessed, except if that container is known not to be referenced anywhere else (that would be the case for e.g. a list that your function has created and is busy populating).

I agree that releasing the GIL when doing non-trivial regex searches is a worthwhile research topic, so please don't give up immediately :-)

Regards

Antoine Pitrou.
Re: [Python-Dev] Question over splitting unittest into a package
On Mon, Jan 4, 2010 at 9:24 AM, Olemis Lang ole...@gmail.com wrote: On Thu, Dec 31, 2009 at 10:30 AM, Martin (gzlist) gzl...@googlemail.com wrote: Thanks for the quick response. On 30/12/2009, Benjamin Peterson benja...@python.org wrote: but maybe a discussion could start about a new, less hacky, way of doing the same

I am strongly -1 for modifying the classes in «traditional» unittest module [2]_ (except that I am strongly +1 for the package structure WITHOUT TOUCHING anything else ...), and the more I think about it the more convinced I am ... but anyway, this is not a big deal (and in the end what I think is not that relevant either ... :o). So ...

IOW, if I had all the freedom to implement it, after the pkg structure I'd do something like:

{{{
#!python

class TestResult:
    # Everything just the same
    def _is_relevant_tb_level(self, tb):
        return '__unittest' in tb.tb_frame.f_globals

class BetterTestResult(TestResult):
    # Further code ... maybe ;o)
    #
    def _is_relevant_tb_level(self, tb):
        # This or anything else you might want to do ;o)
        #
        globs = tb.tb_frame.f_globals
        is_relevant = '__name__' in globs and \
            globs['__name__'].startswith('unittest')
        del globs
        return is_relevant
}}}

that's what inheritance is for ;o) ... but quite probably that's not gonna happen, just a comment.

-- Regards, Olemis.
Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
Featured article: Ubuntu sustituye GIMP por F-Spot - http://feedproxy.google.com/~r/simelo-es/~3/-g48D6T6Ojs/ubuntu-sustituye-gimp-por-f-spot.html
Re: [Python-Dev] GIL required for _all_ Python calls?
>> A better rule would be you may access the memory buffer in a PyString or PyUnicode object with the GIL released as long as you own a reference to the string object. Everything else is out of bounds (or not worth the bother).
>
> Is that a yes regarding the OP's original question about releasing the GIL during regexp searches?

No, because the regex engine may also operate on buffers that start moving around when you release the GIL.

Regards,
Martin
Re: [Python-Dev] GIL required for _all_ Python calls?
> I've been wondering whether it's possible to release the GIL in the regex engine during matching.

I don't think that's possible. The regex engine can also operate on objects whose representation may move in memory when you don't hold the GIL (e.g. buffers that get mutated). Even if they stay in place - if their contents change, regex results may be confusing.

Regards,
Martin
Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt
> I would like to use astgen.py to generate python classes corresponding to the AST of something I have defined in a .asdl file, along the line of what is apparently done for the python AST itself. I thought astgen.py would take as an argument a .asdl file, but apparently it instead process a file called ast.txt. Where does this file come from ? Is it generated from Python.asdl ?

astgen.py is not used to process asdl files; ast.txt lives right next to astgen.py. Instead, the asdl file is processed by Parser/asdl_c.py.

HTH,
Martin
Re: [Python-Dev] GIL required for _all_ Python calls?
On Jan 7, 2010, at 3:27 PM, Martin v. Löwis wrote:
>> I've been wondering whether it's possible to release the GIL in the regex engine during matching.
>
> I don't think that's possible. The regex engine can also operate on objects whose representation may move in memory when you don't hold the GIL (e.g. buffers that get mutated). Even if they stay in place - if their contents changes, regex results may be confusing.

It seems probably worthwhile to optimize for the common case of using the regexp engine on an immutable object of type str or bytes, and allow releasing the GIL in *that* case, even if you have to keep it for the general case.

James
Re: [Python-Dev] GIL required for _all_ Python calls?
>>> I've been wondering whether it's possible to release the GIL in the regex engine during matching.
>>
>> I don't think that's possible. The regex engine can also operate on objects whose representation may move in memory when you don't hold the GIL (e.g. buffers that get mutated). Even if they stay in place - if their contents changes, regex results may be confusing.
>
> It seems probably worthwhile to optimize for the common case of using the regexp engine on an immutable object of type str or bytes, and allow releasing the GIL in *that* case, even if you have to keep it for the general case.

Right. This problem was the one that I thought of first. Thinking about these things is fairly difficult (to me, at least), so I think I could only tell whether I would consider a patch thread-safe that released the GIL around matching under selected circumstances - if I had the patch available. I don't see any obvious reason (assuming Guido's list of conditions holds - i.e. you are holding references to everything you access).

Regards,
Martin
Re: [Python-Dev] Backported faster RLock to Python 2.6.
Johan Gill wrote:
> Yes, it is the new RLock implementation. If I understood this correctly, we should make a patch against trunk if anything should be contributed.

Yep.

> Do you mean that we wouldn't need the paperwork for backporting the original patch committed to py3k?

Whether or not a contributor agreement was essential for this particular contribution would depend on how much new code was needed for the backport, but the bulk of the copyright on the C RLock code would remain with Antoine regardless. However, sorting through the legalities of the contributor agreement really is the best way to make sure everything is squared away nice and neatly from a legal point of view. After all, even if I was a lawyer (which I'm not, I'm just a developer with an interest in licensing issues), I still wouldn't be *your* lawyer :)

Cheers,
Nick.

-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] GIL required for _all_ Python calls?
Martin v. Löwis martin at v.loewis.de writes:
> I don't think that's possible. The regex engine can also operate on objects whose representation may move in memory when you don't hold the GIL (e.g. buffers that get mutated).

Why is it a problem? If we get a buffer through the new buffer API, the object should ensure that the representation isn't moved away until the buffer is released.

Regards

Antoine.
Re: [Python-Dev] GIL required for _all_ Python calls?
> I've been wondering whether it's possible to release the GIL in the regex engine during matching.

Ok, here is another problem: SRE_OP_REPEAT uses PyObject_MALLOC, which requires the GIL (it then also may call PyErr_NoMemory, which also requires the GIL).

Regards,
Martin
Re: [Python-Dev] Backported faster RLock to Python 2.6.
On 01/07/2010 01:23 PM, Nick Coghlan wrote:
> As Simon pointed out, while some organisations do work that way, the PSF isn't one of them. The PSF only requires that the code be contributed under a license that then allows us to turn around and redistribute it under a different open source license without requesting additional permission from the copyright holder. For corporate contributions, I believe the contributor agreement needs to be signed by an authorised agent of the company - the place to check that would probably be p...@python.org (that's the email address for the PSF board).
>
> Assuming the subject line relates to the code that you would like to contribute though, that particular change is unlikely to happen - 2.6 is in maintenance mode and changing RLock from a Python implementation to the faster C one is solidly in new feature territory. Although a backport of the 3.2 C RLock implementation to 2.7 could be useful, I doubt that backporting code provided by an existing committer would be the subject of this query :)
>
> Regards, Nick.

Yes, it is the new RLock implementation. If I understood this correctly, we should make a patch against trunk if anything should be contributed. Do you mean that we wouldn't need the paperwork for backporting the original patch committed to py3k?

Regards
Johan Gill
Agama Technologies
Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt
On Jan 7, 2010, at 12:31 PM, Martin v. Löwis wrote:
>> I would like to use astgen.py to generate python classes corresponding to the AST of something I have defined in a .asdl file, along the line of what is apparently done for the python AST itself. I thought astgen.py would take as an argument a .asdl file, but apparently it instead process a file called ast.txt. Where does this file come from ? Is it generated from Python.asdl ?
>
> astgen.py is not used to process asdl files; ast.txt lives right next to astgen.py. Instead, the asdl file is processed by Parser/asdl_c.py.

Yes, I know that. That's why I asked about the relation between ast.txt and Python.asdl. If internally the parser uses the .asdl, but exposes as a reflection mechanism things that were generated from ast.txt, then there could be a mismatch. Where does ast.txt come from? Shouldn't it be generated itself from Python.asdl?

So we would have

Python.asdl -> ast.txt -> (astgen.py) -> ast.py

with ast.py containing all the classes (UnarySub, Expression, ...) that represent a Python AST.

> HTH, Martin
Re: [Python-Dev] GIL required for _all_ Python calls?
>> I don't think that's possible. The regex engine can also operate on objects whose representation may move in memory when you don't hold the GIL (e.g. buffers that get mutated).
>
> Why is it a problem? If we get a buffer through the new buffer API, the object should ensure that the representation isn't moved away until the buffer is released.

In 2.7, we currently get the buffer with bf_getreadbuffer. In 3.x, we have

    /* Release the buffer immediately --- possibly dangerous
       but doing something else would require some re-factoring */
    PyBuffer_Release(view);

Even if we do use the new API, and correctly, it still might be confusing if the contents of the buffer changes underneath.

Regards,
Martin
Re: [Python-Dev] relation between Python.asdl and Tools/compiler/ast.txt
>> astgen.py is not used to process asdl files; ast.txt lives right next to astgen.py. Instead, the asdl file is processed by Parser/asdl_c.py.
>
> Yes, I know that. That's why I asked about the relation between ast.txt and Python.asdl. If internally the parser uses the .asdl, but expose as a reflection mechanism things that were generated from ast.txt, then there could be a mismatch. Where does ast.txt comes from ? Shouldn't it be generated itself from Python.asdl ?

What you may not be aware of is that Tools/compiler (and the compiler package that it builds on) are both unused and unmaintained. If the package stops working correctly - tough luck.

> So we would have Python.asdl -> ast.txt -> (astgen.py) -> ast.py containing all the UnarySub, Expression, classes that represents a Python AST.

No - what actually happens in Python 3.x is this: both the compiler package and Tools/compiler are removed.

Regards,
Martin
[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Hi,

The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also returns the BOM, whereas the BOM should be ignored.

See recent issues related to reading a UTF-8 text file including a BOM: #7185 (csv) and #7519 (ConfigParser). Such a file can be opened in unicode mode with the UTF-8-SIG encoding, but it's possible to do better.

I propose to improve open() (TextIOWrapper) by using the BOM to choose the right encoding. I think that only files opened in read-only mode should support this new feature. *Reading* the BOM in a *write*-only file would cause unexpected behaviour.

Since my proposal changes the result of TextIOWrapper.read()/readline() for files starting with a BOM, we might introduce an option to open() to enable the new behaviour. But is it really necessary to keep backward compatibility?

I wrote a proof of concept attached to issue #7651. My patch only changes the behaviour of TextIOWrapper for reading files starting with a BOM. It doesn't work yet if a seek() is used before the first read.

-- Victor Stinner http://www.haypocalc.com/
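The difference between the plain UTF-8 codec and the UTF-8-SIG codec mentioned above can be seen in a small sketch. It assumes Python 3, where open() accepts an encoding argument; the temporary file and its contents are made up for illustration.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM, as some editors do.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello\n".encode("utf-8"))

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# The "utf-8-sig" codec drops a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

The catch is that the caller has to know in advance to ask for "utf-8-sig", which is what the proposal tries to avoid.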
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

--Guido

-- --Guido van Rossum (python.org/~guido)
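A rough sketch of the kind of separate helper suggested above might look like the following. The name open_guessing_bom, its default argument, and the lookup table are hypothetical and not an existing or proposed standard library API; the sketch also assumes the binary stream is seekable and only looks at the BOM, with no further guessing.

```python
import codecs
import io

# UTF-32 marks are checked before UTF-16 ones because BOM_UTF32_LE
# begins with the same two bytes as BOM_UTF16_LE. As a consequence, a
# UTF-16-LE file whose first character is U+0000 would be misdetected.
_BOM_ENCODINGS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text
    stream, choosing the encoding from a leading BOM if there is one."""
    start = binary_stream.tell()
    head = binary_stream.read(4)
    binary_stream.seek(start)  # leave the BOM for the codec to consume
    encoding = default
    for bom, name in _BOM_ENCODINGS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)


sample = io.BytesIO("hello\n".encode("utf-16"))  # encoder prepends a BOM
print(repr(open_guessing_bom(sample).read()))
```

Keeping the guessing in a separate function leaves open() unchanged, so existing callers that rely on seeing the BOM are not affected.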
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

Alternatively, have a universal UTF-8/16/32 encoding, i.e. one that expects UTF-8, with or without BOM, or UTF-16/32 with BOM.
Re: [Python-Dev] --enabled-shared broken on freebsd5?
I think this problem probably needs to move over to distutils-sig, as it doesn't seem to be specific to the way that Python itself uses distutils. distutils.command.build_ext tests for Py_ENABLE_SHARED on linux and solaris and automatically adds '.' to the library_dirs, and I suspect it just needs to do this on FreeBSD as well (adding bsd to the list of platforms for which this is performed solves the problem, but I don't pretend to know enough about either distutils or freebsd to determine if this is the correct solution). -- Nick
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>> Hi, Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be ignored.
>
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.
Re: [Python-Dev] --enabled-shared broken on freebsd5?
Nicholas Bastin wrote:
> I think this problem probably needs to move over to distutils-sig, as it doesn't seem to be specific to the way that Python itself uses distutils. distutils.command.build_ext tests for Py_ENABLE_SHARED on linux and solaris and automatically adds '.' to the library_dirs, and I suspect it just needs to do this on FreeBSD as well (adding bsd to the list of platforms for which this is performed solves the problem, but I don't pretend to know enough about either distutils or freebsd to determine if this is the correct solution).

I wouldn't say it needed discussion on the SIG: just create a bug report, with the tentative patch you have worked out, and get it assigned to Tarek.

Tres.

-- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>>> Hi, Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be ignored.
>>
>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>
> It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?)

--
--Guido van Rossum (python.org/~guido)
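To make the point about the BOM being encoded concrete, here is a small illustration using only the standard codecs module. The BOM is the single code point U+FEFF, and its byte form differs per encoding; the name BOM below is just a local variable chosen for the example.

```python
import codecs

# U+FEFF is the code point used as the byte-order mark; the bytes it
# produces depend on the encoding.
BOM = "\ufeff"

assert BOM.encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert BOM.encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"
assert BOM.encode("utf-16-be") == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert BOM.encode("utf-32-le") == codecs.BOM_UTF32_LE == b"\xff\xfe\x00\x00"

# Read with a non-Unicode codec, the UTF-8 BOM bytes are three ordinary
# Latin-1 characters, not a mark of any kind.
assert codecs.BOM_UTF8.decode("latin-1") == "\xef\xbb\xbf"
```

This is why seeing the three bytes EF BB BF is evidence for UTF-8 but not proof: the same bytes are valid text in several single-byte encodings.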
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum writes:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk.

That doesn't stop many applications from doing it. Python should perhaps <wink, nudge> not produce UTF-8 + BOM without a disclaimer of indemnification against all resulting damage, signed in blood, from the user for each instance. But it should do something sane when reading such files. I can't really see any harm in throwing it away, especially since use of ZERO-WIDTH NO-BREAK SPACE as a joining character has been deprecated IIRC.
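For the UTF-8 case specifically, the standard library already ships a codec that throws the BOM away on reading: "utf-8-sig". The snippet below is only a demonstration of that existing codec, not of the open() change proposed in this thread; the variable names are made up for the example.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
assert with_bom.decode("utf-8") == "\ufeffhello"

# "utf-8-sig" drops a leading BOM if there is one, and is harmless otherwise.
assert with_bom.decode("utf-8-sig") == "hello"
assert without_bom.decode("utf-8-sig") == "hello"
```

Note that "utf-8-sig" also writes a BOM when encoding, so it is a reasonable choice for reading but not necessarily for writing.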
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>>>> Hi, Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be ignored.
>>>
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>>
>> It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.
>
> That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?)

The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case. A UTF-8 BOM is an absurdity, but it exists *everywhere* in the wild: Python would do well to make it as easy as possible to consume such files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs).

In the best of all possible worlds, I would just try opening the file so: f = open('/path/to/file', 'r', encoding=DWIFM) and any BOM present would set the encoding for the remainder of the stream.

Tres.
--
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software  Excellence by Design  http://palladion.com
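Something close to the behaviour Tres asks for can be approximated today with a small helper. This is a minimal sketch, not the proposed open() mode: the name open_guessing_bom, its default parameter, and the table _BOM_CODECS are all hypothetical. It only inspects the first four bytes, and it picks codecs ("utf-8-sig", "utf-16", "utf-32") that consume the BOM themselves.

```python
import codecs

# Check 4-byte BOMs before 2-byte ones: BOM_UTF32_LE begins with the same
# two bytes as BOM_UTF16_LE. Each codec named here strips the BOM itself.
_BOM_CODECS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def open_guessing_bom(path, default="utf-8"):
    """Hypothetical helper: open `path` as text, choosing the codec from a BOM."""
    with open(path, "rb") as raw:
        head = raw.read(4)
    encoding = default
    for bom, name in _BOM_CODECS:
        if head.startswith(bom):
            encoding = name
            break
    return open(path, "r", encoding=encoding)
```

One known ambiguity in this approach: a UTF-16-LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32-LE BOM, so the guess is a heuristic rather than a guarantee.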
Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM
On Jan 7, 2010, at 11:21 PM, Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>>
>> It *is* crazy, but unfortunately rather common. Wikipedia has a good description of the issues: http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark. Basically, some Windows text APIs will emit a UTF-8 BOM in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.
>
> That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded.

I'm saying that the BOM itself isn't enough to detect that the file is actually UTF-8. If (for whatever reason: explicitly specified, guessed in some other way) the file's encoding is determined to be something else, the bytes comprising the BOM should be decoded as normal. It's just that the UTF-8 decoding of the BOM at the start of a file should be "".

> (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?)

I think it's pretty clear that the BOM should still be skipped in that case ...
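The seek question can be checked against the existing "utf-8-sig" codec, which already behaves the way Glyph suggests when wrapped in a text stream. The sketch below uses an in-memory buffer in place of a real file; it shows current io.TextIOWrapper behaviour with that codec, not the behaviour of any proposed auto-detection mode.

```python
import codecs
import io

# An in-memory stand-in for a UTF-8 file that starts with a BOM.
raw = io.BytesIO(codecs.BOM_UTF8 + b"first line\n")
text = io.TextIOWrapper(raw, encoding="utf-8-sig")

first_read = text.read()

# Seeking back to the start resets the decoder, so the BOM is skipped again
# rather than reappearing as a stray U+FEFF.
text.seek(0)
second_read = text.read()

assert first_read == second_read == "first line\n"
```

Seeking to positions other than 0 is only supported for values previously returned by tell(), so byte offsets that would land inside the BOM never arise through the text API.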