Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On 02.03.13 22:32, Terry Reedy wrote:
> I am just curious: does 3.3 still intern (some) unicode chars? Did the 256 interned bytes of 2.x carry over to 3.x?

Yes, Python 3 interns the empty string and the first 256 Unicode characters.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
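One way to observe this on a given interpreter is to create each single-character string twice and check whether both results are the same object. The sketch below is only an illustration: `count_shared_single_chars` is a hypothetical helper, and whatever it reports is an implementation detail of the interpreter running it, not something the language guarantees.

```python
import sys


def count_shared_single_chars(limit=256):
    """Count code points below `limit` whose 1-char strings are shared.

    Each string is created twice with chr(); if both calls return the
    same object, the interpreter is reusing a cached single-char string.
    """
    shared = 0
    for code_point in range(limit):
        first = chr(code_point)
        second = chr(code_point)  # `first` is still alive here
        if first is second:
            shared += 1
    return shared


if __name__ == "__main__":
    # The count depends on the implementation and version in use.
    print(sys.implementation.name, count_shared_single_chars())
```

Passing a larger `limit` shows where, if anywhere, the sharing stops on the interpreter being tested.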
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On 01.03.13 17:24, Stefan Bucur wrote:
> Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?

I think this is not a bug if the code relies on the fact that an empty string is a singleton. This obviously is an immutable object and there is no public method to create different empty string. But a user can create different 1-character strings with the same value (first create an uninitialized 1-character string and then fill in its content). If some code fails when none of the 1-character strings are interned, this obviously is a bug.
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
2013/3/4 Serhiy Storchaka storch...@gmail.com
> On 01.03.13 17:24, Stefan Bucur wrote:
> > Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
>
> I think this is not a bug if the code relies on the fact that an empty string is a singleton. This obviously is an immutable object and there is no public method to create different empty string.

Really?

>>> x = u'\xe9'.encode('ascii', 'ignore')
>>> x == '', x is ''
(True, False)

--
Amaury Forgeot d'Arc
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Mon, Mar 4, 2013 at 11:06 AM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
> 2013/3/4 Serhiy Storchaka storch...@gmail.com
> > On 01.03.13 17:24, Stefan Bucur wrote:
> > > Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
> >
> > I think this is not a bug if the code relies on the fact that an empty string is a singleton. This obviously is an immutable object and there is no public method to create different empty string.
>
> Really?
>
> >>> x = u'\xe9'.encode('ascii', 'ignore')
> >>> x == '', x is ''
> (True, False)

Code that relies on this is incorrect (the language doesn't guarantee interning) but nevertheless given the intention of the implementation, that behavior of encode() is also a bug.

--
--Guido van Rossum (python.org/~guido)
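Since the language does not guarantee interning, portable code compares strings by value and treats identity as unspecified. A minimal sketch of that distinction follows; `is_empty_text` and `compare` are hypothetical helpers written for this illustration, not functions from CPython.

```python
def is_empty_text(text):
    """Return True when `text` has no characters, whatever its identity."""
    return text == ""


def compare(left, right):
    """Return (equal by value, same object) for two strings."""
    return left == right, left is right


if __name__ == "__main__":
    # Build the strings at run time so the compiler cannot merge constants.
    first = "".join(["a", "b"])
    second = "".join(["a", "b"])
    # The first element is always True; the second is an implementation
    # detail and must not be relied upon.
    print(compare(first, second))
    print(is_empty_text(first[:0]))
```

Only the first element of the `compare` result is stable across interpreters and versions, which is the point being made about code that tests `x is ''`.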
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
2013/3/4 Guido van Rossum gu...@python.org
> > >>> x = u'\xe9'.encode('ascii', 'ignore')
> > >>> x == '', x is ''
> > (True, False)
>
> Code that relies on this is incorrect (the language doesn't guarantee interning) but nevertheless given the intention of the implementation, that behavior of encode() is also a bug.

The example above is obviously from python2.7; there is a similar example with python3.2:

>>> x = b'\xe9\xe9'.decode('ascii', 'ignore')
>>> x == '', x is ''
(True, False)

...but this bug has been fixed in 3.3: PyUnicode_Resize() always returns the unicode_empty singleton.

--
Amaury Forgeot d'Arc
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
Hi,

2013/3/4 Amaury Forgeot d'Arc amaur...@gmail.com:
> The example above is obviously from python2.7; there is a similar example with python3.2:
>
> >>> x = b'\xe9\xe9'.decode('ascii', 'ignore')
> >>> x == '', x is ''
> (True, False)
>
> ...but this bug has been fixed in 3.3: PyUnicode_Resize() always returns the unicode_empty singleton.

Yeah, I tried to reuse singletons (empty string and latin-1 single letters) as much as possible to reduce memory footprint, not to ensure that an empty string is always the '' singleton. I wouldn't call this a bug.

Victor
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:
> Hi,
>
> I'm working on an automated bug finding tool that I'm trying to apply on the Python interpreter code (version 2.7.3). Because of early prototype limitations, I needed to disable string interning in stringobject.c. More precisely, I modified the PyString_FromStringAndSize and PyString_FromString to no longer check for the null and single-char cases, and create instead a new string every time (I can send the patch if needed).
>
> However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
>
> I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.

In theory, interning is supposed to be a pure optimisation, but it wouldn't surprise me if there are cases that assume the described strings are always interned (especially the null string case). Our test suite would never detect such bugs, as we never disable the interning.

Whether or not we're interested in fixing such bugs would depend on the size of the patches needed to address them. From our point of view, such bugs are purely theoretical (as the assumption is always valid in an unpatched CPython build), so if the problem is too hard to diagnose or fix, we're more likely to declare that interning of at least those kinds of string values is required for correctness when creating modified versions of CPython.

I'm not sure what kind of analyser you are writing, but if it relates to the CPython C API, you may be interested in https://gcc-python-plugin.readthedocs.org/en/latest/cpychecker.html

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Fri, 1 Mar 2013 16:24:42 +0100 Stefan Bucur stefan.bu...@gmail.com wrote:
> However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?

From a language POV, no, but inside a specific interpreter such as CPython it may be a reasonable expectation.

> I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.

Indeed, if it's a real bug it would be nice to get it fixed :-)

Regards

Antoine.
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On 3/2/2013 10:08 AM, Nick Coghlan wrote:
> On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:
> > Hi,
> >
> > I'm working on an automated bug finding tool that I'm trying to apply on the Python interpreter code (version 2.7.3). Because of early prototype limitations, I needed to disable string interning in stringobject.c. More precisely, I modified the PyString_FromStringAndSize and PyString_FromString to no longer check for the null and single-char cases, and create instead a new string every time (I can send the patch if needed).
> >
> > However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
> >
> > I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.
>
> In theory, interning is supposed to be a pure optimisation, but it wouldn't surprise me if there are cases that assume the described strings are always interned (especially the null string case). Our test suite would never detect such bugs, as we never disable the interning.

Since it required patching functions rather than a configuration switch, it literally seems not to be a supported option. If so, I would not consider it a bug for CPython to use the assumption of interning to run faster and I don't think it should be slowed down if that would be necessary to remove the assumption. (This is all assuming that the problem is not just a ref count bug.)

Stefan's question was about 2.7. I am just curious: does 3.3 still intern (some) unicode chars? Did the 256 interned bytes of 2.x carry over to 3.x?

> Whether or not we're interested in fixing such bugs would depend on the size of the patches needed to address them. From our point of view, such bugs are purely theoretical (as the assumption is always valid in an unpatched CPython build), so if the problem is too hard to diagnose or fix, we're more likely to declare that interning of at least those kinds of string values is required for correctness when creating modified versions of CPython.

--
Terry Jan Reedy
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Sat, Mar 2, 2013 at 4:08 PM, Nick Coghlan ncogh...@gmail.com wrote:
> On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:
> > Hi,
> >
> > I'm working on an automated bug finding tool that I'm trying to apply on the Python interpreter code (version 2.7.3). Because of early prototype limitations, I needed to disable string interning in stringobject.c. More precisely, I modified the PyString_FromStringAndSize and PyString_FromString to no longer check for the null and single-char cases, and create instead a new string every time (I can send the patch if needed).
> >
> > However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
> >
> > I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.
>
> In theory, interning is supposed to be a pure optimisation, but it wouldn't surprise me if there are cases that assume the described strings are always interned (especially the null string case). Our test suite would never detect such bugs, as we never disable the interning.

I understand. In this case, I'll further investigate the issue, and see what exactly is the cause of the crash.

> Whether or not we're interested in fixing such bugs would depend on the size of the patches needed to address them. From our point of view, such bugs are purely theoretical (as the assumption is always valid in an unpatched CPython build), so if the problem is too hard to diagnose or fix, we're more likely to declare that interning of at least those kinds of string values is required for correctness when creating modified versions of CPython.
>
> I'm not sure what kind of analyser you are writing, but if it relates to the CPython C API, you may be interested in https://gcc-python-plugin.readthedocs.org/en/latest/cpychecker.html

That's quite a neat tool, I didn't know about it! I guess that would have saved me many hours of debugging obscure refcount bugs in my own Python extensions :)

In any case, my analysis tool aims to find bugs in Python programs, not in the CPython implementation itself. It works by performing symbolic execution [1] on the Python interpreter, while it is executing the target Python program. This means that the Python interpreter memory space contains symbolic expressions (i.e., mathematical formulas over the program input) instead of concrete values.

The interned strings are pesky for symbolic execution because the PyObject* pointer returned when creating an interned string depends on the string contents: if the contents are already interned, the old pointer is returned, otherwise a new object is created. So the pointer itself becomes symbolic, i.e., dependent on the input data, which makes the analysis much more complicated.

Stefan

[1] http://en.wikipedia.org/wiki/Symbolic_execution
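The dependence of object identity on string contents can be sketched with a toy intern table. This is an illustration in Python, not the CPython implementation; `InternTable` and `make_text` are hypothetical names introduced here.

```python
class InternTable:
    """Toy intern table: one canonical object per distinct string value."""

    def __init__(self):
        self._canonical = {}

    def intern(self, text):
        # setdefault stores `text` only if no equal key exists yet and
        # always returns the stored (canonical) object.
        return self._canonical.setdefault(text, text)


def make_text(parts):
    """Build a string at run time, standing in for input-dependent data."""
    return "".join(parts)


if __name__ == "__main__":
    table = InternTable()
    first = table.intern(make_text(["sp", "am"]))
    second = table.intern(make_text(["sp", "am"]))
    third = table.intern(make_text(["eg", "gs"]))
    # Which object comes back depends on the contents seen so far.
    print(first is second, first is third)
```

Which object `intern` returns is decided by comparing contents against everything stored earlier, so when the contents are symbolic the returned reference is symbolic too.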
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net wrote:
> On Fri, 1 Mar 2013 16:24:42 +0100 Stefan Bucur stefan.bu...@gmail.com wrote:
> > However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
>
> From a language POV, no, but inside a specific interpreter such as CPython it may be a reasonable expectation.
>
> > I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.
>
> Indeed, if it's a real bug it would be nice to get it fixed :-)

By the way, in that case, what would be the best way to debug such type of ref count errors? I recently ran across this document [1], which kind of applies to debugging focused on newly introduced code. But when some changes potentially impact a good fraction of the interpreter, where should I look first?

I'm asking since I re-ran the failing test with gdb, and the segfault seems to occur when invoking the kill() syscall, so the error seems to manifest at some later point than when the faulty code is executed.

Stefan

[1] http://www.python.org/doc/essays/refcnt/
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
Debugging a refcount bug? Good. Out of the door, line on the left, one cross each.

2013/3/2 Stefan Bucur stefan.bu...@gmail.com
> On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net wrote:
> > On Fri, 1 Mar 2013 16:24:42 +0100 Stefan Bucur stefan.bu...@gmail.com wrote:
> > > However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
> >
> > From a language POV, no, but inside a specific interpreter such as CPython it may be a reasonable expectation.
> >
> > > I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.
> >
> > Indeed, if it's a real bug it would be nice to get it fixed :-)
>
> By the way, in that case, what would be the best way to debug such type of ref count errors? I recently ran across this document [1], which kind of applies to debugging focused on newly introduced code. But when some changes potentially impact a good fraction of the interpreter, where should I look first?
>
> I'm asking since I re-ran the failing test with gdb, and the segfault seems to occur when invoking the kill() syscall, so the error seems to manifest at some later point than when the faulty code is executed.
>
> Stefan
>
> [1] http://www.python.org/doc/essays/refcnt/
Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults
On Sat, 2 Mar 2013 22:13:56 +0100 Stefan Bucur stefan.bu...@gmail.com wrote:
> On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net wrote:
> > On Fri, 1 Mar 2013 16:24:42 +0100 Stefan Bucur stefan.bu...@gmail.com wrote:
> > > However, after applying this modification, when running make test I get a segfault in the test___all__ test case. Before digging deeper into the issue, I wanted to ask here if there are any implicit assumptions about string identity and interning throughout the interpreter implementation. For instance, are two single-char strings having the same content supposed to be identical objects?
> >
> > From a language POV, no, but inside a specific interpreter such as CPython it may be a reasonable expectation.
> >
> > > I'm assuming that it's either this, or some refcount bug in the interpreter that manifests only when certain strings are no longer interned and thus have a higher chance to get low refcount values.
> >
> > Indeed, if it's a real bug it would be nice to get it fixed :-)
>
> By the way, in that case, what would be the best way to debug such type of ref count errors? I recently ran across this document [1], which kind of applies to debugging focused on newly introduced code.

That document looks a bit outdated (1998!). I would suggest you enable core dumps (`ulimit -c unlimited`), then let Python crash and inspect the stack trace with gdb. You will get better results if using a debug build and the modern gdb inspection helpers: http://docs.python.org/devguide/gdb.html

Oh, by the way, it would be better to do your work on Python 3 rather than 2.7. Either the `default` branch or the `3.3` branch, I guess. See http://docs.python.org/devguide/setup.html#checkout

Regards

Antoine.
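For a test run driven from Python, the core dump limit can also be raised from inside the process, and the interpreter can be asked whether it looks like a debug build. The sketch below assumes a Unix platform (the `resource` module is not available elsewhere); `enable_core_dumps` and `is_debug_build` are hypothetical helper names, and the soft limit can only be raised as far as the hard limit allows.

```python
import resource
import sys


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    This matches `ulimit -c unlimited` only when the hard limit is
    itself unlimited. Returns the (soft, hard) pair now in effect.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def is_debug_build():
    """Report whether this interpreter looks like a CPython debug build.

    sys.gettotalrefcount exists only in debug builds of CPython.
    """
    return hasattr(sys, "gettotalrefcount")


if __name__ == "__main__":
    print("core limit (soft, hard):", enable_core_dumps())
    print("debug build:", is_debug_build())
```

Child processes inherit the limit, so calling `enable_core_dumps()` before spawning the failing test lets the crash leave a core file for gdb, subject to the system's core dump configuration.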