Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Serhiy Storchaka

On 02.03.13 22:32, Terry Reedy wrote:

I am just curious: does 3.3 still
intern (some) unicode chars? Did the 256 interned bytes of 2.x carry
over to 3.x?


Yes, Python 3 interns the empty string and the first 256 Unicode characters.
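
For readers who want to check this on their own build, here is a minimal sketch. It only observes behaviour: the language does not guarantee any of it, so the count may differ between implementations and versions. The helper name count_shared_single_chars is made up for this illustration.

```python
def count_shared_single_chars(limit=256):
    """Count code points below `limit` (at most 256) for which two
    separately built one-character strings are the same object."""
    shared = 0
    for code in range(limit):
        first = chr(code)
        # Build the same character by a different route (Latin-1 decode).
        second = bytes([code]).decode("latin-1")
        assert first == second      # equality always holds
        if first is second:         # identity is an implementation detail
            shared += 1
    return shared


print(count_shared_single_chars())
```

If the implementation shares all of these characters, the printed count is 256; an interpreter that shares none would print 0 and still be a conforming Python.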


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Serhiy Storchaka

On 01.03.13 17:24, Stefan Bucur wrote:

Before digging deeper into the issue, I wanted to ask here if there are
any implicit assumptions about string identity and interning throughout
the interpreter implementation. For instance, are two single-char
strings having the same content supposed to be identical objects?


I think it is not a bug if the code relies on the fact that the empty 
string is a singleton. It is obviously an immutable object, and there 
is no public method to create a different empty string. But a user can 
create different 1-character strings with the same value (first create 
an uninitialized 1-character string and then fill in its content). If 
some code fails when none of the 1-character strings are interned, this 
obviously is a bug.





Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Amaury Forgeot d'Arc
2013/3/4 Serhiy Storchaka storch...@gmail.com

 On 01.03.13 17:24, Stefan Bucur wrote:

 Before digging deeper into the issue, I wanted to ask here if there are
 any implicit assumptions about string identity and interning throughout
 the interpreter implementation. For instance, are two single-char
 strings having the same content supposed to be identical objects?


 I think it is not a bug if the code relies on the fact that the empty
 string is a singleton. It is obviously an immutable object, and there is
 no public method to create a different empty string.


Really?

>>> x = u'\xe9'.encode('ascii', 'ignore')
>>> x == '', x is ''
(True, False)


-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Guido van Rossum
On Mon, Mar 4, 2013 at 11:06 AM, Amaury Forgeot d'Arc
amaur...@gmail.com wrote:


 2013/3/4 Serhiy Storchaka storch...@gmail.com

 On 01.03.13 17:24, Stefan Bucur wrote:

 Before digging deeper into the issue, I wanted to ask here if there are
 any implicit assumptions about string identity and interning throughout
 the interpreter implementation. For instance, are two single-char
 strings having the same content supposed to be identical objects?


 I think it is not a bug if the code relies on the fact that the empty
 string is a singleton. It is obviously an immutable object, and there is no
 public method to create a different empty string.


 Really?

 x = u'\xe9'.encode('ascii', 'ignore')
 x == '', x is ''
 (True, False)

Code that relies on this is incorrect (the language doesn't guarantee
interning) but nevertheless given the intention of the implementation,
that behavior of encode() is also a bug.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Amaury Forgeot d'Arc
2013/3/4 Guido van Rossum gu...@python.org

  x = u'\xe9'.encode('ascii', 'ignore')
  x == '', x is ''
  (True, False)

 Code that relies on this is incorrect (the language doesn't guarantee
 interning) but nevertheless given the intention of the implementation,
 that behavior of encode() is also a bug.


The example above is obviously from python2.7; there is a similar example
with python3.2:
>>> x = b'\xe9\xe9'.decode('ascii', 'ignore')
>>> x == '', x is ''
(True, False)

...but this bug has been fixed in 3.3: PyUnicode_Resize() always returns
the unicode_empty singleton.
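
A small sketch for trying this on whatever Python 3 is at hand. It takes the decode() call from the example above and compares the result against an ordinary empty string. Only the equality is guaranteed by the language; the identity result is printed without an expected value because it depends on the implementation and version. The function name decode_to_empty is just a label for this illustration.

```python
def decode_to_empty():
    # Both bytes are non-ASCII, so the 'ignore' handler drops them and
    # the result is an empty string built at run time.
    return b"\xe9\xe9".decode("ascii", "ignore")


x = decode_to_empty()
empty = ""

print(x == empty)   # value comparison: what portable code should rely on
print(x is empty)   # identity comparison: an implementation detail
```

Code that needs to test for emptiness can use `x == ""` or `not x` and stay correct whether or not the singleton is reused.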

-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-04 Thread Victor Stinner
Hi,

2013/3/4 Amaury Forgeot d'Arc amaur...@gmail.com:
 The example above is obviously from python2.7; there is a similar example
 with python3.2:
 x = b'\xe9\xe9'.decode('ascii', 'ignore')
 x == '', x is ''
 (True, False)

 ...but this bug has been fixed in 3.3: PyUnicode_Resize() always returns the
 unicode_empty singleton.

Yeah, I tried to reuse singletons (empty string and latin-1 single
letters) as much as possible to reduce memory footprint, not to ensure
that an empty string is always the '' singleton.

I wouldn't call this a bug.

Victor


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Nick Coghlan
On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:
 Hi,

 I'm working on an automated bug finding tool that I'm trying to apply on the
 Python interpreter code (version 2.7.3). Because of early prototype
 limitations, I needed to disable string interning in stringobject.c. More
 precisely, I modified the PyString_FromStringAndSize and PyString_FromString
 to no longer check for the null and single-char cases, and create instead a
 new string every time (I can send the patch if needed).

 However, after applying this modification, when running make test I get a
 segfault in the test___all__ test case.

 Before digging deeper into the issue, I wanted to ask here if there are any
 implicit assumptions about string identity and interning throughout the
 interpreter implementation. For instance, are two single-char strings having
 the same content supposed to be identical objects?

 I'm assuming that it's either this, or some refcount bug in the interpreter
 that manifests only when certain strings are no longer interned and thus
 have a higher chance to get low refcount values.

In theory, interning is supposed to be a pure optimisation, but it
wouldn't surprise me if there are cases that assume the described
strings are always interned (especially the null string case). Our
test suite would never detect such bugs, as we never disable the
interning.

Whether or not we're interested in fixing such bugs would depend on
the size of the patches needed to address them. From our point of
view, such bugs are purely theoretical (as the assumption is always
valid in an unpatched CPython build), so if the problem is too hard to
diagnose or fix, we're more likely to declare that interning of at
least those kinds of string values is required for correctness when
creating modified versions of CPython.

I'm not sure what kind of analyser you are writing, but if it relates
to the CPython C API, you may be interested in
https://gcc-python-plugin.readthedocs.org/en/latest/cpychecker.html

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Antoine Pitrou
On Fri, 1 Mar 2013 16:24:42 +0100
Stefan Bucur stefan.bu...@gmail.com wrote:
 
 However, after applying this modification, when running make test I get a
 segfault in the test___all__ test case.
 
 Before digging deeper into the issue, I wanted to ask here if there are any
 implicit assumptions about string identity and interning throughout the
 interpreter implementation. For instance, are two single-char strings
 having the same content supposed to be identical objects?

From a language POV, no, but inside a specific interpreter such as
CPython it may be a reasonable expectation.

 I'm assuming that it's either this, or some refcount bug in the interpreter
 that manifests only when certain strings are no longer interned and thus
 have a higher chance to get low refcount values.

Indeed, if it's a real bug it would be nice to get it fixed :-)

Regards

Antoine.




Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Terry Reedy

On 3/2/2013 10:08 AM, Nick Coghlan wrote:

On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:

Hi,

I'm working on an automated bug finding tool that I'm trying to apply on the
Python interpreter code (version 2.7.3). Because of early prototype
limitations, I needed to disable string interning in stringobject.c. More
precisely, I modified the PyString_FromStringAndSize and PyString_FromString
to no longer check for the null and single-char cases, and create instead a
new string every time (I can send the patch if needed).

However, after applying this modification, when running make test I get a
segfault in the test___all__ test case.

Before digging deeper into the issue, I wanted to ask here if there are any
implicit assumptions about string identity and interning throughout the
interpreter implementation. For instance, are two single-char strings having
the same content supposed to be identical objects?

I'm assuming that it's either this, or some refcount bug in the interpreter
that manifests only when certain strings are no longer interned and thus
have a higher chance to get low refcount values.


In theory, interning is supposed to be a pure optimisation, but it
wouldn't surprise me if there are cases that assume the described
strings are always interned (especially the null string case). Our
test suite would never detect such bugs, as we never disable the
interning.


Since it required patching functions rather than a configuration switch, 
it literally seems not to be a supported option. If so, I would not 
consider it a bug for CPython to use the assumption of interning to run 
faster and I don't think it should be slowed down if that would be 
necessary to remove the assumption. (This is all assuming that the 
problem is not just a ref count bug.)


Stefan's question was about 2.7. I am just curious: does 3.3 still 
intern (some) unicode chars? Did the 256 interned bytes of 2.x carry 
over to 3.x?



Whether or not we're interested in fixing such bugs would depend on
the size of the patches needed to address them. From our point of
view, such bugs are purely theoretical (as the assumption is always
valid in an unpatched CPython build), so if the problem is too hard to
diagnose or fix, we're more likely to declare that interning of at
least those kinds of string values is required for correctness when
creating modified versions of CPython.


--
Terry Jan Reedy



Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Stefan Bucur
On Sat, Mar 2, 2013 at 4:08 PM, Nick Coghlan ncogh...@gmail.com wrote:
 On Sat, Mar 2, 2013 at 1:24 AM, Stefan Bucur stefan.bu...@gmail.com wrote:
 Hi,

 I'm working on an automated bug finding tool that I'm trying to apply on the
 Python interpreter code (version 2.7.3). Because of early prototype
 limitations, I needed to disable string interning in stringobject.c. More
 precisely, I modified the PyString_FromStringAndSize and PyString_FromString
 to no longer check for the null and single-char cases, and create instead a
 new string every time (I can send the patch if needed).

 However, after applying this modification, when running make test I get a
 segfault in the test___all__ test case.

 Before digging deeper into the issue, I wanted to ask here if there are any
 implicit assumptions about string identity and interning throughout the
 interpreter implementation. For instance, are two single-char strings having
 the same content supposed to be identical objects?

 I'm assuming that it's either this, or some refcount bug in the interpreter
 that manifests only when certain strings are no longer interned and thus
 have a higher chance to get low refcount values.

 In theory, interning is supposed to be a pure optimisation, but it
 wouldn't surprise me if there are cases that assume the described
 strings are always interned (especially the null string case). Our
 test suite would never detect such bugs, as we never disable the
 interning.

I understand. In this case, I'll further investigate the issue, and
see what exactly is the cause of the crash.


 Whether or not we're interested in fixing such bugs would depend on
 the size of the patches needed to address them. From our point of
 view, such bugs are purely theoretical (as the assumption is always
 valid in an unpatched CPython build), so if the problem is too hard to
 diagnose or fix, we're more likely to declare that interning of at
 least those kinds of string values is required for correctness when
 creating modified versions of CPython.

 I'm not sure what kind of analyser you are writing, but if it relates
 to the CPython C API, you may be interested in
 https://gcc-python-plugin.readthedocs.org/en/latest/cpychecker.html

That's quite a neat tool, I didn't know about it! I guess that would
have saved me many hours of debugging obscure refcount bugs in my own
Python extensions :)

In any case, my analysis tool aims to find bugs in Python programs,
not in the CPython implementation itself. It works by performing
symbolic execution [1] on the Python interpreter, while it is
executing the target Python program. This means that the Python
interpreter memory space contains symbolic expressions (i.e.,
mathematical formulas over the program input) instead of concrete
values.

The interned strings are pesky for symbolic execution because the
PyObject* pointer allocated when creating an interned string depends
on the string contents, e.g., if the contents are already interned,
the old pointer is returned, otherwise a new object is created. So the
pointer itself becomes symbolic, i.e., dependent on the input data,
which makes the analysis much more complicated.
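
To make the point concrete, here is a toy model in Python. It is not CPython's implementation; ToyString, InternTable and fresh are hypothetical names invented for this sketch. It only mirrors the behaviour described above: with interning, which object comes back depends on the contents, whereas an always-allocate constructor returns a new object no matter what the contents are.

```python
class ToyString:
    """Stand-in for a string object; only its identity matters here."""

    def __init__(self, data):
        self.data = data


class InternTable:
    """Toy intern table keyed by contents (an assumption of this sketch)."""

    def __init__(self):
        self._table = {}

    def from_data(self, data):
        # If the contents were seen before, hand back the old object;
        # otherwise allocate a new one and remember it.
        existing = self._table.get(data)
        if existing is not None:
            return existing
        obj = ToyString(data)
        self._table[data] = obj
        return obj


def fresh(data):
    # The "interning disabled" behaviour: always allocate a new object.
    return ToyString(data)


table = InternTable()
a = table.from_data("x")
b = table.from_data("x")
c = table.from_data("y")

same_interned = a is b                  # same contents, same object
different_contents = a is c             # different contents, different object
same_fresh = fresh("x") is fresh("x")   # never shared when always allocating
```

In the interned case the branch taken inside from_data depends on the data, so under symbolic execution the returned pointer depends on the input; in the fresh case it does not.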

Stefan

[1] http://en.wikipedia.org/wiki/Symbolic_execution


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Stefan Bucur
On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net wrote:
 On Fri, 1 Mar 2013 16:24:42 +0100
 Stefan Bucur stefan.bu...@gmail.com wrote:

 However, after applying this modification, when running make test I get a
 segfault in the test___all__ test case.

 Before digging deeper into the issue, I wanted to ask here if there are any
 implicit assumptions about string identity and interning throughout the
 interpreter implementation. For instance, are two single-char strings
 having the same content supposed to be identical objects?

 From a language POV, no, but inside a specific interpreter such as
 CPython it may be a reasonable expectation.

 I'm assuming that it's either this, or some refcount bug in the interpreter
 that manifests only when certain strings are no longer interned and thus
 have a higher chance to get low refcount values.

 Indeed, if it's a real bug it would be nice to get it fixed :-)

By the way, in that case, what would be the best way to debug this
type of refcount error? I recently ran across this document [1],
which mostly applies to debugging newly introduced code.
But when some changes potentially impact a good fraction of the
interpreter, where should I look first?

I'm asking since I re-ran the failing test with gdb, and the segfault
seems to occur when invoking the kill() syscall, so the error seems to
manifest at some later point than when the faulty code is executed.

Stefan

[1] http://www.python.org/doc/essays/refcnt/


Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Lukas Lueg
Debugging a refcount bug? Good. Out of the door, line on the left, one
cross each.


2013/3/2 Stefan Bucur stefan.bu...@gmail.com

 On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net
 wrote:
  On Fri, 1 Mar 2013 16:24:42 +0100
  Stefan Bucur stefan.bu...@gmail.com wrote:
 
  However, after applying this modification, when running make test I get a
  segfault in the test___all__ test case.
 
  Before digging deeper into the issue, I wanted to ask here if there are any
  implicit assumptions about string identity and interning throughout the
  interpreter implementation. For instance, are two single-char strings
  having the same content supposed to be identical objects?
 
  From a language POV, no, but inside a specific interpreter such as
  CPython it may be a reasonable expectation.
 
  I'm assuming that it's either this, or some refcount bug in the interpreter
  that manifests only when certain strings are no longer interned and thus
  have a higher chance to get low refcount values.
 
  Indeed, if it's a real bug it would be nice to get it fixed :-)

 By the way, in that case, what would be the best way to debug such
 type of ref count errors? I recently ran across this document [1],
 which kind of applies to debugging focused on newly introduced code.
 But when some changes potentially impact a good fraction of the
 interpreter, where should I look first?

 I'm asking since I re-ran the failing test with gdb, and the segfault
 seems to occur when invoking the kill() syscall, so the error seems to
 manifest at some later point than when the faulty code is executed.

 Stefan

 [1] http://www.python.org/doc/essays/refcnt/



Re: [Python-Dev] Disabling string interning for null and single-char causes segfaults

2013-03-02 Thread Antoine Pitrou
On Sat, 2 Mar 2013 22:13:56 +0100
Stefan Bucur stefan.bu...@gmail.com wrote:

 On Sat, Mar 2, 2013 at 4:31 PM, Antoine Pitrou solip...@pitrou.net wrote:
  On Fri, 1 Mar 2013 16:24:42 +0100
  Stefan Bucur stefan.bu...@gmail.com wrote:
 
  However, after applying this modification, when running make test I get a
  segfault in the test___all__ test case.
 
  Before digging deeper into the issue, I wanted to ask here if there are any
  implicit assumptions about string identity and interning throughout the
  interpreter implementation. For instance, are two single-char strings
  having the same content supposed to be identical objects?
 
  From a language POV, no, but inside a specific interpreter such as
  CPython it may be a reasonable expectation.
 
  I'm assuming that it's either this, or some refcount bug in the interpreter
  that manifests only when certain strings are no longer interned and thus
  have a higher chance to get low refcount values.
 
  Indeed, if it's a real bug it would be nice to get it fixed :-)
 
 By the way, in that case, what would be the best way to debug such
 type of ref count errors? I recently ran across this document [1],
 which kind of applies to debugging focused on newly introduced code.

That document looks a bit outdated (1998!).
I would suggest you enable core dumps (`ulimit -c unlimited`), then let
Python crash and inspect the stack trace with gdb.
You will get better results if using a debug build and the modern gdb
inspection helpers:
http://docs.python.org/devguide/gdb.html
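
If the crashing interpreter is launched from a Python wrapper script rather than from a shell, the core-dump limit can be raised from within that script. This is a minimal, Unix-only sketch using the standard resource module. It raises the soft limit to the hard limit, which matches `ulimit -c unlimited` only when the hard limit is itself unlimited. The function name enable_core_dumps is made up for this illustration.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the current hard limit.

    Returns the (soft, hard) pair in effect afterwards. The new limit
    applies to this process and to child processes started later.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


print(enable_core_dumps())
```

Whether and where a core file is then written still depends on system settings outside Python's control.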

Oh, by the way, it would be better to do your work on Python 3 rather
than 2.7. Either the `default` branch or the `3.3` branch, I guess.
See http://docs.python.org/devguide/setup.html#checkout

Regards

Antoine.