[issue14320] set.add can return boolean indicate newly added item

2012-03-16 Thread Matt Joiner

Matt Joiner anacro...@gmail.com added the comment:

Is there still some value to at least exposing this in the C API, per the 
precedents I mentioned? 

The patch also contains some adjustment to the set_add_entry/set_add_key 
abstraction dance, and some future proofing of PySet_Add return values that 
have merit on their own.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-16 Thread Giampaolo Rodola'

Changes by Giampaolo Rodola' g.rod...@gmail.com:


--
nosy: +giampaolo.rodola

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-15 Thread Matt Joiner

New submission from Matt Joiner anacro...@gmail.com:

set.add can return True to indicate a newly added item to the set, or False if 
the item was already present.

The C function PySet_Add returns -1 on error, and 0 on success currently. This 
is extended to return 1 if the item is newly added, and 0 if it was already 
present. Precedents exist for PySet_Contains and PySet_Discard with the same 
semantics. There are only 3 calls that need to be patched in the entire core, 
these are included in the patch.

The Python function set.add currently returns None. This is extended to return 
True if if the item is new to the set, and False otherwise. No object overhead 
is introduced as set.add is already currently executing Py_RETURN_NONE.

Benchmarks indicate that set.add if the item isn't already present (returning 
True in the patched instance) reduces time by 5-9%. set.add if the item already 
exists increases time by 1-3%. I'm happy to put these down to effects of 
touching True and False instead of None on return from set.add.

Patch and primitive performance test attached.

--
components: Interpreter Core
files: set-add-return-bool.patch
keywords: patch
messages: 155909
nosy: anacrolix, eric.araujo, gvanrossum, petri.lehtinen
priority: normal
severity: normal
status: open
title: set.add can return boolean indicate newly added item
type: enhancement
versions: Python 3.3
Added file: http://bugs.python.org/file24862/set-add-return-bool.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-15 Thread Matt Joiner

Changes by Matt Joiner anacro...@gmail.com:


Added file: http://bugs.python.org/file24863/bench_set_add.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-15 Thread Raymond Hettinger

Changes by Raymond Hettinger raymond.hettin...@gmail.com:


--
assignee:  - rhettinger
nosy: +rhettinger

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-15 Thread Raymond Hettinger

Raymond Hettinger raymond.hettin...@gmail.com added the comment:

The part of the patch for PySet_Add() is a reasonable improvement to the C API 
if it doesn't conflict with Martin's stable ABI effort.

The question of whether to change the Python API requires much more thought and 
I'll do some research and evaluate it more thoroughly over the next few weeks.  
Here are some of the considerations:

* The set API currently has a near zero learning curve.  We want to keep it 
that way.  I'm teaching classes over the next few weeks and will try out the 
proposal on my students.

* For collections that are commonplace in other languages, I look to their 
experience and design for inspiration.  I'll look at was done in Smalltalk, 
Java, and ObjectiveC (with dynamic languages being a better model than 
statically compiled languages).  In particular, I look to SETL when evaluating 
the utility of proposed changes to the set API (a little like looking to Matlab 
when thinking about designing a matrix API).

* I'm concerned about the intuitiveness of the polarity of the proposed method 
and will try it out on other programmers to see whether if s.add(e): ... gets 
interpreted as true if e is already added or true if the adding a new item. 
 The sense of set.add() is the opposite of set.__contains__, so we should be 
careful about making a API change with an ambiguous or error-prone 
interpretation.

* As written, the proposal seems to be about efficiency rather than clarity.  
I'll run my own timings to see if they make any difference in typical 
applications of set.add().  In addition, I'll consult the Jython folks to see 
if it makes a difference in their world (I suspect it won't -- they use native 
Java objects and the Java JIT handily optimizes away the traditional calling 
pattern).  Also, I'll consult the PyPy folks to see whether they can provide 
the optimization automatically rather than via an API change.

* The suggested API also needs to be viewed in the context of what other Python 
APIs do.  For the most part, the language has an aversion to combining tests 
and assignments.  For example, Python doesn't do while (buf = 
f.read(bufsize)): ... eventhough that is traditionally supported in statically 
compiled languages.  There is a precedent with dict.setdefault(); however, that 
is often regarded as one of the least beautiful parts of the API in Python's 
basic collection objects.

* I also want to look back a previous discussions on this topic.  The set API 
had a slow and careful evolution starting with a PEP, being exposed as a pure 
python module, and being coded in C as a builtin type.   The API was built 
by Alex Martelli, Guido, Tim Peters, Greg Wilson and myself with substantial 
input from the community.  None of the designers sought to include this 
functionality and it wasn't because it hadn't occurred to the them or that they 
were unaware of typical use cases.  In addition, having set.add() return a 
boolean was discussed and rejected on python-dev (I've forgotten whether it was 
last year or the year before).  Some care should be taken before dismissing the 
judgment of the designers who've previously spent time thinking this out.

* Lastly, we need to look at code examples to see whether they read better or 
whether clarity is being lost in the name of efficiency.  We should look at 
both sophisticated examples (i.e. sets are part of multistep logic) and minimal 
examples (i.e. where the set logic is dominant).  Here is a before-and-after 
for the minimal case:

def dedup_before(iterable):
'''Order preserving elimination of duplicates'''
seen = set()
for i in iterable:
if i not in seen:
seen.add(i)
yield i

def dedup_after(iterable):
'''Order preserving elimination of duplicates'''
seen = Set()
for i in iterable:
if seen.add(i):
yield i


As you can see, there is more to API design than just spotting an opportunity 
to fold two steps into one.

--
priority: normal - low
type: enhancement - performance

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14320] set.add can return boolean indicate newly added item

2012-03-15 Thread Raymond Hettinger

Raymond Hettinger raymond.hettin...@gmail.com added the comment:

Here's the results of the research so far:

Guido gave it a -1:   
http://mail.python.org/pipermail/python-ideas/2012-March/014510.html

Smalltalk does not return a boolean:
http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#Set
http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#HashedCollection
add: newObject
Add newObject to the set, if and only if the set doesn't already contain an 
occurrence of it. Don't fail if a duplicate is found. Answer anObject

Java does return a boolean for Interface SetE: 
http://docs.oracle.com/javase/6/docs/api/
boolean add(E e)  Adds the specified element to this set if it is not 
already present (optional operation)

The SETL programming language does not return a boolean for the set plus one 
element operation.  Instead, it returns the new set.
http://setl.org/setl/doc/setl-lib.html#with

Real-world examples of code using set.add in a way that benefits from the 
boolean result do not look promising.  They match the pattern in the minimal 
example shown in the previous post, if e in previously_seen.add(e): ...   
This is problematic because it can easily be read as take this branch if e has 
been previously seen.

I consulted with the PyPy developers and found that PyPy is already eliminating 
double lookups in common cases (it recognizes that the two lookups refer to the 
same entry and constant folds them).  It cannot do this in the general case 
because of a potential semantic change (i.e. __eq__ can change the set during 
the initial lookup).

Looking at Jython, it appears that both sets and dicts are based on Java's 
concurrent mapping object which doesn't support a boolean result directly but 
allows the JIT to optimize away the double lookup depending on the JIT and on 
whether the keys are of all the same type.

One other note, in general the Python language makes no guarantees about 
atomicity (different implementations are free to make different implementation 
choices).  For example, people relying on dict.setdefault() to be atomic were 
making an incorrect assumption (until recently).  Our standing recommendation 
is to use locks unless your willing to rely on an implementation detail.  In 
the case of set.add(), the race condition is harmless since set.add() is 
idempotent.

I've shown sample code to some other developers and they had mixed guesses 
about the polarity of s.add(e), whether it would return True or False if the e 
was already in the set.  A mixed result is not good and implies that the 
proposal is error prone.  

Also, when shown the dedup() code example, the devs didn't feel that the 
additional conciseness was worth it (one said, it just looks funny and causes 
me to do a double-take, but the original code could be read effortlessly).

My timings on the dedup() code showed a 2% speedup on an iterable of strings 
with 80% duplicates and no improvement on an iterable of strings with 5% 
duplicates.  The improvement was mostly erased if a seen_add = seen.add bound 
method was used inside the loop.  This indicates that the double lookup is 
cheap compared to the cost of the dotted lookup in seen.add(e).  That being 
said, objects with a custom __hash__ would benefit from calling __hash__ only 
once.

Given that the results above were uniformly negative (except for the Java 
library), I'll skip trying this out on my students and am rejecting the 
suggested change to the pure Python API.  We appreciate your suggestion but it 
isn't appropriate for inclusion in the language.

Some of those concerns may also apply to the suggested change to the C API

--
resolution:  - rejected
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14320
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com