Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Inada Naoki
On Wed, Apr 24, 2019 at 6:17 AM Mark Shannon  wrote:
>
> Hi,
>
> On 12/04/2019 2:44 pm, Inada Naoki wrote:
> > Hi, all.
> >
> > I propose adding new method: dict.with_values(iterable)
>
> You can already do something like this, if memory saving is the main
> concern. This should work on all versions from 3.3.
>

Of course, performance is main concern too.

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Mark Shannon

Hi,

On 12/04/2019 2:44 pm, Inada Naoki wrote:

Hi, all.

I propose adding new method: dict.with_values(iterable)


You can already do something like this, if memory saving is the main 
concern. This should work on all versions from 3.3.



def shared_keys_dict_maker(keys):
class C: pass
instance = C()
for key in keys:
for key in keys:
setattr(instance, key, None)
prototype = instance.__dict__
def maker(values):
result = prototype.copy()
result.update(zip(keys, values))
return result
return maker

m = shared_keys_dict_maker(('a', 'b'))

>>> d1 = {'a':1, 'b':2}
>>> print(sys.getsizeof(d1))
... 248

>>> d2 = m((1,2))
>>> print(sys.getsizeof(d2))
... 120

>>> d3 = m((None,"Hi"))
>>> print(sys.getsizeof(d3))
... 120





# Motivation

Python is used to handle data.
While dict is not efficient way to handle may records, it is still
convenient way.

When creating many dicts with same keys, dict need to
lookup internal hash table while inserting each keys.

It is costful operation.  If we can reuse existing keys of dict,
we can skip this inserting cost.

Additionally, we have "Key-Sharing Dictionary (PEP 412)".
When all keys are string, many dict can share one key.
It reduces memory consumption.

This might be usable for:

* csv.DictReader
* namedtuple._asdict()
* DB-API 2.0 implementations:  (e.g. DictCursor of mysqlclient-python)


# Draft implementation

pull request: https://github.com/python/cpython/pull/12802

with_values(self, iterable, /)
 Create a new dictionary with keys from this dict and values from iterable.

 When length of iterable is different from len(self), ValueError is raised.
 This method does not support dict subclass.


## Memory usage (Key-Sharing dict)


import sys
keys = tuple("abcdefg")
keys

('a', 'b', 'c', 'd', 'e', 'f', 'g')

d = dict(zip(keys, range(7)))
d

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6}

sys.getsizeof(d)

360


keys = dict.fromkeys("abcdefg")
d = keys.with_values(range(7))
d

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6}

sys.getsizeof(d)

144

## Speed

$ ./python -m perf timeit -o zip_dict.json -s 'keys =
tuple("abcdefg"); values=[*range(7)]' 'dict(zip(keys, values))'

$ ./python -m perf timeit -o with_values.json -s 'keys =
dict.fromkeys("abcdefg"); values=[*range(7)]'
'keys.with_values(values)'

$ ./python -m perf compare_to zip_dict.json with_values.json
Mean +- std dev: [zip_dict] 935 ns +- 9 ns -> [with_values] 109 ns +-
2 ns: 8.59x faster (-88%)


How do you think?
Any comments are appreciated.

Regards,


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Serhiy Storchaka

12.04.19 19:17, Inada Naoki пише:

Maybe, collections.DictBuilder may be another option.  e.g.


from collections import DictBuilder
builder = DictBuilder(tuple("abc"))
builder.build(range(3))

{"a": 0, "b": 1, "c": 2}


Nitpicking: this is rather a factory than a builder. The difference 
between the patterns is that you create a new builder object for every dict:


builder = DictBuilder()
builder['a'] = 0
builder['b'] = 1
builder['c'] = 2
result = builder.build()

and create a fabric only for the whole class of dicts:

factory = DictFactory(tuple("abc"))  # only once
...
result = factory(range(3))

I like the idea of a factory object more than the idea of the dict method.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Inada Naoki
On Wed, Apr 24, 2019 at 12:34 AM Steve Dower  wrote:
>
> >> If it's a key-sharing dict, then all the keys are strings. We know that
> >> when we go to do the update, so we can intern all the strings (going to
> >> do that anyway) and then it's a quick check if it already exists. If
> >> it's a regular dict, then we calculate hashes as normal. Updating the
> >> value is just a decref, incref and assignment.
> >
> > There are some problem.
> >
> > 1. Searching hash table is not zero-cost, comparing to appending to 
> > sequence.
> > This cost is very near to building new hash tables.
>
> If we know that you're sharing keys with the new items then we can skip
> the search. This was my point about the d2 = copy(d1); d2.update(zip(d2,
> values)) idea:
>

OK, I got it.
But note that zip object doesn't expose items, neither Python level or C level.


> > 2. In my proposal, main user is csv.DictReader or sql.DictCursor.
> > They parse only values on each rows.   So they need to use map.
>
> In that case, use a private helper. _csv already has a native module. We
> don't need to add new public APIs for internal optimisations, provided
> there is a semantically equivalent way to do it without the internal API.

csv is stdlib.  But there are some third party extensions similar to csv.

>
> > 3. (CPython only) dict.copy(), dict(dict), and dict.update() are general 
> > purpose
> > methods.  There is no obvious place to start using key-sharing dict.
>
> See my reply to Glenn, but potentially fromkeys() could start with the
> key-sharing dict and then copy()/dict() could continue sharing it
> (hopefully they already do?).

Key-sharing dict is used only for instance dict at the moment.

2nd argument of dict.fromkeys() is value, not values.
How about adding dict.fromkeyvalues(keys, values)?
When keys is dict, it's behavior is same to my first proposal
(`dict.with_values(d1, values)`).

> >
> > If *CPython* specialized dict(zip(dict, values)),  it still be CPython
> > implementation detail.
> > Do you want recommend using such CPython hacky optimization?
> > Should we use such optimization in stdlib, even if it will be slower
> > than dict(zip(keys_tuple, values)) on some other Python implementations?
>
> We do "hacky" optimisations everywhere :) The point of the runtime is to
> let users write code that works and we do the effort behind the scenes
> to make it efficient. We're not C - we're here to help our users.

But we avoid CPython-only hack which will make stdlib slower on other
Python implementations as possible.
For example, we optimize `s1 += s` loop.  But we use `''.join(list_of_str)`
instead of it.

>
> The point is that it will work on other implementations - including
> previous versions of CPython - and those are free to optimise it however
> they like.
>
> > Or do you propose making dict(zip(dict, values)) optimization as
> > language specification?
>
> Definitely not! It's just a pattern that we have the ability to
> recognize and optimize at runtime, so why not do it?

Why we need to recommend patterns fast only in CPython?

  d2 = dict.fromkeys(keys_dict)   # make key sharing dict, only in CPython 3.8+
  d2.update(zip(d2, row))  # update values without key lookup, only in
CPython 3.8+

Obviously, this may be much slower than `d2 = dict(zip(keys_tuple, row))` on
current CPython and other Python implementations.

Note that this pattern will be used when dict creation is bottleneck.

If we has specialized API, libraries can use it if the API is available,
and use dict(zip(keys, row)) otherwise.


>
> > One obvious advantage of having DictBuilder is it is for specific
> > purpose.  It has at least same performance to dict(zip(keys, values))
> > on all Python implementations.
> > Libraries like csv parser can use it without worrying about its performance
> > on Python other than CPython.
>
> A singular purpose isn't necessarily an obvious advantage. We're better
> off with generic building blocks that our users can compose in ways that
> were originally non-obvious (and then as patterns emerge we can look at
> ways to simplify or formalise them).

In generic building blocks, we can not know user will create massive dicts
with same keys or just creating one copy.  We need to guess, and the guess
may be wrong.

Regards,

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Inada Naoki
On Wed, Apr 24, 2019 at 12:28 AM Steve Dower  wrote:
>
> >
> > But if the original dictionary wasn't created with shared keys... the
> > copy can't share them either.  Or are you suggesting adding new code to
> > create a shared key dictionary from one that isn't?
>
> This is a good point. Maybe dict.fromkeys() could do it? Or a
> sys.intern-like function (which is why I brought up that precedent). The
> point is to make it an optional benefit rather than strict
> language/library semantics.
>

Then, why not support values when creating key sharing dict?
That's one form of my proposal :)

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Steve Dower

On 23Apr2019 0008, Inada Naoki wrote:

On Tue, Apr 23, 2019 at 2:54 PM Steve Dower  wrote:


On 22Apr2019 2143, Inada Naoki wrote:

On Tue, Apr 23, 2019 at 11:30 AM Steve Dower  wrote:


Or possibly just "dict(existing_dict).update(new_items)".



Do you mean .update accepts values tuple?
I can't think it's


Not sure what you were going to go on to say here, but why not?


Sorry, I sent mail without finishing.

dict.update() has too many overloading.
Adding values_tuple is impossible without breaking backward compatibility.

But I think you're saying about items_sequence, not values.


Right. I'm specifically trying to avoid changing public APIs at all 
(including adding anything new, if possible) by identifying suitable 
patterns that we can handle specially to provide a transparent speed 
improvement.



If it's a key-sharing dict, then all the keys are strings. We know that
when we go to do the update, so we can intern all the strings (going to
do that anyway) and then it's a quick check if it already exists. If
it's a regular dict, then we calculate hashes as normal. Updating the
value is just a decref, incref and assignment.


There are some problem.

1. Searching hash table is not zero-cost, comparing to appending to sequence.
This cost is very near to building new hash tables.


If we know that you're sharing keys with the new items then we can skip 
the search. This was my point about the d2 = copy(d1); d2.update(zip(d2, 
values)) idea:


def update(self, items):
if isinstance(items, ZipObject): # whatever the type is called
if are_sharing_keys(self, items.sequence_1):
# fast update from iter(items.sequence_2)
return
# regular update from iter(items)

Totally transparent and encourages composition of existing builtins. 
It's a bit of a trick and may not be as obvious as a new method, but 
it's backwards compatible at least as far as ordered dicts (which is a 
requirement of any of these approaches anyway, yes?)



2. In my proposal, main user is csv.DictReader or sql.DictCursor.
They parse only values on each rows.   So they need to use map.


In that case, use a private helper. _csv already has a native module. We 
don't need to add new public APIs for internal optimisations, provided 
there is a semantically equivalent way to do it without the internal API.



3. (CPython only) dict.copy(), dict(dict), and dict.update() are general purpose
methods.  There is no obvious place to start using key-sharing dict.


See my reply to Glenn, but potentially fromkeys() could start with the 
key-sharing dict and then copy()/dict() could continue sharing it 
(hopefully they already do?).



That's why I proposed specific method / function for specific purpose.



Note that it .update() would probably require a dict or key/value tuples
here - but if you have the keys in a tuple already then zip() is going
to be good enough for setting it (in fact, zip(existing_dict,
new_values) should be fine, and we can internally special-case that
scenario, too).


If *CPython* specialized dict(zip(dict, values)),  it still be CPython
implementation detail.
Do you want recommend using such CPython hacky optimization?
Should we use such optimization in stdlib, even if it will be slower
than dict(zip(keys_tuple, values)) on some other Python implementations?


We do "hacky" optimisations everywhere :) The point of the runtime is to 
let users write code that works and we do the effort behind the scenes 
to make it efficient. We're not C - we're here to help our users.


The point is that it will work on other implementations - including 
previous versions of CPython - and those are free to optimise it however 
they like.



Or do you propose making dict(zip(dict, values)) optimization as
language specification?


Definitely not! It's just a pattern that we have the ability to 
recognize and optimize at runtime, so why not do it?



One obvious advantage of having DictBuilder is it is for specific
purpose.  It has at least same performance to dict(zip(keys, values))
on all Python implementations.
Libraries like csv parser can use it without worrying about its performance
on Python other than CPython.


A singular purpose isn't necessarily an obvious advantage. We're better 
off with generic building blocks that our users can compose in ways that 
were originally non-obvious (and then as patterns emerge we can look at 
ways to simplify or formalise them).



(Randomizing side note: is this scenario enough to make a case for a
built-in data frame type?)


https://xkcd.com/927/


Yep. The difference is that as the language team, our standard wins by 
default ;)


(For those who don't click links, it's pointing at the "let's make a new 
standard" XKCD comic)



* when you only d2.update existing keys, no need to rebuild the table
* a duplicated key overwrites multiple times - what else are you going
to do?


But all keys should be looked up.  It is very similar overhead to 

Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Steve Dower

On 23Apr2019 0034, Glenn Linderman wrote:

On 4/22/2019 10:59 PM, Steve Dower wrote:

On 22Apr2019 2119, Glenn Linderman wrote:
While Inada's suggested DictBuilder interface was immediately 
obvious, I don't get how either copy or update would achieve the 
goal. Perhaps you could explain? Particularly, what would be the 
trigger that would make dict() choose to create a shared key 
dictionary from the start? Unless it is known that there will be lots 
of (mostly static) dictionaries with the same set of keys at the time 
of creation of the first one, creating a shared key dictionary in 
every case would cause later inefficiencies in converting them, when 
additional items are added? (I'm assuming without knowledge that a 
single shared key dictionary is less efficient than a single regular 
dictionary.)


Passing a dictionary to the dict() constructor creates a copy of that 
dictionary (as does copy.copy()). What other trigger do you need to 
decide "it contains the same keys"? It's a copy of the original dict, 
so by definition at that point it may as well share its entire 
contents with the original.


But if the original dictionary wasn't created with shared keys... the 
copy can't share them either.  Or are you suggesting adding new code to 
create a shared key dictionary from one that isn't?


This is a good point. Maybe dict.fromkeys() could do it? Or a 
sys.intern-like function (which is why I brought up that precedent). The 
point is to make it an optional benefit rather than strict 
language/library semantics.


Is there a cost to using a key sharing dict that is prohibitive when the 
keys aren't actually being shared?


Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Glenn Linderman

On 4/22/2019 10:59 PM, Steve Dower wrote:

On 22Apr2019 2119, Glenn Linderman wrote:
While Inada's suggested DictBuilder interface was immediately 
obvious, I don't get how either copy or update would achieve the 
goal. Perhaps you could explain? Particularly, what would be the 
trigger that would make dict() choose to create a shared key 
dictionary from the start? Unless it is known that there will be lots 
of (mostly static) dictionaries with the same set of keys at the time 
of creation of the first one, creating a shared key dictionary in 
every case would cause later inefficiencies in converting them, when 
additional items are added? (I'm assuming without knowledge that a 
single shared key dictionary is less efficient than a single regular 
dictionary.)


Passing a dictionary to the dict() constructor creates a copy of that 
dictionary (as does copy.copy()). What other trigger do you need to 
decide "it contains the same keys"? It's a copy of the original dict, 
so by definition at that point it may as well share its entire 
contents with the original.


But if the original dictionary wasn't created with shared keys... the 
copy can't share them either.  Or are you suggesting adding new code to 
create a shared key dictionary from one that isn't?


Basically this is just a partial copy-on-write, where we copy values 
eagerly - since they're almost certainly going to change - and keys 
lazily - since there are known scenarios where they are not going to 
be changed, but we'll pay the cost later if it turns out they are.


Cheers,
Steve



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Inada Naoki
On Tue, Apr 23, 2019 at 2:54 PM Steve Dower  wrote:
>
> On 22Apr2019 2143, Inada Naoki wrote:
> > On Tue, Apr 23, 2019 at 11:30 AM Steve Dower  wrote:
> >>
> >> Or possibly just "dict(existing_dict).update(new_items)".
> >>
> >
> > Do you mean .update accepts values tuple?
> > I can't think it's
>
> Not sure what you were going to go on to say here, but why not?

Sorry, I sent mail without finishing.

dict.update() has too many overloading.
Adding values_tuple is impossible without breaking backward compatibility.

But I think you're saying about items_sequence, not values.

>
> If it's a key-sharing dict, then all the keys are strings. We know that
> when we go to do the update, so we can intern all the strings (going to
> do that anyway) and then it's a quick check if it already exists. If
> it's a regular dict, then we calculate hashes as normal. Updating the
> value is just a decref, incref and assignment.

There are some problem.

1. Searching hash table is not zero-cost, comparing to appending to sequence.
   This cost is very near to building new hash tables.

2. In my proposal, main user is csv.DictReader or sql.DictCursor.
   They parse only values on each rows.   So they need to use map.

3. (CPython only) dict.copy(), dict(dict), and dict.update() are general purpose
   methods.  There is no obvious place to start using key-sharing dict.

That's why I proposed specific method / function for specific purpose.

>
> Note that it .update() would probably require a dict or key/value tuples
> here - but if you have the keys in a tuple already then zip() is going
> to be good enough for setting it (in fact, zip(existing_dict,
> new_values) should be fine, and we can internally special-case that
> scenario, too).

If *CPython* specialized dict(zip(dict, values)),  it still be CPython
implementation detail.
Do you want recommend using such CPython hacky optimization?
Should we use such optimization in stdlib, even if it will be slower
than dict(zip(keys_tuple, values)) on some other Python implementations?

Or do you propose making dict(zip(dict, values)) optimization as
language specification?

One obvious advantage of having DictBuilder is it is for specific
purpose.  It has at least same performance to dict(zip(keys, values))
on all Python implementations.
Libraries like csv parser can use it without worrying about its performance
on Python other than CPython.


> I'd assumed the benefit was in memory usage after
> construction, rather than speed-to-construct, since everyone keeps
> talking about "key-sharing dictionaries" and not "arrays" ;)

Both is important.  I had talked about non key-sharing dict.

> (Randomizing side note: is this scenario enough to make a case for a
> built-in data frame type?)

https://xkcd.com/927/


> >> My primary concern is still to avoid making CPython performance
> >> characteristics part of the Python language definition. That only makes
> >> it harder for alternate implementations.
> >
> > Note that this proposal is not only for key sharing dict:
> >
> > * We can avoid rebuilding hash table again and again.
> > * We can avoid checking duplicated keys again and again.
> >
> > These characteristics are not only for Python, but for all mapping
> > implementations using hash table.
>
> I believe all of these are met by making d2=dict(d1) construct a dict d2
> that shares keys with d1 by default. Can you show how they are not?

If you want only copy, it's same.

>
> * when you only d2.update existing keys, no need to rebuild the table
> * a duplicated key overwrites multiple times - what else are you going
> to do?

But all keys should be looked up.  It is very similar overhead to rebuilding
hash table.

> This is already easiest, fastest, uses the least memory and is
> most consistent with every other form of setting dict items. Why
> complicate things by checking them? Let the caller do it

As I wrote above, it is:

* slower than my proposal.
* no obvious place to start using key sharing dict.


-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Steve Dower

On 22Apr2019 2119, Glenn Linderman wrote:
While Inada's suggested DictBuilder interface was immediately obvious, I 
don't get how either copy or update would achieve the goal. Perhaps you 
could explain? Particularly, what would be the trigger that would make 
dict() choose to create a shared key dictionary from the start? Unless 
it is known that there will be lots of (mostly static) dictionaries with 
the same set of keys at the time of creation of the first one, creating 
a shared key dictionary in every case would cause later inefficiencies 
in converting them, when additional items are added? (I'm assuming 
without knowledge that a single shared key dictionary is less efficient 
than a single regular dictionary.)


Passing a dictionary to the dict() constructor creates a copy of that 
dictionary (as does copy.copy()). What other trigger do you need to 
decide "it contains the same keys"? It's a copy of the original dict, so 
by definition at that point it may as well share its entire contents 
with the original.


Basically this is just a partial copy-on-write, where we copy values 
eagerly - since they're almost certainly going to change - and keys 
lazily - since there are known scenarios where they are not going to be 
changed, but we'll pay the cost later if it turns out they are.


Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-23 Thread Steve Dower

On 22Apr2019 2143, Inada Naoki wrote:

On Tue, Apr 23, 2019 at 11:30 AM Steve Dower  wrote:


Or possibly just "dict(existing_dict).update(new_items)".



Do you mean .update accepts values tuple?
I can't think it's


Not sure what you were going to go on to say here, but why not?

If it's a key-sharing dict, then all the keys are strings. We know that 
when we go to do the update, so we can intern all the strings (going to 
do that anyway) and then it's a quick check if it already exists. If 
it's a regular dict, then we calculate hashes as normal. Updating the 
value is just a decref, incref and assignment.


If not all these conditions are met, we convert to a regular dict. The 
proposed function was going to raise an error in this case, so all we've 
done is make it transparent. The biggest downside is now you don't get a 
warning that your preferred optimization isn't actually working when you 
pass in new_items with different keys from what were in existing_dict.


Note that it .update() would probably require a dict or key/value tuples 
here - but if you have the keys in a tuple already then zip() is going 
to be good enough for setting it (in fact, zip(existing_dict, 
new_values) should be fine, and we can internally special-case that 
scenario, too). I'd assumed the benefit was in memory usage after 
construction, rather than speed-to-construct, since everyone keeps 
talking about "key-sharing dictionaries" and not "arrays" ;)


(Randomizing side note: is this scenario enough to make a case for a 
built-in data frame type?)



My primary concern is still to avoid making CPython performance
characteristics part of the Python language definition. That only makes
it harder for alternate implementations.


Note that this proposal is not only for key sharing dict:

* We can avoid rebuilding hash table again and again.
* We can avoid checking duplicated keys again and again.

These characteristics are not only for Python, but for all mapping
implementations using hash table.


I believe all of these are met by making d2=dict(d1) construct a dict d2 
that shares keys with d1 by default. Can you show how they are not?


* when you only d2.update existing keys, no need to rebuild the table
* a duplicated key overwrites multiple times - what else are you going 
to do? This is already easiest, fastest, uses the least memory and is 
most consistent with every other form of setting dict items. Why 
complicate things by checking them? Let the caller do it


Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Inada Naoki
On Tue, Apr 23, 2019 at 11:30 AM Steve Dower  wrote:
>
> Or possibly just "dict(existing_dict).update(new_items)".
>

Do you mean .update accepts values tuple?
I can't think it's

> My primary concern is still to avoid making CPython performance
> characteristics part of the Python language definition. That only makes
> it harder for alternate implementations.

Note that this proposal is not only for key sharing dict:

* We can avoid rebuilding hash table again and again.
* We can avoid checking duplicated keys again and again.

These characteristics are not only for Python, but for all mapping
implementations using hash table.

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Glenn Linderman

On 4/22/2019 7:27 PM, Steve Dower wrote:

On 22Apr2019 1921, Steve Dower wrote:

On 22Apr2019 1822, Glenn Linderman wrote:
Inada is now proposing a way to allow the coder to suggest a group 
of dictionaries that might benefit from the same gains, by 
preclassifying non-__dict__ slot dictionaries to do similar sharing.


CSV reader is an exemplary candidate, because it creates groups of 
dicts that use the same keys. (column names). I have other code that 
does similar things, that would get similar benefits.


Seems like since it is just an interface to existing builtin code, 
that the one interface function (or dictionary factory class) could 
just as well be a builtin function, instead of requiring an import.


Sounds like a similar optimisation to sys.intern() is for strings.

I see no reason to try and avoid an import here - it's definitely a 
special-case situation - but otherwise having a function to say 
"clone and update this dict" that starts by sharing the keys in the 
same way that __dict__ does (including the transformation when 
necessary) seems like an okay addition. Maybe copy() could just be 
enabled for this?


Or possibly just "dict(existing_dict).update(new_items)".

My primary concern is still to avoid making CPython performance 
characteristics part of the Python language definition. That only 
makes it harder for alternate implementations. (Even though I was 
out-voted last time on this issue since all the publicly-known 
alternate implementations said it would be okay... I'm still going to 
put in a vote for avoiding new language semantics for the sake of a 
single runtime's performance characteristics.)


I note that dict() doesn't have a method to take two parallel iterables 
of keys/values and create a dict... if it did, that could be a trigger 
that a shared key dict might be appropriate... it seems more likely that 
data in that form is dealing with rows and columns, instead of the forms 
currently accepted by dict().


Perhaps an alternate constructor that took data in that form, AND 
defined an optional parameter to trigger a shared dict, would be a 
useful addition to the language.  Other implementations could ignore the 
optional parameter if they want, and the implementation would be a 
one-liner calling the current constructor and zip()ing the parameters.


The alternate constructor would be nice even if shared key dicts were 
not particularly needed in an application, and would provide a method of 
adding a trigger for the shared key optimization when appropriate.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Glenn Linderman

On 4/22/2019 7:27 PM, Steve Dower wrote:

On 22Apr2019 1921, Steve Dower wrote:

On 22Apr2019 1822, Glenn Linderman wrote:
Inada is now proposing a way to allow the coder to suggest a group 
of dictionaries that might benefit from the same gains, by 
preclassifying non-__dict__ slot dictionaries to do similar sharing.


CSV reader is an exemplary candidate, because it creates groups of 
dicts that use the same keys. (column names). I have other code that 
does similar things, that would get similar benefits.


Seems like since it is just an interface to existing builtin code, 
that the one interface function (or dictionary factory class) could 
just as well be a builtin function, instead of requiring an import.


Sounds like a similar optimisation to sys.intern() is for strings.

I see no reason to try and avoid an import here - it's definitely a 
special-case situation - but otherwise having a function to say 
"clone and update this dict" that starts by sharing the keys in the 
same way that __dict__ does (including the transformation when 
necessary) seems like an okay addition. Maybe copy() could just be 
enabled for this?


Or possibly just "dict(existing_dict).update(new_items)".

My primary concern is still to avoid making CPython performance 
characteristics part of the Python language definition. That only 
makes it harder for alternate implementations. (Even though I was 
out-voted last time on this issue since all the publicly-known 
alternate implementations said it would be okay... I'm still going to 
put in a vote for avoiding new language semantics for the sake of a 
single runtime's performance characteristics.)


While Inada's suggested DictBuilder interface was immediately obvious, I 
don't get how either copy or update would achieve the goal. Perhaps you 
could explain? Particularly, what would be the trigger that would make 
dict() choose to create a shared key dictionary from the start? Unless 
it is known that there will be lots of (mostly static) dictionaries with 
the same set of keys at the time of creation of the first one, creating 
a shared key dictionary in every case would cause later inefficiencies 
in converting them, when additional items are added? (I'm assuming 
without knowledge that a single shared key dictionary is less efficient 
than a single regular dictionary.)


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Steve Dower

On 22Apr2019 1921, Steve Dower wrote:

On 22Apr2019 1822, Glenn Linderman wrote:
Inada is now proposing a way to allow the coder to suggest a group of 
dictionaries that might benefit from the same gains, by preclassifying 
non-__dict__ slot dictionaries to do similar sharing.


CSV reader is an exemplary candidate, because it creates groups of 
dicts that use the same keys. (column names). I have other code that 
does similar things, that would get similar benefits.


Seems like since it is just an interface to existing builtin code, 
that the one interface function (or dictionary factory class) could 
just as well be a builtin function, instead of requiring an import.


Sounds like a similar optimisation to sys.intern() is for strings.

I see no reason to try and avoid an import here - it's definitely a 
special-case situation - but otherwise having a function to say "clone 
and update this dict" that starts by sharing the keys in the same way 
that __dict__ does (including the transformation when necessary) seems 
like an okay addition. Maybe copy() could just be enabled for this?


Or possibly just "dict(existing_dict).update(new_items)".

My primary concern is still to avoid making CPython performance 
characteristics part of the Python language definition. That only makes 
it harder for alternate implementations. (Even though I was out-voted 
last time on this issue since all the publicly-known alternate 
implementations said it would be okay... I'm still going to put in a 
vote for avoiding new language semantics for the sake of a single 
runtime's performance characteristics.)


Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Steve Dower

On 22Apr2019 1822, Glenn Linderman wrote:
Inada is now proposing a way to allow the coder to suggest a group of 
dictionaries that might benefit from the same gains, by preclassifying 
non-__dict__ slot dictionaries to do similar sharing.


CSV reader is an exemplary candidate, because it creates groups of dicts 
that use the same keys. (column names). I have other code that does 
similar things, that would get similar benefits.


Seems like since it is just an interface to existing builtin code, that 
the one interface function (or dictionary factory class) could just as 
well be a builtin function, instead of requiring an import.


Sounds like a similar optimisation to sys.intern() is for strings.

I see no reason to try and avoid an import here - it's definitely a 
special-case situation - but otherwise having a function to say "clone 
and update this dict" that starts by sharing the keys in the same way 
that __dict__ does (including the transformation when necessary) seems 
like an okay addition. Maybe copy() could just be enabled for this?


Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Glenn Linderman

On 4/22/2019 5:19 PM, Steven D'Aprano wrote:

Oh, you mean just like regular dicts with shared keys already do:-)

https://www.python.org/dev/peps/pep-0412/

Perhaps I've missed something in this discussion, but isn't this a
matter of just making the existing shared-keys functionality explicitly
usable rather than just purely implicit? Quoting from the PEP:

   When dictionaries are created to fill the __dict__ slot of an object,
   they are created in split form. The keys table is cached in the type,
   potentially allowing all attribute dictionaries of instances of one
   class to share keys. In the event of the keys of these dictionaries
   starting to diverge, individual dictionaries will lazily convert to
   the combined-table form.

There's no explicit interface to control this; it all happens by magic,
behind the scenes. I think the proposal here is to add some sort of
interface, possibly a new method, to explicitly use key sharing.


Thanks for the PEP reference; I'd forgotten some of the details, and 
hadn't yet gone to look them up.


Yes, it is all magic, but is only available for object __dict__ slot 
dicts. I'd forgotten that that was the "detection" mechanism.  In the 
general case, it would be too time-consuming to examine all existing 
dicts to discover some that might accidentally have the same keys, 
whereas Mark realized that objects very frequently have __dict__ slot 
dictionaries with the same keys, and were ripe for (significant memory 
and minor performance) optimization.


Inada is now proposing a way to allow the coder to suggest a group of 
dictionaries that might benefit from the same gains, by preclassifying 
non-__dict__ slot dictionaries to do similar sharing.


CSV reader is an exemplary candidate, because it creates groups of dicts 
that use the same keys. (column names). I have other code that does 
similar things, that would get similar benefits.


Seems like since it is just an interface to existing builtin code, that 
the one interface function (or dictionary factory class) could just as 
well be a builtin function, instead of requiring an import.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Steven D'Aprano
On Mon, Apr 22, 2019 at 10:06:20AM -0700, Chris Barker via Python-Dev wrote:

> maybe a new dict mapping type -- "shared_dict" -- it would be used in
> places like the csv reader where it makes sense, but wouldn't impact the
> regular dict at all.
> 
> you could get really clever an have it auto-convert to a regular dict when
> any changes were made that are incompatible with the shared keys...

Oh, you mean just like regular dicts with shared keys already do :-)

https://www.python.org/dev/peps/pep-0412/

Perhaps I've missed something in this discussion, but isn't this a 
matter of just making the existing shared-keys functionality explicitly 
usable rather than just purely implicit? Quoting from the PEP:

  When dictionaries are created to fill the __dict__ slot of an object, 
  they are created in split form. The keys table is cached in the type, 
  potentially allowing all attribute dictionaries of instances of one 
  class to share keys. In the event of the keys of these dictionaries 
  starting to diverge, individual dictionaries will lazily convert to 
  the combined-table form. 

There's no explicit interface to control this; it all happens by magic, 
behind the scenes. I think the proposal here is to add some sort of 
interface, possibly a new method, to explicitly use key sharing.


-- 
Steven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Glenn Linderman

On 4/22/2019 4:03 PM, Inada Naoki wrote:

On Tue, Apr 23, 2019 at 2:18 AM Chris Barker via Python-Dev
 wrote:

On Fri, Apr 12, 2019 at 10:20 AM Brett Cannon  wrote:



This doesn't strike me as needing an optimization through a dedicated method.

maybe a new dict mapping type -- "shared_dict" -- it would be used in places 
like the csv reader where it makes sense, but wouldn't impact the regular dict at all.

you could get really clever an have it auto-convert to a regular dict when any 
changes were made that are incompatible with the shared keys...


My current idea is adding builder in somewhere in stdlib (maybe collections?):

   builder = DictBuilder(keys_tuple)
   value = builder(values)  # repeatedly called.

I don't want to add new mapping type because we already have shared key dict,
and changing mapping type may cause backward compatibility problem.


Regards,
As a heavy user of some self-written code that does stuff very similar 
to csv reader, and creates lots of same-key dicts, I'd be supportive of 
a performance enhancing solution here, although I haven't done a 
detailed study of where the time is currently spent.


Is the problem that the existing shared key dict isn't always detected? 
Or just that knowing in advance that it is expected to be a shared key 
dict can save the detection work?


I do know that in my code, I have a complete list of keys and values 
when I create each dict, and would be happy to tweak it to use the most 
performance technique. The above looks like a nice interface, assuming 
that values is expected to be in the same iterable order as keys_tuple 
(but is there a need for keys_tuple to be a tuple? could it be any 
iterable?).
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Inada Naoki
On Tue, Apr 23, 2019 at 2:18 AM Chris Barker via Python-Dev
 wrote:
>
> On Fri, Apr 12, 2019 at 10:20 AM Brett Cannon  wrote:
>>>
>>>
>> This doesn't strike me as needing an optimization through a dedicated method.
>
> maybe a new dict mapping type -- "shared_dict" -- it would be used in places 
> like the csv reader where it makes sense, but wouldn't impact the regular 
> dict at all.
>
> you could get really clever an have it auto-convert to a regular dict when 
> any changes were made that are incompatible with the shared keys...


My current idea is adding builder in somewhere in stdlib (maybe collections?):

  builder = DictBuilder(keys_tuple)
  value = builder(values)  # repeatedly called.

I don't want to add new mapping type because we already have shared key dict,
and changing mapping type may cause backward compatibility problem.


Regards,
-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-22 Thread Chris Barker via Python-Dev
On Fri, Apr 12, 2019 at 10:20 AM Brett Cannon  wrote:

>
>> This doesn't strike me as needing an optimization through a dedicated
> method.
>

maybe a new dict mapping type -- "shared_dict" -- it would be used in
places like the csv reader where it makes sense, but wouldn't impact the
regular dict at all.

you could get really clever an have it auto-convert to a regular dict when
any changes were made that are incompatible with the shared keys...

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-12 Thread Brett Cannon
On Fri, Apr 12, 2019 at 8:35 AM Serhiy Storchaka 
wrote:

> 12.04.19 16:44, Inada Naoki пише:
> > When creating many dicts with same keys, dict need to
> > lookup internal hash table while inserting each keys.
> >
> > It is costful operation.  If we can reuse existing keys of dict,
> > we can skip this inserting cost.
> >
> > Additionally, we have "Key-Sharing Dictionary (PEP 412)".
> > When all keys are string, many dict can share one key.
> > It reduces memory consumption.
>
> It looks contrary to simplification made in Python 3 when we get rid of
> some more efficient lists in favor of more general iterators.
>
> If this is a common case


I think that "if" is my big sticking point. I don't think I've ever had a
need for this and the zip() solution was what I originally thought of when
I realized what the method was meant to do (which wasn't obvious to me
initially).

This doesn't strike me as needing an optimization through a dedicated
method.

-Brett


> we can add an invisible optimization for
> dict(zip(keys, values)), especially if keys is a key-sharing dictionary.
> This will benefit all users without the need to rewrite the code to use
> the new special method.
>
> The interface of dict is already overloaded. It contains many methods
> which most users use rarely (and therefore which are not kept in the
> working set of memory).
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/brett%40python.org
>
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-12 Thread Inada Naoki
On Sat, Apr 13, 2019 at 12:38 AM Serhiy Storchaka  wrote:
>
> It looks contrary to simplification made in Python 3 when we get rid of
> some more efficient lists in favor of more general iterators.
>

Yes.  This is API for special use case creates many dict having
same keys, like csv.DictReader.
It is not good design for general purpose.

strings module has strings.Template class.  But I don't want to add
dicts module.
Maybe, collections.DictBuilder may be another option.  e.g.

>>> from collections import DictBuilder
>>> builder = DictBuilder(tuple("abc"))
>>> builder.build(range(3))
{"a": 0, "b": 1, "c": 2}


> If this is a common case we can add an invisible optimization for
> dict(zip(keys, values)), especially if keys is a key-sharing dictionary.
> This will benefit all users without the need to rewrite the code to use
> the new special method.

But this optimization may slow down when creating one dict...

>
> The interface of dict is already overloaded. It contains many methods
> which most users use rarely (and therefore which are not kept in the
> working set of memory).

Yes.

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-12 Thread Inada Naoki
On Fri, Apr 12, 2019 at 11:31 PM Victor Stinner  wrote:
>
> Nice optimization! I have questions on the proposed API.
>
> > with_values(self, iterable, /)
> > Create a new dictionary with keys from this dict and values from 
> > iterable.
> >
> > When length of iterable is different from len(self), ValueError is 
> > raised.
> > This method does not support dict subclass.
>
> In short, mydict.with_values(values) behaves as
> dict(zip(mydict.keys(), values)), but is more efficient?

Yes. But unlike zip, keys() and values must have exactly same length.

>
> The method rely on the fact that dict is preserving key insertion order, 
> right?
>

Yes.

> > This might be usable for:
> >
> > * csv.DictReader
> > * namedtuple._asdict()
> > * DB-API 2.0 implementations:  (e.g. DictCursor of mysqlclient-python)
>
> I guess that a new dict constructor taken keys and values like
> dict.from_keys_and_values(keys, values) would work, but would not
> benefit from the dict key-sharing optimization?
>

I don't like more overloading.
And this API is specialized to build multiple dicts, not one dict.
So I want to have dedicated API for it.

> Would it be possible to implement the key-sharing optimization using a
> dict.from_keys_and_values(mydict.keys(), values) method: detect that
> keys are owned by a dict, and so create a new dict linked to the keys
> dict? A dict view contains a reference to the iterated dict
> (dictiterobject.di_dict).

I think it is possible.
>
> I'm fine with dict.with_values() API, but I'm asking if it could be
> written differently.
>
> Victor

I implemented it as instance method of dict
because it may modify the dict internally (at first invocation).

-- 
Inada Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-12 Thread Serhiy Storchaka

12.04.19 16:44, Inada Naoki пише:

When creating many dicts with same keys, dict need to
lookup internal hash table while inserting each keys.

It is costful operation.  If we can reuse existing keys of dict,
we can skip this inserting cost.

Additionally, we have "Key-Sharing Dictionary (PEP 412)".
When all keys are string, many dict can share one key.
It reduces memory consumption.


It looks contrary to simplification made in Python 3 when we get rid of 
some more efficient lists in favor of more general iterators.


If this is a common case we can add an invisible optimization for 
dict(zip(keys, values)), especially if keys is a key-sharing dictionary. 
This will benefit all users without the need to rewrite the code to use 
the new special method.


The interface of dict is already overloaded. It contains many methods 
which most users use rarely (and therefore which are not kept in the 
working set of memory).


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Proposal: dict.with_values(iterable)

2019-04-12 Thread Victor Stinner
Nice optimization! I have questions on the proposed API.

> with_values(self, iterable, /)
> Create a new dictionary with keys from this dict and values from iterable.
>
> When length of iterable is different from len(self), ValueError is raised.
> This method does not support dict subclass.

In short, mydict.with_values(values) behaves as
dict(zip(mydict.keys(), values)), but is more efficient?

The method rely on the fact that dict is preserving key insertion order, right?


Le ven. 12 avr. 2019 à 15:47, Inada Naoki  a écrit :
> This might be usable for:
>
> * csv.DictReader
> * namedtuple._asdict()
> * DB-API 2.0 implementations:  (e.g. DictCursor of mysqlclient-python)

I guess that a new dict constructor taken keys and values like
dict.from_keys_and_values(keys, values) would work, but would not
benefit from the dict key-sharing optimization?

Would it be possible to implement the key-sharing optimization using a
dict.from_keys_and_values(mydict.keys(), values) method: detect that
keys are owned by a dict, and so create a new dict linked to the keys
dict? A dict view contains a reference to the iterated dict
(dictiterobject.di_dict).

I'm fine with dict.with_values() API, but I'm asking if it could be
written differently.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com