Re: [Python-ideas] grouping / dict of lists

2018-06-30 Thread Chris Barker via Python-ideas
On Fri, Jun 29, 2018 at 10:53 AM, Michael Selik  wrote:

> I've drafted a PEP for an easier way to construct groups of elements from
> a sequence. https://github.com/selik/peps/blob/master/pep-.rst

I'm really warming to the:

Alternate: collections.Grouping

version -- I really like this as a kind of custom mapping, rather than
"just a function" (or alternate constructor) -- and I like your point that
it can have a bit of functionality built in other than on construction.

But I think it should be more like the other collection classes -- i.e. a
general purpose class that can be used for grouping, but also used more
general-purpose-y as well. That way people can do their "custom" stuff (key
function, etc.) with comprehensions.

The big differences are a custom __setitem__:

    def __setitem__(self, key, value):
        self.setdefault(key, []).append(value)

And the __init__ and update would take an iterable of (key, value) pairs,
rather than a single sequence.
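
A minimal sketch of such a class, assuming a plain dict subclass. This is an illustration only, not the enclosed implementation; the `__repr__` and the exact signatures of `__init__` and `update` are assumptions:

```python
class Grouping(dict):
    """Sketch: a dict of lists built from (key, value) pairs."""

    def __init__(self, pairs=()):
        super().__init__()
        self.update(pairs)

    def __setitem__(self, key, value):
        # Assignment appends to the key's group instead of replacing it.
        # dict.setdefault does not call back into this method.
        self.setdefault(key, []).append(value)

    def update(self, pairs=()):
        # Assumed signature: accepts an iterable of (key, value) pairs only.
        for key, value in pairs:
            self[key] = value

    def __repr__(self):
        return f"{type(self).__name__}({dict.__repr__(self)})"


g = Grouping((c.casefold(), c) for c in 'AbBa')
g['a'] = 'x'   # appends 'x' to the 'a' group
```

Because assignment appends, `g['a']` ends up holding three items here, while `g['b']` keeps the two it was built with.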

This would get away from the itertools.groupby approach, which I find kinda
awkward:

* How often do you have your data in a single sequence?

* Do you need your keys (and values!) to be sortable?

* Do we really want folks to have to be writing custom key functions and/or
lambdas for really simple stuff?

* and you may need to "transform" both your keys and values

I've enclosed an example implementation, borrowing heavily from Michael's
code.

The test code has a couple examples of use, but I'll put them here for the
sake of discussion.

Michael had:

Grouping('AbBa', key=str.casefold)

with my code, that would be:

Grouping(((c.casefold(), c) for c in 'AbBa'))

Note that the key function is applied outside the Grouping object, it
doesn't need to know anything about it -- and then users can use an
expression in a comprehension rather than a key function.

This looks a tad clumsier with my approach, but this is a pretty contrived
example -- in the more common case [*], you'd be writing a bunch of
lambdas, etc, and I'm not sure there is a way to get the values customized
as well, if you want that. (without applying a map later on)

Here is the example that the OP posted that kicked off this thread:

In [37]: student_school_list = [('Fred', 'SchoolA'),
    ...:                        ('Bob', 'SchoolB'),
    ...:                        ('Mary', 'SchoolA'),
    ...:                        ('Jane', 'SchoolB'),
    ...:                        ('Nancy', 'SchoolC'),
    ...:                        ]

In [38]: Grouping(((item[1], item[0]) for item in student_school_list))
Out[38]: Grouping({'SchoolA': ['Fred', 'Mary'],
   'SchoolB': ['Bob', 'Jane'],
   'SchoolC': ['Nancy']})

or

In [40]: Grouping((reversed(item) for item in student_school_list))
Out[40]: Grouping({'SchoolA': ['Fred', 'Mary'],
                   'SchoolB': ['Bob', 'Jane'],
                   'SchoolC': ['Nancy']})

(Note that if those keys and values didn't have to be reversed, you
could just pass the list in raw.)

I really like how I can use a generator expression and simple expressions
to transform the data in the way I need, rather than having to make key
functions.
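
To make that concrete, here is a small sketch in which both the key and the value are transformed inside the generator expression. The `group_pairs` helper is a hypothetical stand-in for the Grouping constructor so that the snippet runs on its own, and the shortened student list is only sample data:

```python
def group_pairs(pairs):
    """Hypothetical stand-in for Grouping: (key, value) pairs -> dict of lists."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


student_school_list = [('Fred', 'SchoolA'),
                       ('Bob', 'SchoolB'),
                       ('Mary', 'SchoolA')]

# Lowercase the school (key) and uppercase the student (value) in one
# expression; no key function or later mapping step is involved.
by_school = group_pairs((school.lower(), student.upper())
                        for student, school in student_school_list)
```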

And with Michael's approach, I think you'd need to call .map() after
generating the grouping -- a much clunkier way to do it. (and you'd get a
plain dict rather than a Grouping that you could add stuff to later...)

I'm sure there are ways to improve my code, and maybe Grouping isn't the
best name, but I think something like this would be a nice addition to the
collections module.

-CHB

[*] -- before making any decisions about the best API, it would probably be
a good idea to collect examples of the kind of data that people really do
need to group like this. Does it come in (key, value) pairs naturally? Or
in one big sequence with a key function that's easy to write? Who knows,
without examples of real-world use cases.

I will show one "real world" example here:

In my Python classes, I like to use Dave Thomas' trigrams "code kata":

http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/

A key piece of this is building up a data structure with word pairs, and a
list of all the words that follow the pair in a piece of text.

This is a nice exercise to help people think about how to use dicts, etc.
Currently the cleanest code uses .setdefault:

    word_pairs = {}
    # loop through the words
    # (rare case where using the index to loop is easiest)
    for i in range(len(words) - 2):  # minus 2, 'cause you need a pair
        pair = tuple(words[i:i + 2])  # a tuple so it can be a key in the dict
        follower = words[i + 2]
        word_pairs.setdefault(pair, []).append(follower)
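
For comparison, one alternative avoids the index by zipping the word list against itself shifted by one and by two positions, which yields (pair, follower) data directly. This is only a sketch, and the sample sentence is made up for illustration:

```python
words = "I wish I may I wish I might".split()

word_pairs = {}
# zip stops at the shortest input, so the last two words never start a pair.
for first, second, follower in zip(words, words[1:], words[2:]):
    word_pairs.setdefault((first, second), []).append(follower)
```

Each key is a two-word tuple and each value is the list of words seen after that pair, in order of appearance.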

if this were done with my Grouping class, it would be:

In [53]: word_pairs = Grouping()

In [54]: for i in range(len(words) - 2):
    ...:     pair = tuple(words[i:i + 2])  # a tuple so it can be a key in the dict
    ...:     follower = words[i + 2]
    ...:     word_pairs[pair] = follower
    ...:

In [55]: word_pairs

Re: [Python-ideas] grouping / dict of lists

2018-06-30 Thread Nick Coghlan
On 30 June 2018 at 16:25, Guido van Rossum  wrote:
> On Fri, Jun 29, 2018 at 3:23 PM Michael Selik  wrote:
>> I included an alternate solution of a new class, collections.Grouping,
>> which has some advantages. In addition to having less of that "heavy-handed"
>> feel to it, the class can have a few utility methods that help handle more
>> use cases.
>
>
> Hm, this actually feels heavier to me. But then again I never liked or
> understood the need for Counter -- I prefer basic data types and helper
> functions over custom abstractions. (Also your description doesn't do it
> justice, you describe a class using a verb phrase, "consume a sequence and
> construct a Mapping". The key to Grouping seems to me that it is a dict
> subclass with a custom constructor. But you don't explain why a subclass is
> needed, and in that sense I like the other approach better.)

I'm not sure if the draft was updated since you looked at it, but it
does mention that one benefit of the collections.Grouping approach is
being able to add native support for mapping a callable across every
individual item in the collection (ignoring the group structure), as
well as for applying aggregate functions to reduce the groups to
single values in a standard dict.
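
As a rough sketch of those two operations, written as plain functions over a dict of lists. The names `map_items` and `aggregate` are hypothetical, chosen for illustration, and are not taken from the draft:

```python
def map_items(groups, func):
    """Apply func to every individual item, keeping the group structure."""
    return {key: [func(item) for item in items]
            for key, items in groups.items()}


def aggregate(groups, func):
    """Reduce each group to a single value, returning a standard dict."""
    return {key: func(items) for key, items in groups.items()}


scores = {'SchoolA': [90, 70], 'SchoolB': [85]}   # sample data
curved = map_items(scores, lambda score: score + 5)
best = aggregate(scores, max)
sizes = aggregate(scores, len)
```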

Delegating those operations to the container API that way then means
that other libraries can expose classes that implement the grouping
API, but with a completely different backend storage model.

> But I still think it is much better off as a helper function in itertools.

I thought we actually had an open enhancement proposal for adding a
"defaultdict.freeze" operation that switched it over to raising
KeyError the same way a normal dict does, but I can't seem to find it
now.
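
A similar effect is available today without any new method: setting `default_factory` to None makes a defaultdict raise KeyError for missing keys. The sketch below shows that existing behavior, not the proposed `freeze` operation:

```python
from collections import defaultdict

groups = defaultdict(list)
groups['SchoolA'].append('Fred')

# With no factory, lookups of missing keys raise KeyError like a normal dict.
groups.default_factory = None

try:
    groups['SchoolZ']
except KeyError:
    missing_raises = True
else:
    missing_raises = False
```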

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Add a __cite__ method for scientific packages

2018-06-30 Thread Nick Coghlan
On 29 June 2018 at 12:14, Nathaniel Smith  wrote:
> On Thu, Jun 28, 2018 at 2:25 PM, Andrei Kucharavy wrote:
>> As for the list, reserving __citation__/__cite__ for packages, in the same
>> way that __version__ is now reserved, and adding a citation()/cite() function
>> to the standard library seemed like large enough modifications to warrant
>> seeking buy-in from the maintainers and the community at large.
>
> There isn't actually any formal method for registering special names
> like __version__, and they aren't treated specially by the language.
> They're just variables that happen to have a funny name. You shouldn't
> start using them willy-nilly, but you don't actually have to ask
> permission or anything.

The one caveat on dunder names is that we expressly exempt them from
our usual backwards compatibility guarantees, so it's worth getting
some level of "No, we're not going to do anything that would conflict
with your proposed convention" at the language design level.

> And it's not very likely that someone else
> will come along and propose using the name __citation__ for something
> that *isn't* a citation :-).

Aye, in this case I think you can comfortably assume that we'll
happily leave the "__citation__" and "__cite__" dunder names alone
unless/until there's a clear consensus in the scientific Python
community to use them a particular way.

And even then, it would likely be Python package installers like pip,
Python environment managers like pipenv, and data analysis environment
managers like conda that would handle the task of actually consuming
that metadata (in whatever form it may appear). Having your citation
management support depend on which version of Python you were using
seems like it would be mostly a source of pain rather than beneficial.
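
Since such a name would be an ordinary module attribute, a tool could read it with nothing more than `getattr`. A hedged sketch: `fake_pkg`, its citation text, and the `get_citation` helper are all made up for illustration, and no such convention is currently defined:

```python
import types

# Hypothetical package standing in for a real scientific library.
fake_pkg = types.ModuleType("fake_pkg")
fake_pkg.__version__ = "1.0"
fake_pkg.__citation__ = "placeholder citation text for fake_pkg"


def get_citation(module):
    """Return the module's __citation__ if it defines one, else None."""
    return getattr(module, "__citation__", None)


with_citation = get_citation(fake_pkg)
without_citation = get_citation(types)   # the stdlib module defines none
```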

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] grouping / dict of lists

2018-06-30 Thread Serhiy Storchaka

30.06.18 00:42, Guido van Rossum wrote:
> On a quick skim I see nothing particularly objectionable or
> controversial in your PEP, except I'm unclear why it needs to be a class
> method on `dict`. Adding something to a builtin like this is rather
> heavy-handed. Is there a really good reason why it can't be a function
> in `itertools`? (I don't think that it's relevant that it doesn't return
> an iterator -- it takes in an iterator.)


> Also, your pure-Python implementation appears to be O(N log N) if key is
> None but O(N) otherwise; and the version for key is None uses an extra
> temporary array of size N. Is that intentional?


And it adds a requirement that keys be orderable.

I think there should be two functions with different requirements: one for
hashable keys and one for orderable keys. The latter should return a list
of pairs, or a sorted dict if that were supported by the stdlib.
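
A sketch of what that split could look like, taking (key, value) pairs as input. The function names are hypothetical; the first variant needs hashable keys and runs in O(N), the second needs orderable keys and pays O(N log N) for the sort:

```python
from itertools import groupby
from operator import itemgetter


def group_hashable(pairs):
    """Keys must be hashable; returns a dict of lists."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def group_orderable(pairs):
    """Keys must be orderable; returns a key-sorted list of (key, values)."""
    ordered = sorted(pairs, key=itemgetter(0))   # stable sort on the key only
    return [(key, [value for _, value in run])
            for key, run in groupby(ordered, key=itemgetter(0))]


pairs = [('b', 1), ('a', 2), ('b', 3)]
as_dict = group_hashable(pairs)
as_list = group_orderable(pairs)
```

The second variant also accepts unhashable keys such as lists, as long as they can be compared with each other.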


I'm not sure they fit well in the itertools module. Maybe the proposed
algorithms module would be a better place. Or maybe just keep them as
recipes in the documentation (they are just a few lines). A concrete
implementation can be simpler than the general implementation.




Re: [Python-ideas] Fwd: collections.Counter should implement fromkeys

2018-06-30 Thread Tim Peters
[Abe Dillon]
> I haven't been part of the conversation for 15 years, but most of the
> arguments against the idea (yours especially) seem to focus on the prospect
> of a constructor war and imply that was the original motivation behind
> actively disabling the fromkeys method in Counters.

I quoted the source code verbatim - its comment said fromkeys() didn't make
sense for Counters.  From which it's an easy inference that it makes more
than one _kind_ of sense, hence "constructor wars".  Not that it matters.

Giving some of the history was more a matter of giving a plausible reason
for why you weren't getting all that much feedback:  it's quite possible
that most readers of this list didn't even remember that `dict.fromkeys()`
is a thing.
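
For readers in that position, a short reminder of the existing behavior under discussion: `dict.fromkeys()` builds a dict with one shared value per key, while `Counter.fromkeys()` is deliberately disabled and raises NotImplementedError.

```python
from collections import Counter

# Every key gets the same value.
zeros = dict.fromkeys('abc', 0)

# Counter disables the inherited classmethod.
try:
    Counter.fromkeys('abc', 0)
except NotImplementedError:
    counter_fromkeys_disabled = True
else:
    counter_fromkeys_disabled = False

# The supported spelling counts occurrences in an iterable.
counts = Counter('abca')
```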

> I don't mean to give the impression that I'm fanatical about this. It
> really is a minor inconvenience. It doesn't irk me nearly as much as other
> minor things, like the fact that all the functions in the heapq package
> begin with the redundant word 'heap'.

You have to blame Guido for that one, which is even more futile than
arguing with Raymond ;-)  It never much bothered me, but I do recall doing
this once:

from heapq import heappush as push, heappop as pop # etc


>> Raymond may have a different judgment about that, though.  I don't
>> believe he reads python-ideas anymore

> He actually did reply a few comments back!

Ya, I saw that!  He's always trying to make me look bad ;-)

> I think I'm having more fun chatting with people that I deeply respect
> than "jumping up and down". I'm sorry if I'm coming off as an asshole.

Not at all!  I've enjoyed your messages.  They have tended more to the
side of forceful advocacy than questioning, though, which may grate after a
few more years.  As to my "jumping up and down", I do a lot of
leg-pulling.  I'm old.  It's not meant to offend, but I'm too old to care
if it does :-)

> We can kill this thread if everyone thinks I'm wasting their time. It
> doesn't look like anyone else shares my minor annoyance. Thanks for
> indulging me!

Raymond's reply didn't leave any hope for adding Counter.fromkeys(), so in
the absence of a killer argument that hasn't yet been made, ya, it would be
prudent to move on.

Unless people want to keep talking about it, knowing that Raymond won't buy
it in the end.  Decisions, decisions ;-)


Re: [Python-ideas] grouping / dict of lists

2018-06-30 Thread Guido van Rossum
On Fri, Jun 29, 2018 at 3:23 PM Michael Selik  wrote:

> On Fri, Jun 29, 2018 at 2:43 PM Guido van Rossum  wrote:
>
>> On a quick skim I see nothing particularly objectionable or controversial
>> in your PEP, except I'm unclear why it needs to be a class method on `dict`.
>>
>
> Since it constructs a basic dict, I thought it belongs best as a dict
> constructor like dict.fromkeys. It seemed to match other classmethods like
> datetime.now.
>

It doesn't strike me as important enough. Surely not every stdlib function
that returns a fresh dict needs to be a class method on dict!


>> Adding something to a builtin like this is rather heavy-handed.
>>
>
> I included an alternate solution of a new class, collections.Grouping,
> which has some advantages. In addition to having less of that
> "heavy-handed" feel to it, the class can have a few utility methods that
> help handle more use cases.
>

Hm, this actually feels heavier to me. But then again I never liked or
understood the need for Counter -- I prefer basic data types and helper
functions over custom abstractions. (Also your description doesn't do it
justice, you describe a class using a verb phrase, "consume a sequence and
construct a Mapping". The key to Grouping seems to me that it is a dict
subclass with a custom constructor. But you don't explain why a subclass is
needed, and in that sense I like the other approach better.)

But I still think it is much better off as a helper function in itertools.
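
As a rough idea of how small such a helper could be, here is a sketch. The name `grouping` and its signature are assumptions for illustration, not the PEP's API; it requires hashable keys and makes a single O(N) pass:

```python
def grouping(iterable, key=None):
    """Hypothetical helper: group items into a dict of lists by key(item)."""
    groups = {}
    for item in iterable:
        group_key = item if key is None else key(item)
        groups.setdefault(group_key, []).append(item)
    return groups


by_letter = grouping('AbBa', key=str.casefold)
by_identity = grouping([1, 2, 1])
```

With `key=None` each item is grouped under itself, so no sorting and no temporary copy of the input is involved.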


>> Is there a really good reason why it can't be a function in `itertools`?
>> (I don't think that it's relevant that it doesn't return an iterator -- it
>> takes in an iterator.)
>>
>
> I considered placing it in the itertools module, but decided against
> because it doesn't return an iterator. I'm open to that if that's the
> consensus.
>

You'll never get consensus on anything here, but you have my blessing for
this without consensus.


>> Also, your pure-Python implementation appears to be O(N log N) if key is
>> None but O(N) otherwise; and the version for key is None uses an extra
>> temporary array of size N. Is that intentional?
>>
>
> Unintentional. I've been drafting pieces of this over the last year and
> wasn't careful enough with proofreading. I'll fix that momentarily...
>

Such are the dangers of premature optimization. :-)


>> Finally, the first example under "Group and Aggregate" is described as a
>> dict of sets but it actually returns a dict of (sorted) lists.
>>
>
> Doctest complained at the set ordering, so I sorted for printing. You're
> not the only one to make that point, so I'll use sets for the example and
> ignore doctest.
>
> Thanks for reading!
> -- Michael
>
> PS. I just pushed an update to the GitHub repo, as per these comments.
>

Good luck with your PEP. If it is to go into itertools the biggest hurdle
will be convincing Raymond, and I'm not going to overrule him on this: you
and he are the educators here so hopefully you two can agree.

--Guido


>
>
>> On Fri, Jun 29, 2018 at 10:54 AM Michael Selik  wrote:
>>
>>> Hello,
>>>
>>> I've drafted a PEP for an easier way to construct groups of elements
>>> from a sequence. https://github.com/selik/peps/blob/master/pep-.rst
>>>
>>> As a teacher, I've found that grouping is one of the most awkward tasks
>>> for beginners to learn in Python. While this proposal requires
>>> understanding a key-function, in my experience that's easier to teach than
>>> the nuances of setdefault or defaultdict. Defaultdict requires passing a
>>> factory function or class, similar to a key-function. Setdefault is
>>> awkwardly named and requires a discussion of references and mutability.
>>> Those topics are important and should be covered, but I'd like to let them
>>> sink in gradually. Grouping often comes up as a question on the first or
>>> second day, especially for folks transitioning from Excel.
>>>
>>> I've tested this proposal on actual students (no students were harmed
>>> during experimentation) and found that the majority appreciate it. Some are
>>> even able to guess what it does (would do) without any priming.
>>>
>>> Thanks for your time,
>>> -- Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 28, 2018 at 8:38 AM Michael Selik  wrote:
>>>
>>>> On Thu, Jun 28, 2018 at 8:25 AM Nicolas Rolin wrote:
>>>>
>>>>> I use list and dict comprehension a lot, and a problem I often have is
>>>>> to do the equivalent of a group_by operation (to use sql terminology).
>>>>>
>>>>> For example if I have a list of tuples (student, school) and I want to
>>>>> have the list of students by school the only option I'm left with is to
>>>>> write
>>>>>
>>>>> student_by_school = defaultdict(list)
>>>>> for student, school in student_school_list:
>>>>>     student_by_school[school].append(student)
>>>>>
>>>>
>>>> Thank you for bringing this up. I've been drafting a proposal for a
>>>> better grouping / group-by operation for a little while. I'm not quite
>>>> ready to share it, as I'm still