Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-31 Thread Ralf Gommers
On Sun, May 30, 2021 at 10:41 AM  wrote:

> >
> >
> > On Fri, May 28, 2021 at 4:58 PM  wrote:
> >
> > Hi all,
> >
> > Finding topk elements is widely used in several fields, but missed
> > in NumPy.
> > I implement this functionality named as  numpy.topk using core numpy
> > functions and open a PR:
> >
> > https://github.com/numpy/numpy/pull/19117
> > 
> >
> > Any discussion are welcome.
> >
> >
> > Thanks for the proposal Kang. I think this functionality is indeed a
> > fairly obvious gap in what Numpy offers, and would make sense to add.
> > A detailed comparison with other libraries would be very helpful here.
> > TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and
> > MXNet call it `topk`.
> >
> > Two things to look at in more detail here are:
> > 1. complete signatures of the function in each of those libraries, and
> > what the commonality is there.
> > 2. the argument Eric made on your PR about consistency with
> > sort/argsort, and if we want topk/argtopk? Also, do other libraries
> > have `argtopk`?
> >
> > Cheers,
> > Ralf
> >
> >
> > Best wishes,
> >
> > Kang Kai
> >
>
> Hi, Thanks for reply, I present some details below:
>

Thanks for the detailed investigation Kang!


>
> ## 1. complete signatures of the function in each of those libraries, and 
> what the commonality is there.
>
>
> | Library     | Name               | arg1  | arg2 | arg3 | arg4      | arg5   |
> |-------------|--------------------|-------|------|------|-----------|--------|
> | NumPy [1]   | numpy.topk         | a     | k    | axis | largest   | sorted |
> | PyTorch [2] | torch.topk         | input | k    | dim  | largest   | sorted |
> | R [3]       | topK               | x     | K    | /    | /         | /      |
> | MXNet [4]   | mxnet.npx.topk     | data  | k    | axis | is_ascend | /      |
> | CNTK [5]    | cntk.ops.top_k     | x     | k    | axis | /         | /      |
> | TF [6]      | tf.math.top_k      | input | k    | /    | /         | sorted |
> | Dask [7]    | dask.array.topk    | a     | k    | axis | -k        | /      |
> | Dask [8]    | dask.array.argtopk | a     | k    | axis | -k        | /      |
> | MATLAB [9]  | mink               | A     | k    | dim  | /         | /      |
> | MATLAB [10] | maxk               | A     | k    | dim  | /         | /      |
>
>
> | Library     | Name               | Returns                 |
> |-------------|--------------------|-------------------------|
> | NumPy [1]   | numpy.topk         | values, indices         |
> | PyTorch [2] | torch.topk         | values, indices         |
> | R [3]       | topK               | indices                 |
> | MXNet [4]   | mxnet.npx.topk     | controlled by `ret_typ` |
> | CNTK [5]    | cntk.ops.top_k     | values, indices         |
> | TF [6]      | tf.math.top_k      | values, indices         |
> | Dask [7]    | dask.array.topk    | values                  |
> | Dask [8]    | dask.array.argtopk | indices                 |
> | MATLAB [9]  | mink               | values, indices         |
> | MATLAB [10] | maxk               | values, indices         |
>
> - arg1: Input array.
> - arg2: Number of top elements to look for along the given axis.
> - arg3: Axis along which to find topk.
>   - R only supports vectors; TensorFlow only supports axis=-1.
> - arg4: Controls whether to return the k largest or smallest elements.
>   - R, CNTK and TensorFlow only return the k largest elements.
>   - In Dask, k can be negative, which means to return the k smallest elements.
>   - In MATLAB, two distinct functions are used.
> - arg5: If true, the resulting k elements will be sorted by value.
>   - R, MXNet, CNTK, Dask and MATLAB only return sorted elements.
>
> **Summary**:
> - Function Name: could be `topk`, `top_k`, `mink`/`maxk`.
> - arg1 (a), arg2 (k), arg3 (axis): should be required.
> - arg4 (largest), arg5 (sorted): might be discussed.
> - Returns: discussed below.
>
>
> ## 2. the argument Eric made on your PR about consistency with sort/argsort, 
> if we want topk/argtopk? Also, do other libraries have `argtopk`
>
> In most libraries, `topk` or `top_k` returns both values and indices, and
> `argtopk` is not included except for Dask. In addition, there is another
> inconsistency: `sort` returns ascending values, but `topk` returns
> descending values.
>
> ## Suggestions
> Finally, IMHO, new function signature might be designed as one of:
> I) use `topk` / `argtopk` or `top_k` / `argtop_k`
> ```python
> def topk(a, k, axis=-1, sorted=True) -> topk_values
> def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
> ```
> or
> ```python
> def top_k(a, k, axis=-1, sorted=True) -> topk_values
> def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
> ```
> where `k` can be negative which means to return k smallest elements.
>

I don't think I'm a fan of the `-k` cleverness. Saying you want `-5` values
as a 

Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-31 Thread Ralf Gommers
On Sun, May 30, 2021 at 10:01 AM Matti Picus  wrote:

>
>
> Did this function come up at all in the array-API consortium discussions?
>

It happens to be in this list of functions which was made last week:
https://github.com/data-apis/array-api/issues/187. That list contains potential
next candidates, based on them being implemented in most but not all
libraries. There was no real discussion on `topk` specifically though.

The current version of the array API standard basically contains
functionality that is either common to all libraries, or that NumPy has and
most other libraries have as well. Given how much harder it is to get
functions into NumPy than in other libraries, the "most libraries have it,
NumPy does not" set of functions was not investigated much yet. That's also
the reason NEP 47 doesn't have any new functions to be added to NumPy
except for `from_dlpack`, but only consistency changes like adding keepdims
keywords, stacking for linalg functions that are missing that, etc.

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-31 Thread Jonathan Fine
Here's my opinion, as a bit of an outsider. Mainly, I understand MAX to
mean the largest value in a finite totally ordered set. I understand TOP to
mean the 'best' member of a finite set.

For example, on a mountain each point has a HEIGHT. There will be a MAX
HEIGHT. The point(s) on the mountain that is the highest is the SUMMIT. Or
in other words the TOP of the mountain. Or another example, there are TOP
40 charts for music. https://www.officialcharts.com/

To summarize, use MAX for the largest value in a totally ordered set. Use
TOP when you have a height (or similar) function applied to an unordered
set. The highest temperature in 2021 will occur on the hottest day(s). One
is a temperature, the other a date.

I'm an outsider, and I've not made an effort to gain special knowledge
about the domain prior to posting this opinion. I hope it helps. Please
ignore it if it does not.

-- 
Jonathan


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Benjamin Root
to be honest, I read "topk" as "topeka", but I am weird. While numpy
doesn't use underscores all that much, I think this is one case where it
makes sense.

I'd also watch out for the use of the term "sorted", as it may mean
different things to different people, particularly with regard to what its
default value should be. I also find myself initially confused by the names
"largest" and "sorted", especially what they should mean with the "min-k"
behavior. I think Dask's use of negative k is very pythonic and would help
keep the namespace clean by avoiding the extra "min_k".

As for the indices, I am of two minds. On the one hand, I don't like
polluting the namespace with extra functions. On the other hand, having a
function that behaves differently based on a parameter is just fugly,
although we do have a function that does this - np.unique().
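
For reference, a small illustration of the `np.unique()` behaviour mentioned above: the return type switches from a single array to a tuple depending on a keyword flag. The sample array is made up; only documented NumPy arguments are used.

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])  # made-up sample data

# Default call: a single sorted array of the unique values.
values = np.unique(a)

# With return_index=True the same function returns a tuple instead:
# the unique values plus the index of each value's first occurrence.
values_again, first_idx = np.unique(a, return_index=True)

print(values)     # [1 2 3]
print(first_idx)  # [1 3 0]
```

A combined `topk` that returns either values or indices depending on a flag would behave in the same way.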

Ben Root

On Sun, May 30, 2021 at 8:22 AM Neal Becker  wrote:

> Topk is a bad choice imo.  I initially parsed it as to_pk, and had no idea
> what that was, although sounded a lot like a scipy signal function.
> Nlargest would be very obvious.
>
> On Sun, May 30, 2021, 7:50 AM Alan G. Isaac  wrote:
>
>> Mathematica and Julia both seem relevant here.
>> Mma has TakeLargest (and Wolfram tends to think hard about names).
>> https://reference.wolfram.com/language/ref/TakeLargest.html
>> Julia's closest comparable is perhaps partialsortperm:
>> https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
>> Alan Isaac
>>
>>
>>
>> On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:
>> > Hi, Thanks for reply, I present some details below:


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread kangkai
>
>
> On Fri, May 28, 2021 at 4:58 PM  wrote:
>
> Hi all,
>
> Finding topk elements is widely used in several fields, but missed
> in NumPy.
> I implement this functionality named as  numpy.topk using core numpy
> functions and open a PR:
>
> https://github.com/numpy/numpy/pull/19117
> 
>
> Any discussion are welcome.
>
>
> Thanks for the proposal Kang. I think this functionality is indeed a 
> fairly obvious gap in what Numpy offers, and would make sense to add. 
> A detailed comparison with other libraries would be very helpful here. 
> TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and 
> MXNet call it `topk`.
>
> Two things to look at in more detail here are:
> 1. complete signatures of the function in each of those libraries, and 
> what the commonality is there.
> 2. the argument Eric made on your PR about consistency with 
> sort/argsort, and if we want topk/argtopk? Also, do other libraries 
> have `argtopk`?
>
> Cheers,
> Ralf
>
>
> Best wishes,
>
> Kang Kai
>


Hi, thanks for the reply. I present some details below:


## 1. complete signatures of the function in each of those libraries, and what 
the commonality is there.


| Library     | Name               | arg1  | arg2 | arg3 | arg4      | arg5   |
|-------------|--------------------|-------|------|------|-----------|--------|
| NumPy [1]   | numpy.topk         | a     | k    | axis | largest   | sorted |
| PyTorch [2] | torch.topk         | input | k    | dim  | largest   | sorted |
| R [3]       | topK               | x     | K    | /    | /         | /      |
| MXNet [4]   | mxnet.npx.topk     | data  | k    | axis | is_ascend | /      |
| CNTK [5]    | cntk.ops.top_k     | x     | k    | axis | /         | /      |
| TF [6]      | tf.math.top_k      | input | k    | /    | /         | sorted |
| Dask [7]    | dask.array.topk    | a     | k    | axis | -k        | /      |
| Dask [8]    | dask.array.argtopk | a     | k    | axis | -k        | /      |
| MATLAB [9]  | mink               | A     | k    | dim  | /         | /      |
| MATLAB [10] | maxk               | A     | k    | dim  | /         | /      |



| Library     | Name               | Returns                 |
|-------------|--------------------|-------------------------|
| NumPy [1]   | numpy.topk         | values, indices         |
| PyTorch [2] | torch.topk         | values, indices         |
| R [3]       | topK               | indices                 |
| MXNet [4]   | mxnet.npx.topk     | controlled by `ret_typ` |
| CNTK [5]    | cntk.ops.top_k     | values, indices         |
| TF [6]      | tf.math.top_k      | values, indices         |
| Dask [7]    | dask.array.topk    | values                  |
| Dask [8]    | dask.array.argtopk | indices                 |
| MATLAB [9]  | mink               | values, indices         |
| MATLAB [10] | maxk               | values, indices         |


- arg1: Input array.
- arg2: Number of top elements to look for along the given axis.
- arg3: Axis along which to find topk.
  - R only supports vectors; TensorFlow only supports axis=-1.
- arg4: Controls whether to return the k largest or smallest elements.
  - R, CNTK and TensorFlow only return the k largest elements.
  - In Dask, k can be negative, which means to return the k smallest elements.
  - In MATLAB, two distinct functions are used.
- arg5: If true, the resulting k elements will be sorted by value.
  - R, MXNet, CNTK, Dask and MATLAB only return sorted elements.

**Summary**:
- Function Name: could be `topk`, `top_k`, `mink`/`maxk`.
- arg1 (a), arg2 (k), arg3 (axis): should be required.
- arg4 (largest), arg5 (sorted): might be discussed.
- Returns: discussed below.


## 2. the argument Eric made on your PR about consistency with sort/argsort, if 
we want topk/argtopk? Also, do other libraries have `argtopk`


In most libraries, `topk` or `top_k` returns both values and indices, and 
`argtopk` is not included except for Dask. In addition, there is another 
inconsistency: `sort` returns ascending values, but `topk` returns 
descending values.


## Suggestions
Finally, IMHO, new function signature might be designed as one of:
I) use `topk` / `argtopk` or `top_k` / `argtop_k`
```python
def topk(a, k, axis=-1, sorted=True) -> topk_values
def argtopk(a, k, axis=-1, sorted=True) -> topk_indices
```
or
```python
def top_k(a, k, axis=-1, sorted=True) -> topk_values
def argtop_k(a, k, axis=-1, sorted=True) -> topk_indices
```
where `k` can be negative which means to return k smallest elements.
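
To make option I concrete, here is a minimal sketch of how such a pair could be built from existing NumPy functions. It is not the code from the PR: the use of `np.argpartition`, the error handling, and the choice to return the largest values in descending order and the smallest in ascending order are all assumptions made for illustration.

```python
import numpy as np

def argtopk(a, k, axis=-1, sorted=True):
    """Indices of the k largest (k > 0) or |k| smallest (k < 0) entries.

    Illustrative sketch only; `sorted` shadows the builtin on purpose,
    to match the proposed keyword name.
    """
    a = np.asarray(a)
    n = a.shape[axis]
    count = abs(k)
    if count == 0 or count > n:
        raise ValueError("need 1 <= |k| <= a.shape[axis]")
    if k > 0:
        # After partitioning, the `count` largest entries sit in the last slots.
        part = np.argpartition(a, n - count, axis=axis)
        idx = np.take(part, np.arange(n - count, n), axis=axis)
    else:
        # After partitioning, the `count` smallest entries sit in the first slots.
        part = np.argpartition(a, count - 1, axis=axis)
        idx = np.take(part, np.arange(count), axis=axis)
    if sorted:
        vals = np.take_along_axis(a, idx, axis=axis)
        order = np.argsort(vals, axis=axis)
        if k > 0:
            order = np.flip(order, axis=axis)  # largest first
        idx = np.take_along_axis(idx, order, axis=axis)
    return idx

def topk(a, k, axis=-1, sorted=True):
    """Values corresponding to argtopk (same assumptions as above)."""
    a = np.asarray(a)
    return np.take_along_axis(a, argtopk(a, k, axis, sorted), axis=axis)

x = np.array([[1, 5, 3, 4],
              [9, 2, 8, 7]])
print(topk(x, 2))     # [[5 4] [9 8]]
print(argtopk(x, 2))  # [[1 3] [0 2]]
print(topk(x, -2))    # [[1 3] [2 7]]
```

With `sorted=False` the sketch skips the final `argsort`, so the k entries come back in whatever order the partition left them.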


II) use `maxk` / `argmaxk` or `max_k` / `argmax_k` (`mink` / `argmink` or 
`min_k` / `argmin_k`)
```python
def maxk(a, k, axis=-1, sorted=True) -> values
def argmaxk(a, k, axis=-1, sorted=True) -> indices


def mink(a, k, axis=-1, sorted=True) -> values
def argmink(a, k, axis=-1, sorted=True) -> indices
```
or
```python
def max_k(a, k, axis=-1, sorted=True) -> values
def argmax_k(a, k, axis=-1, 

Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Neal Becker
Topk is a bad choice imo.  I initially parsed it as to_pk, and had no idea
what that was, although it sounded a lot like a scipy signal function.
Nlargest would be very obvious.

On Sun, May 30, 2021, 7:50 AM Alan G. Isaac  wrote:

> Mathematica and Julia both seem relevant here.
> Mma has TakeLargest (and Wolfram tends to think hard about names).
> https://reference.wolfram.com/language/ref/TakeLargest.html
> Julia's closest comparable is perhaps partialsortperm:
> https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
> Alan Isaac
>
>
>
> On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:
> > Hi, Thanks for reply, I present some details below:


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Alan G. Isaac

Mathematica and Julia both seem relevant here.
Mma has TakeLargest (and Wolfram tends to think hard about names).
https://reference.wolfram.com/language/ref/TakeLargest.html
Julia's closest comparable is perhaps partialsortperm:
https://docs.julialang.org/en/v1/base/sort/#Base.Sort.partialsortperm
Alan Isaac



On 5/30/2021 4:40 AM, kang...@mail.ustc.edu.cn wrote:

Hi, Thanks for reply, I present some details below:



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Alan G. Isaac

Is there any thought of allowing for other comparisons?
In which case `last_k` might be preferable.
Alan Isaac

On 5/30/2021 2:38 AM, Ilhan Polat wrote:


I think "max_k" is a good generalization of the regular "max".



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Daniele Nicolodi
On 30/05/2021 00:48, Robert Kern wrote:
> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi wrote:
> 
> What does k stand for here? As someone that never encountered this
> function before I find both names equally confusing. If I understand
> what the function is supposed to be doing, I think largest() would be
> much more descriptive.
> 
> 
> `k` is the number of elements to return. `largest()` can connote that
> it's only returning the one largest value. It's fairly typical to
> include a dummy variable (`k` or `n`) in the name to indicate that the
> function lets you specify how many you want. See, for example, the
> stdlib `heapq` module's `nlargest()` function.

I thought that a `largest()` function with an integer second argument
could be self-explanatory enough. `nlargest()` would be much more
obvious to the wider audience, I think.

> https://docs.python.org/3/library/heapq.html#heapq.nlargest
> 
> 
> "top-k" comes from the ML community where this function is used to
> evaluate classification models (`k` instead of `n` being largely an
> accident of history, I imagine). In many classification problems, the
> number of classes is very large, and they are very related to each
> other. For example, ImageNet has a lot of different dog breeds broken
> out as separate classes. In order to get a more balanced view of the
> relative performance of the classification models, you often want to
> check whether the correct class is in the top 5 classes (or whatever `k`
> is appropriate) that the model predicted for the example, not just the
> one class that the model says is the most likely. "5 largest" doesn't
> really work in the sentences that one usually writes when talking about
> ML classifiers; they are talking about the 5 classes that are associated
> with the 5 largest values from the predictor, not the values themselves.
> So "top k" is what gets used in ML discussions, and that transfers over
> to the name of the function in ML libraries.
> 
> It is a top-down reflection of the higher level thing that people want
> to compute (in that context) rather than a bottom-up description of how
> the function is manipulating the input, if that makes sense. Either one
> is a valid way to name things. There is a lot to be said for numpy's
> domain-agnostic nature that we should prefer the bottom-up description
> style of naming. However, we are also in the midst of a diversifying
> ecosystem of array libraries, largely driven by the ML domain, and
> adopting some of that terminology when we try to enhance our
> interoperability with those libraries is also a factor to be considered.

I think that such a simple function should be named in the most obvious
way possible, or it will become a function that is used in the
domains where the unusual name makes sense, but ends up being
re-implemented in all other contexts. I am sure that if I had
been looking for a function that returns the N largest items in an array
(whether ranked according to a given key function or otherwise), I
would never have looked at a function named `topk()` or `top_k()`, and I
am pretty sure I would have discarded anything that has `k` or `top` in
its name.

On the other hand, I understand that ML is where all the hype (and a
large fraction of the money) is these days, so I understand if numpy
wants to appease the crowd.

Cheers,
Dan


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Matti Picus


On 29/5/21 5:28 pm, Ralf Gommers wrote:



On Fri, May 28, 2021 at 4:58 PM  wrote:


Hi all,

Finding topk elements is widely used in several fields, but missed
in NumPy.
I implement this functionality named as  numpy.topk using core numpy
functions and open a PR:

https://github.com/numpy/numpy/pull/19117


Any discussion are welcome.


Thanks for the proposal Kang. I think this functionality is indeed a 
fairly obvious gap in what Numpy offers, and would make sense to add. 
A detailed comparison with other libraries would be very helpful here. 
TensorFlow and JAX call this function `top_k`, while PyTorch, Dask and 
MXNet call it `topk`.


Two things to look at in more detail here are:
1. complete signatures of the function in each of those libraries, and 
what the commonality is there.
2. the argument Eric made on your PR about consistency with 
sort/argsort, and if we want topk/argtopk? Also, do other libraries 
have `argtopk`?


Cheers,
Ralf


Best wishes,

Kang Kai



Did this function come up at all in the array-API consortium discussions?

Matti



Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Ilhan Polat
After a coffee, I don't see the point of still calling it "k", so "max_n" is
my vote, for what it's worth.

On Sun, May 30, 2021 at 8:38 AM Ilhan Polat  wrote:

> Since this is going into the top namespace, I'd also vote against the
> matlab-y "topk" name. And even matlab didn't do what I would expect and
> went with maxk
>
> https://nl.mathworks.com/help/matlab/ref/maxk.html
>
> I think "max_k" is a good generalization of the regular "max". Even when
> auto-completing, this showing up under max makes sense to me instead of
> searching for it among the "t"s. Besides, "argmax_k" also follows suit with
> the previous convention. To my eyes this is an acceptable disturbance to an
> already very crowded namespace.
>
>
>
> a few moments later
>
> But then again an ugly idea rears its head proposing this going into the
> existing max function. But I'll shut up now :)
>
>
>
>
>
>
>
> On Sun, May 30, 2021 at 12:50 AM Robert Kern 
> wrote:
>
>> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi 
>> wrote:
>>
>>> What does k stand for here? As someone that never encountered this
>>> function before I find both names equally confusing. If I understand
>>> what the function is supposed to be doing, I think largest() would be
>>> much more descriptive.
>>>
>>
>> `k` is the number of elements to return. `largest()` can connote that
>> it's only returning the one largest value. It's fairly typical to include a
>> dummy variable (`k` or `n`) in the name to indicate that the function lets
>> you specify how many you want. See, for example, the stdlib `heapq`
>> module's `nlargest()` function.
>>
>> https://docs.python.org/3/library/heapq.html#heapq.nlargest
>>
>> "top-k" comes from the ML community where this function is used to
>> evaluate classification models (`k` instead of `n` being largely an
>> accident of history, I imagine). In many classification problems, the
>> number of classes is very large, and they are very related to each other.
>> For example, ImageNet has a lot of different dog breeds broken out as
>> separate classes. In order to get a more balanced view of the relative
>> performance of the classification models, you often want to check whether
>> the correct class is in the top 5 classes (or whatever `k` is appropriate)
>> that the model predicted for the example, not just the one class that the
>> model says is the most likely. "5 largest" doesn't really work in the
>> sentences that one usually writes when talking about ML classifiers; they
>> are talking about the 5 classes that are associated with the 5 largest
>> values from the predictor, not the values themselves. So "top k" is what
>> gets used in ML discussions, and that transfers over to the name of the
>> function in ML libraries.
>>
>> It is a top-down reflection of the higher level thing that people want to
>> compute (in that context) rather than a bottom-up description of how the
>> function is manipulating the input, if that makes sense. Either one is a
>> valid way to name things. There is a lot to be said for numpy's
>> domain-agnostic nature that we should prefer the bottom-up description
>> style of naming. However, we are also in the midst of a diversifying
>> ecosystem of array libraries, largely driven by the ML domain, and adopting
>> some of that terminology when we try to enhance our interoperability with
>> those libraries is also a factor to be considered.
>>
>> --
>> Robert Kern


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-30 Thread Ilhan Polat
Since this is going into the top namespace, I'd also vote against the matlab-y
"topk" name. And even matlab didn't do what I would expect and went with
maxk

https://nl.mathworks.com/help/matlab/ref/maxk.html

I think "max_k" is a good generalization of the regular "max". Even when
auto-completing, this showing up under max makes sense to me instead of
searching for it among the "t"s. Besides, "argmax_k" also follows suit with
the previous convention. To my eyes this is an acceptable disturbance to an
already very crowded namespace.



a few moments later

But then again an ugly idea rears its head proposing this going into the
existing max function. But I'll shut up now :)







On Sun, May 30, 2021 at 12:50 AM Robert Kern  wrote:

> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi 
> wrote:
>
>> What does k stand for here? As someone that never encountered this
>> function before I find both names equally confusing. If I understand
>> what the function is supposed to be doing, I think largest() would be
>> much more descriptive.
>>
>
> `k` is the number of elements to return. `largest()` can connote that it's
> only returning the one largest value. It's fairly typical to include a
> dummy variable (`k` or `n`) in the name to indicate that the function lets
> you specify how many you want. See, for example, the stdlib `heapq`
> module's `nlargest()` function.
>
> https://docs.python.org/3/library/heapq.html#heapq.nlargest
>
> "top-k" comes from the ML community where this function is used to
> evaluate classification models (`k` instead of `n` being largely an
> accident of history, I imagine). In many classification problems, the
> number of classes is very large, and they are very related to each other.
> For example, ImageNet has a lot of different dog breeds broken out as
> separate classes. In order to get a more balanced view of the relative
> performance of the classification models, you often want to check whether
> the correct class is in the top 5 classes (or whatever `k` is appropriate)
> that the model predicted for the example, not just the one class that the
> model says is the most likely. "5 largest" doesn't really work in the
> sentences that one usually writes when talking about ML classifiers; they
> are talking about the 5 classes that are associated with the 5 largest
> values from the predictor, not the values themselves. So "top k" is what
> gets used in ML discussions, and that transfers over to the name of the
> function in ML libraries.
>
> It is a top-down reflection of the higher level thing that people want to
> compute (in that context) rather than a bottom-up description of how the
> function is manipulating the input, if that makes sense. Either one is a
> valid way to name things. There is a lot to be said for numpy's
> domain-agnostic nature that we should prefer the bottom-up description
> style of naming. However, we are also in the midst of a diversifying
> ecosystem of array libraries, largely driven by the ML domain, and adopting
> some of that terminology when we try to enhance our interoperability with
> those libraries is also a factor to be considered.
>
> --
> Robert Kern


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-29 Thread Robert Kern
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi  wrote:

> What does k stand for here? As someone that never encountered this
> function before I find both names equally confusing. If I understand
> what the function is supposed to be doing, I think largest() would be
> much more descriptive.
>

`k` is the number of elements to return. `largest()` can connote that it's
only returning the one largest value. It's fairly typical to include a
dummy variable (`k` or `n`) in the name to indicate that the function lets
you specify how many you want. See, for example, the stdlib `heapq`
module's `nlargest()` function.

https://docs.python.org/3/library/heapq.html#heapq.nlargest
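
For readers unfamiliar with it, a short example of that stdlib function. The sample data is made up; `heapq.nlargest`, `heapq.nsmallest` and the `key` argument are the documented stdlib API.

```python
import heapq

scores = [0.1, 0.7, 0.2, 0.9, 0.4]  # made-up sample data

# The "n" in the name is the how-many argument, passed first.
largest = heapq.nlargest(2, scores)
smallest = heapq.nsmallest(2, scores)
print(largest)   # [0.9, 0.7]
print(smallest)  # [0.1, 0.2]

# A key function ranks items by something other than their own value;
# here it yields the positions of the two largest scores.
largest_idx = heapq.nlargest(2, range(len(scores)), key=scores.__getitem__)
print(largest_idx)  # [3, 1]
```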

"top-k" comes from the ML community where this function is used to evaluate
classification models (`k` instead of `n` being largely an accident of
history, I imagine). In many classification problems, the number of classes
is very large, and they are very related to each other. For example,
ImageNet has a lot of different dog breeds broken out as separate classes.
In order to get a more balanced view of the relative performance of the
classification models, you often want to check whether the correct class is
in the top 5 classes (or whatever `k` is appropriate) that the model
predicted for the example, not just the one class that the model says is
the most likely. "5 largest" doesn't really work in the sentences that one
usually writes when talking about ML classifiers; they are talking about
the 5 classes that are associated with the 5 largest values from the
predictor, not the values themselves. So "top k" is what gets used in ML
discussions, and that transfers over to the name of the function in ML
libraries.
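
As a rough illustration of that usage, here is a sketch of a top-k accuracy computation in plain NumPy. The function name `top_k_accuracy`, the toy scores and the labels are all made up for this example; it is not taken from any particular ML library.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    Hypothetical helper; assumes 1 <= k <= number of classes.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_classes = scores.shape[1]
    # Unordered indices of the k largest scores in each row.
    top = np.argpartition(scores, n_classes - k, axis=1)[:, n_classes - k:]
    hits = (top == labels[:, None]).any(axis=1)
    return hits.mean()

# Three examples, three classes (toy numbers).
scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 1, 2])

print(top_k_accuracy(scores, labels, k=1))  # one hit out of three
print(top_k_accuracy(scores, labels, k=2))  # two hits out of three
```

Note that only the class indices matter here, not the score values themselves, which is the point made above about "top k" versus "k largest".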

It is a top-down reflection of the higher level thing that people want to
compute (in that context) rather than a bottom-up description of how the
function is manipulating the input, if that makes sense. Either one is a
valid way to name things. There is a lot to be said for numpy's
domain-agnostic nature that we should prefer the bottom-up description
style of naming. However, we are also in the midst of a diversifying
ecosystem of array libraries, largely driven by the ML domain, and adopting
some of that terminology when we try to enhance our interoperability with
those libraries is also a factor to be considered.

-- 
Robert Kern


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-29 Thread Daniele Nicolodi
On 29/05/2021 18:33, David Menéndez Hurtado wrote:
> 
> 
> On Sat, 29 May 2021, 4:29 pm Ralf Gommers wrote:
> 
> 
> 
> On Fri, May 28, 2021 at 4:58 PM  wrote:
> 
> Hi all,
> 
> Finding topk elements is widely used in several fields, but
> missed in NumPy.
> I implement this functionality named as  numpy.topk using core numpy
> functions and open a PR:
> 
> https://github.com/numpy/numpy/pull/19117
> 
> 
> Any discussion are welcome.
> 
> 
> Thanks for the proposal Kang. I think this functionality is indeed a
> fairly obvious gap in what Numpy offers, and would make sense to
> add. A detailed comparison with other libraries would be very
> helpful here. TensorFlow and JAX call this function `top_k`, while
> PyTorch, Dask and MXNet call it `topk`.
> 
> 
> When I saw `topk` I initially parsed it as "to pk", similar to the
> current `tolist`. I think `top_k` is more explicit and clear.

What does k stand for here? As someone that never encountered this
function before I find both names equally confusing. If I understand
what the function is supposed to be doing, I think largest() would be
much more descriptive.

Cheers,
Dan


Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-29 Thread Ralf Gommers
On Fri, May 28, 2021 at 4:58 PM  wrote:

> Hi all,
>
> Finding topk elements is widely used in several fields, but missed in
> NumPy.
> I implement this functionality named as  numpy.topk using core numpy
> functions and open a PR:
>
> https://github.com/numpy/numpy/pull/19117
>
> Any discussion are welcome.
>

Thanks for the proposal Kang. I think this functionality is indeed a fairly
obvious gap in what Numpy offers, and would make sense to add. A detailed
comparison with other libraries would be very helpful here. TensorFlow and
JAX call this function `top_k`, while PyTorch, Dask and MXNet call it
`topk`.

Two things to look at in more detail here are:
1. complete signatures of the function in each of those libraries, and what
the commonality is there.
2. the argument Eric made on your PR about consistency with sort/argsort,
and if we want topk/argtopk? Also, do other libraries have `argtopk`?

Cheers,
Ralf



> Best wishes,
>
> Kang Kai


[Numpy-discussion] EHN: Discusions about 'add numpy.topk'

2021-05-28 Thread kangkai
Hi all,


Finding the top-k elements is widely used in several fields, but is missing
from NumPy. I have implemented this functionality as `numpy.topk` using core
NumPy functions and opened a PR:


https://github.com/numpy/numpy/pull/19117


Any discussion is welcome.
   
Best wishes,


Kang Kai