Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Robert Kern
On Mon, Jan 23, 2017 at 9:41 AM, Nadav Har'El  wrote:
>
> On Mon, Jan 23, 2017 at 4:52 PM, aleba...@gmail.com 
wrote:
>>
>> 2017-01-23 15:33 GMT+01:00 Robert Kern :
>>>
>>> I don't object to some Notes, but I would probably phrase it more like
we are providing the standard definition of the jargon term "sampling
without replacement" in the case of non-uniform probabilities. To my mind
(or more accurately, with my background), "replace=False" obviously picks
out the implemented procedure, and I would have been incredibly surprised
if it did anything else. If the option were named "unique=True", then I
would have needed some more documentation to let me know exactly how it was
implemented.
>>>
>> FWIW, I totally agree with Robert
>
> With my own background (an MSc. in Mathematics), I agree that this
algorithm is indeed the most natural one. And as I said, when I wanted to
implement something myself to choose random combinations (k out of n
items), I wrote exactly the same one. But when it didn't produce the
desired probabilities (even in cases where I knew that doing so was
possible), I wrongly assumed numpy would do things differently - only to
realize it uses exactly the same algorithm. So clearly, the documentation
didn't quite explain what it does or doesn't do.

In my experience, I have seen "without replacement" mean only one thing. If
the docstring had said "returns unique items", I'd agree that it doesn't
explain what it does or doesn't do. The only issue is that "without
replacement" is jargon, and it is good to recapitulate the definitions of
such terms for those who aren't familiar with them.

> Also, Robert, I'm curious: beyond explaining why the existing algorithm
is reasonable (with which I agree), could you give me an example of where
it is actually *useful* for sampling?

The references I previously quoted list a few. One is called "multistage
sampling proportional to size". The idea is that you draw (without
replacement) from the larger units (say, congressional districts) before
sampling within them. It is similar to the situation you outline, but it is
probably more useful at a different scale, with lots of larger units (where
your algorithm is likely to provide no solution) rather than a handful.
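
For concreteness, here is a minimal sketch of that kind of two-stage draw
(the unit counts and sizes below are made up purely for illustration):

import numpy as np

np.random.seed(0)

# Hypothetical primary units (say, districts) and their population sizes.
sizes = np.array([5000, 3000, 1500, 500])
p = sizes / sizes.sum()

# Stage 1: draw two distinct districts, with probability proportional to
# size (this is where choice(..., replace=False, p=p) comes in).
districts = np.random.choice(len(sizes), size=2, replace=False, p=p)

# Stage 2: a simple random sample of 10 individuals within each chosen
# district.
samples = {d: np.random.choice(sizes[d], size=10, replace=False)
           for d in districts}
print(districts, samples)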

It is probably less useful in terms of survey design, where you are trying
to *design* a process to get a result, than it is in queueing theory and
related fields, where you are trying to *describe* and simulate a process
that is pre-defined.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Nadav Har'El
On Mon, Jan 23, 2017 at 5:47 PM, Robert Kern  wrote:

>
> > As for the standardness of the definition: I don't know, have you a
> reference where it is defined? More natural to me would be to have a list
> of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
> I'm hesitant to claim ours is a standard definition unless it's in a
> textbook somewhere. But I don't insist on my phrasing.
>
> Textbook, I'm not so sure, but it is the *only* definition I've ever
> encountered in the literature:
>
> http://epubs.siam.org/doi/abs/10.1137/0209009
>

Very interesting. This paper (a PDF is available if you search for its name
in Google) explicitly mentions that one of the uses of this algorithm is
"multistage sampling", which appears to be exactly the same thing as in the
hypothetical Gulliver example I gave in my earlier mail.

And yet, I showed in my mail that this algorithm does NOT reproduce the
desired frequency of the different sampling units...

Moreover, this paper doesn't explain why you need the "without replacement"
for this use case (everything seems easier, and the desired probabilities
are reproduced, with replacement).
In my story I gave a funny excuse for why "without replacement" might be
warranted, but if you're interested I can tell you a bit about my actual
use case, with a more serious reason why I want sampling without replacement.


> http://www.sciencedirect.com/science/article/pii/S002001900500298X
>
> --
> Robert Kern
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Nadav Har'El
On Mon, Jan 23, 2017 at 4:52 PM, aleba...@gmail.com 
wrote:

>
>
> 2017-01-23 15:33 GMT+01:00 Robert Kern :
>
>>
>> I don't object to some Notes, but I would probably phrase it more like we
>> are providing the standard definition of the jargon term "sampling without
>> replacement" in the case of non-uniform probabilities. To my mind (or more
>> accurately, with my background), "replace=False" obviously picks out the
>> implemented procedure, and I would have been incredibly surprised if it did
>> anything else. If the option were named "unique=True", then I would have
>> needed some more documentation to let me know exactly how it was
>> implemented.
>>
> FWIW, I totally agree with Robert
>

With my own background (an MSc. in Mathematics), I agree that this
algorithm is indeed the most natural one. And as I said, when I wanted to
implement something myself to choose random combinations (k out of n
items), I wrote exactly the same one. But when it didn't produce the
desired probabilities (even in cases where I knew that doing so was
possible), I wrongly assumed numpy would do things differently - only to
realize it uses exactly the same algorithm. So clearly, the documentation
didn't quite explain what it does or doesn't do.

Also, Robert, I'm curious: beyond explaining why the existing algorithm is
reasonable (with which I agree), could you give me an example of where it
is actually *useful* for sampling?

Let me give you an illustrative counter-example:

Let's imagine a country that has 3 races: 40% Lilliputians, 40%
Blefuscans, and 20% Yahoos (immigrants from a different section of the book
;-)).
Gulliver wants to take a poll, and needs to sample people from all these
races with appropriate proportions.

These races live in different parts of town, so to pick a random person he
needs to first pick one of the races and then a random person from that
part of town.

If he picks one respondent at a time, he uses numpy.random.choice(3,
size=1, p=[0.4,0.4,0.2]) to pick the part of town, and then a person from
that part - he gets the desired 40% / 40% / 20% division of races.

Now imagine that Gulliver can interview two respondents each day, so he
needs to pick two people each time. If he picks the two parts of town
*with* replacement, using numpy.random.choice(3, size=2, p=[0.4,0.4,0.2]),
that's also fine: he may need to take two people from the same part of
town, or two from two different parts of town, but in any case he will
still get the desired 40% / 40% / 20% division between the races of the
people he interviews.

But now suppose we are told that if two people of the same race meet in
Gulliver's interview room, the two start chatting between themselves and
waste Gulliver's time. So he prefers to interview two people of *different*
races. That's sampling without replacement. So he uses
numpy.random.choice(3, size=2, p=[0.4,0.4,0.2], replace=False) to pick two
different parts of town, and one person from each.
But then he looks at his logs, and discovers he actually interviewed the
races at 38% / 38% / 23% proportions - not the 40%/40%/20% he wanted.
So the opinions of the Yahoos were over-counted in this poll!
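
Here is a minimal simulation sketch of this scenario, for anyone who wants
to check the numbers (the exact per-interview frequencies work out to
roughly 38.3% / 38.3% / 23.3%):

import numpy as np

np.random.seed(0)
counts = np.zeros(3)
n_days = 50000
for _ in range(n_days):
    # two *different* parts of town per day, as in the story above
    pair = np.random.choice(3, size=2, replace=False, p=[0.4, 0.4, 0.2])
    counts[pair] += 1

print(counts / counts.sum())
# roughly [0.383, 0.383, 0.233] rather than the desired [0.4, 0.4, 0.2]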

I know that this is a silly example (made even sillier by the names of
races I used), but I wonder if you could give me an example where the
current behavior of replace=False is genuinely useful.

Not that I'm saying that fixing this problem is easy (I'm still struggling
with it myself in the general case of size < n-1).
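
For concreteness, here is a minimal sketch (my illustration only) of the
comb(n,k)-unknown linear system mentioned earlier in the thread, for the
n=3, k=2 case above, where a non-negative solution happens to exist:

import itertools
import numpy as np

n, k = 3, 2
p = np.array([0.4, 0.4, 0.2])                # desired per-draw frequencies
subsets = list(itertools.combinations(range(n), k))  # comb(n, k) unknowns

# One equation per item i: the total probability of the subsets containing
# i must equal its desired inclusion probability, k * p[i].
A = np.array([[i in s for s in subsets] for i in range(n)], dtype=float)
b = k * p

# The system happens to be square and invertible here; the general case
# needs least squares or a linear program, and may have no valid solution.
q = np.linalg.solve(A, b)
for s, v in zip(subsets, q):
    print(s, round(float(v), 3))
# (0, 1) 0.6
# (0, 2) 0.2
# (1, 2) 0.2   (all non-negative, so a scheme with the desired inclusion
#               probabilities exists for this particular p and k)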

Nadav.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Robert Kern
On Mon, Jan 23, 2017 at 9:22 AM, Anne Archibald 
wrote:
>
>
> On Mon, Jan 23, 2017 at 3:34 PM Robert Kern  wrote:
>>
>> I don't object to some Notes, but I would probably phrase it more like
we are providing the standard definition of the jargon term "sampling
without replacement" in the case of non-uniform probabilities. To my mind
(or more accurately, with my background), "replace=False" obviously picks
out the implemented procedure, and I would have been incredibly surprised
if it did anything else. If the option were named "unique=True", then I
would have needed some more documentation to let me know exactly how it was
implemented.
>
>
> It is what I would have expected too, but we have a concrete example of a
user who expected otherwise; where one user speaks up, there are probably
more who didn't (some of whom probably have code that's not doing what they
think it does). So for the cost of adding a Note, why not help some of them?

That's why I said I'm fine with adding a Note. I'm just suggesting a
re-wording so that the cautious language doesn't lead anyone who is
familiar with the jargon to think we're doing something ad hoc while still
providing the details for those who aren't so familiar.

> As for the standardness of the definition: I don't know, have you a
reference where it is defined? More natural to me would be to have a list
of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
I'm hesitant to claim ours is a standard definition unless it's in a
textbook somewhere. But I don't insist on my phrasing.

Textbook, I'm not so sure, but it is the *only* definition I've ever
encountered in the literature:

http://epubs.siam.org/doi/abs/10.1137/0209009
http://www.sciencedirect.com/science/article/pii/S002001900500298X

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Anne Archibald
On Mon, Jan 23, 2017 at 3:34 PM Robert Kern  wrote:

> I don't object to some Notes, but I would probably phrase it more like we
> are providing the standard definition of the jargon term "sampling without
> replacement" in the case of non-uniform probabilities. To my mind (or more
> accurately, with my background), "replace=False" obviously picks out the
> implemented procedure, and I would have been incredibly surprised if it did
> anything else. If the option were named "unique=True", then I would have
> needed some more documentation to let me know exactly how it was
> implemented.
>

It is what I would have expected too, but we have a concrete example of a
user who expected otherwise; where one user speaks up, there are probably
more who didn't (some of whom probably have code that's not doing what they
think it does). So for the cost of adding a Note, why not help some of them?

As for the standardness of the definition: I don't know, have you a
reference where it is defined? More natural to me would be to have a list
of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time).
I'm hesitant to claim ours is a standard definition unless it's in a
textbook somewhere. But I don't insist on my phrasing.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread aleba...@gmail.com
2017-01-23 15:33 GMT+01:00 Robert Kern :

> On Mon, Jan 23, 2017 at 6:27 AM, Anne Archibald 
> wrote:
> >
> > On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El  wrote:
> >>
> >> On Wed, Jan 18, 2017 at 4:30 PM,  wrote:
> >>>
>  Having more sampling schemes would be useful, but it's not possible
> to implement sampling schemes with impossible properties.
> >>>
> >>> BTW: sampling 3 out of 3 without replacement is even worse
> >>>
> >>> No matter what sampling scheme and what selection probabilities we
> use, we always have every element with probability 1 in the sample.
> >>
> >> I agree. The random-sample function of the type I envisioned will be
> able to reproduce the desired probabilities in some cases (like the example
> I gave) but not in others. Because doing this correctly involves a set of n
> linear equations in comb(n,k) variables, it can have no solution, or many
> solutions, depending on the n and k, and the desired probabilities. A
> function of this sort could return an error if it can't achieve the desired
> probabilities.
> >
> > It seems to me that the basic problem here is that the
> numpy.random.choice docstring fails to explain what the function actually
> does when called with weights and without replacement. Clearly there are
> different expectations; I think numpy.random.choice chose one that is easy
> to explain and implement but not necessarily what everyone expects. So the
> docstring should be clarified. Perhaps a Notes section:
> >
> > When numpy.random.choice is called with replace=False and non-uniform
> probabilities, the resulting distribution of samples is not obvious.
> numpy.random.choice effectively follows the procedure: when choosing the
> kth element in a set, the probability of element i occurring is p[i]
> divided by the total probability of all not-yet-chosen (and therefore
> eligible) elements. This approach is always possible as long as the sample
> size is no larger than the population, but it means that the probability
> that element i occurs in the sample is not exactly p[i].
>
> I don't object to some Notes, but I would probably phrase it more like we
> are providing the standard definition of the jargon term "sampling without
> replacement" in the case of non-uniform probabilities. To my mind (or more
> accurately, with my background), "replace=False" obviously picks out the
> implemented procedure, and I would have been incredibly surprised if it did
> anything else. If the option were named "unique=True", then I would have
> needed some more documentation to let me know exactly how it was
> implemented.
>
FWIW, I totally agree with Robert




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Robert Kern
On Mon, Jan 23, 2017 at 6:27 AM, Anne Archibald 
wrote:
>
> On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El  wrote:
>>
>> On Wed, Jan 18, 2017 at 4:30 PM,  wrote:
>>>
 Having more sampling schemes would be useful, but it's not possible to
implement sampling schemes with impossible properties.
>>>
>>> BTW: sampling 3 out of 3 without replacement is even worse
>>>
>>> No matter what sampling scheme and what selection probabilities we use,
we always have every element with probability 1 in the sample.
>>
>> I agree. The random-sample function of the type I envisioned will be
able to reproduce the desired probabilities in some cases (like the example
I gave) but not in others. Because doing this correctly involves a set of n
linear equations in comb(n,k) variables, it can have no solution, or many
solutions, depending on the n and k, and the desired probabilities. A
function of this sort could return an error if it can't achieve the desired
probabilities.
>
> It seems to me that the basic problem here is that the
numpy.random.choice docstring fails to explain what the function actually
does when called with weights and without replacement. Clearly there are
different expectations; I think numpy.random.choice chose one that is easy
to explain and implement but not necessarily what everyone expects. So the
docstring should be clarified. Perhaps a Notes section:
>
> When numpy.random.choice is called with replace=False and non-uniform
probabilities, the resulting distribution of samples is not obvious.
numpy.random.choice effectively follows the procedure: when choosing the
kth element in a set, the probability of element i occurring is p[i]
divided by the total probability of all not-yet-chosen (and therefore
eligible) elements. This approach is always possible as long as the sample
size is no larger than the population, but it means that the probability
that element i occurs in the sample is not exactly p[i].

I don't object to some Notes, but I would probably phrase it more like we
are providing the standard definition of the jargon term "sampling without
replacement" in the case of non-uniform probabilities. To my mind (or more
accurately, with my background), "replace=False" obviously picks out the
implemented procedure, and I would have been incredibly surprised if it did
anything else. If the option were named "unique=True", then I would have
needed some more documentation to let me know exactly how it was
implemented.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread Anne Archibald
On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El  wrote:

> On Wed, Jan 18, 2017 at 4:30 PM,  wrote:
>
>
>
> Having more sampling schemes would be useful, but it's not possible to
> implement sampling schemes with impossible properties.
>
>
>
> BTW: sampling 3 out of 3 without replacement is even worse
>
> No matter what sampling scheme and what selection probabilities we use, we
> always have every element with probability 1 in the sample.
>
>
> I agree. The random-sample function of the type I envisioned will be able
> to reproduce the desired probabilities in some cases (like the example I
> gave) but not in others. Because doing this correctly involves a set of n
> linear equations in comb(n,k) variables, it can have no solution, or many
> solutions, depending on the n and k, and the desired probabilities. A
> function of this sort could return an error if it can't achieve the desired
> probabilities.
>

It seems to me that the basic problem here is that the numpy.random.choice
docstring fails to explain what the function actually does when called with
weights and without replacement. Clearly there are different expectations;
I think numpy.random.choice chose one that is easy to explain and implement
but not necessarily what everyone expects. So the docstring should be
clarified. Perhaps a Notes section:

When numpy.random.choice is called with replace=False and non-uniform
probabilities, the resulting distribution of samples is not obvious.
numpy.random.choice effectively follows the procedure: when choosing the
kth element in a set, the probability of element i occurring is p[i]
divided by the total probability of all not-yet-chosen (and therefore
eligible) elements. This approach is always possible as long as the sample
size is no larger than the population, but it means that the probability
that element i occurs in the sample is not exactly p[i].
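
For readers who prefer code, here is a short pure-Python sketch that is
equivalent in distribution to the procedure described above (an
illustration, not the actual implementation):

import numpy as np

def sketch_choice_without_replacement(n, size, p):
    # At each step, draw one of the not-yet-chosen elements with
    # probability proportional to its weight among the remaining elements.
    p = np.array(p, dtype=float)
    chosen = []
    for _ in range(size):
        q = p / p.sum()            # renormalize over the eligible elements
        i = np.random.choice(n, p=q)
        chosen.append(i)
        p[i] = 0.0                 # element i is no longer eligible
    return np.array(chosen)

print(sketch_choice_without_replacement(3, 2, [0.4, 0.4, 0.2]))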

Anne

>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy 1.11.3, scipy 0.18.1, MSVC 2015 and crashes in complex functions

2017-01-23 Thread David Cournapeau
Indeed. I wrongly assumed that since Gohlke's wheels did not crash, they
did not run into that issue.

That sounds like an ABI issue, since I suspect the Intel math library
supports C99 complex numbers. I will add info on that issue then.

David

On Mon, Jan 23, 2017 at 11:46 AM, Evgeni Burovski <
evgeny.burovs...@gmail.com> wrote:

> Related to https://github.com/scipy/scipy/issues/6336?
> On 23.01.2017 at 14:40, "David Cournapeau" wrote:
>
>> Hi there,
>>
>> While building the latest scipy on top of numpy 1.11.3, I have noticed
>> crashes while running the scipy test suite, in scipy.special (e.g. in the
>> scipy.special hyp0f1 test). This only happens on Windows for Python 3.5
>> (where we use the MSVC 2015 compiler).
>>
>> Applying some violence to distutils, I re-built numpy/scipy with debug
>> symbols, and the debugger claims that the crashes happen inside the
>> scipy.special ufunc Cython code, when calling clog or csqrt. I first
>> suspected a compiler bug, but disabling those functions in numpy, to
>> force using our own versions in npymath, made the problem go away.
>>
>> I am a bit suspicious about the whole thing as neither conda's nor
>> Gohlke's wheels crashed. Has anybody else encountered this?
>>
>> David
>>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy 1.11.3, scipy 0.18.1, MSVC 2015 and crashes in complex functions

2017-01-23 Thread Evgeni Burovski
Related to https://github.com/scipy/scipy/issues/6336?
On 23.01.2017 at 14:40, "David Cournapeau" wrote:

> Hi there,
>
> While building the latest scipy on top of numpy 1.11.3, I have noticed
> crashes while running the scipy test suite, in scipy.special (e.g. in the
> scipy.special hyp0f1 test). This only happens on Windows for Python 3.5
> (where we use the MSVC 2015 compiler).
>
> Applying some violence to distutils, I re-built numpy/scipy with debug
> symbols, and the debugger claims that the crashes happen inside the
> scipy.special ufunc Cython code, when calling clog or csqrt. I first
> suspected a compiler bug, but disabling those functions in numpy, to force
> using our own versions in npymath, made the problem go away.
>
> I am a bit suspicious about the whole thing as neither conda's nor
> Gohlke's wheels crashed. Has anybody else encountered this?
>
> David
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Numpy 1.11.3, scipy 0.18.1, MSVC 2015 and crashes in complex functions

2017-01-23 Thread David Cournapeau
Hi there,

While building the latest scipy on top of numpy 1.11.3, I have noticed
crashes while running the scipy test suite, in scipy.special (e.g. in the
scipy.special hyp0f1 test). This only happens on Windows for Python 3.5
(where we use the MSVC 2015 compiler).

Applying some violence to distutils, I re-built numpy/scipy with debug
symbols, and the debugger claims that the crashes happen inside the
scipy.special ufunc Cython code, when calling clog or csqrt. I first
suspected a compiler bug, but disabling those functions in numpy, to force
using our own versions in npymath, made the problem go away.

I am a bit suspicious about the whole thing as neither conda's nor
Gohlke's wheels crashed. Has anybody else encountered this?

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion