Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
On Mon, Jan 23, 2017 at 4:52 PM, aleba...@gmail.comwrote: > > > 2017-01-23 15:33 GMT+01:00 Robert Kern : > >> >> I don't object to some Notes, but I would probably phrase it more like we >> are providing the standard definition of the jargon term "sampling without >> replacement" in the case of non-uniform probabilities. To my mind (or more >> accurately, with my background), "replace=False" obviously picks out the >> implemented procedure, and I would have been incredibly surprised if it did >> anything else. If the option were named "unique=True", then I would have >> needed some more documentation to let me know exactly how it was >> implemented. >> >> FWIW, I totally agree with Robert > With my own background (MSc. in Mathematics), I agree that this algorithm is indeed the most natural one. And as I said, when I wanted to implement something myself when I wanted to choose random combinations (k out of n items), I wrote exactly the same one. But when it didn't produce the desired probabilities (even in cases where I knew that doing this was possible), I wrongly assumed numpy would do things differently - only to realize it uses exactly the same algorithm. So clearly, the documentation didn't quite explain what it does or doesn't do. Also, Robert, I'm curious: beyond explaining why the existing algorithm is reasonable (which I agree), could you give me an example of where it is actually *useful* for sampling? Let me give you an illustrative counter-example: Let's imagine a country that a country has 3 races: 40% Lilliputians, 40% Blefuscans, an 20% Yahoos (immigrants from a different section of the book ;-)). Gulliver wants to take a poll, and needs to sample people from all these races with appropriate proportions. These races live in different parts of town, so to pick a random person he needs to first pick one of the races and then a random person from that part of town. If he picks one respondent at a time, he uses numpy.random.choice(3, size=1,p=[0.4,0.4,0.2])) to pick the part of town, and then a person from that part - he gets the desired 40% / 40% / 20% division of races. Now imagine that Gulliver can interview two respondents each day, so he needs to pick two people each time. If he picks 2 choices of part-of-town *with* replacement, numpy.random.choice(3, size=2,p=[0.4,0.4,0.2]), that's also fine: he may need to take two people from the same part of town, or two from two different parts of town, but in any case will still get the desired 40% / 40% / 20% division between the races of the people he interviews. But consider that we are told that if two people from the same race meet in Gulliver's interview room, the two start chatting between themselves, and waste Gulliver's time. So he prefers to interview two people of *different* races. That's sampling without replacement. So he uses numpy.random.choice(size=2,p=[0.4,0.4,0.2],replace=False) to pick two different parts of town, and one person from each. But then he looks at his logs, and discovers he actually interviewed the races at 38% / 38% / 23% proportions - not the 40%/40%/20% he wanted. So the opinions of the Yahoos were over-counted in this poll! I know that this is a silly example (made even sillier by the names of races I used), but I wonder if you could give me an example where the current behavior of replace=False is genuinely useful. Not that I'm saying that fixing this problem is easy (I'm still struggling with it myself in the general case of size < n-1). Nadav. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
On Mon, Jan 23, 2017 at 5:47 PM, Robert Kernwrote: > > > As for the standardness of the definition: I don't know, have you a > reference where it is defined? More natural to me would be to have a list > of items with integer multiplicities (as in: "cat" 3 times, "dog" 1 time). > I'm hesitant to claim ours is a standard definition unless it's in a > textbook somewhere. But I don't insist on my phrasing. > > Textbook, I'm not so sure, but it is the *only* definition I've ever > encountered in the literature: > > http://epubs.siam.org/doi/abs/10.1137/0209009 > Very interesting. This paper (PDF available if you search for its name in Google) explicitly mentions one of the uses of this algorithm is "multistage sampling", which appears to be exactly the same thing as in the hypothetical Gulliver example I gave in my earlier mail. And yet, I showed in my mail that this algorithm does NOT reproduce the desired frequency of the different sampling units... Moreover, this paper doesn't explain why you need the "without replacement" for this use case (everything seems easier, and the desired probabilities are reproduced, with replacement). In my story I gave a funny excuse why "without replacement" might be warrented, but if you're interested I can tell you a bit about my actual use case, with a more serious reason why I want without replacement. > http://www.sciencedirect.com/science/article/pii/S002001900500298X > > -- > Robert Kern > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
On Wed, Jan 18, 2017 at 1:58 AM, aleba...@gmail.com <aleba...@gmail.com> wrote: > > > 2017-01-17 22:13 GMT+01:00 Nadav Har'El <n...@scylladb.com>: > >> >> On Tue, Jan 17, 2017 at 7:18 PM, aleba...@gmail.com <aleba...@gmail.com> >> wrote: >> >>> Hi Nadav, >>> >>> I may be wrong, but I think that the result of the current >>> implementation is actually the expected one. >>> Using you example: probabilities for item 1, 2 and 3 are: 0.2, 0.4 and >>> 0.4 >>> >>> P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2]) >>> >> >> Yes, this formula does fit well with the actual algorithm in the code. >> But, my question is *why* we want this formula to be correct: >> >> Just a note: this formula is correct and it is one of statistics > fundamental law: https://en.wikipedia.org/wiki/Law_of_total_probability + > https://en.wikipedia.org/wiki/Bayes%27_theorem > Hi, Yes, of course the formula is correct, but it doesn't mean we're not applying it in the wrong context. I'll be honest here: I came to numpy.random.choice after I actually coded a similar algorithm (with the same results) myself, because like you I thought this was the "obvious" and correct algorithm. Only then I realized that its output doesn't actually produce the desired probabilities specified by the user - even in the cases where that is possible. And I started wondering if existing libraries - like numpy - do this differently. And it turns out, numpy does it (basically) in the same way as my algorithm. > > Thus, the result we get from random.choice IMHO definitely makes sense. > Let's look at what the user asked this function, and what it returns: User asks: please give me random pairs of the three items, where item 1 has probability 0.2, item 2 has 0.4, and 3 has 0.4. Function returns: random pairs, where if you make many random returned results (as in the law of large numbers) and look at the items they contain, item 1 is 0.2333 of the items, item 2 is 0.38333, and item 3 is 0.38333. These are not (quite) the probabilities the user asked for... Can you explain a sense where the user's requested probabilities (0.2, 0.4, 0.4) are actually adhered in the results which random.choice returns? Thanks, Nadav Har'El. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
On Wed, Jan 18, 2017 at 11:00 AM, aleba...@gmail.comwrote: > Let's look at what the user asked this function, and what it returns: > >> >> User asks: please give me random pairs of the three items, where item 1 >> has probability 0.2, item 2 has 0.4, and 3 has 0.4. >> >> Function returns: random pairs, where if you make many random returned >> results (as in the law of large numbers) and look at the items they >> contain, item 1 is 0.2333 of the items, item 2 is 0.38333, and item 3 is >> 0.38333. >> These are not (quite) the probabilities the user asked for... >> >> Can you explain a sense where the user's requested probabilities (0.2, >> 0.4, 0.4) are actually adhered in the results which random.choice returns? >> > > I think that the question the user is asking by specifying p is a slightly > different one: > "please give me random pairs of the three items extracted from a > population of 3 items where item 1 has probability of being extracted of > 0.2, item 2 has 0.4, and 3 has 0.4. Also please remove extract items once > extracted." > You are right, if that is what the user wants, numpy.random.choice does the right thing. I'm just wondering whether this is actually what users want, and whether they understand this is what they are getting. As I said, I expected it to generate pairs with, empirically, the desired distribution of individual items. The documentation of numpy.random.choice seemed to me (wrongly) that it implis that that's what it does. So I was surprised to realize that it does not. Nadav. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
= 0.2 + 0.2 = 2*P1 of the trials, item 2 in [1,2]+[2,3] = >> 0.2+0.6 = 0.8 = 2*P2, etc. >> >> However, the algorithm in numpy.random.choice's replace=False generates, >> if >> I understand correctly, different probabilities for the outcomes: I >> believe >> in this case it generates [1,2] at probability 0.2, [1,3] also 0.2333, >> and [2,3] at probability 0.5. >> >> My question is how does this result fit the desired probabilities? >> >> If we get [1,2] at probability 0.2 and [1,3] at probability 0.2333, >> then the expect number of "1" results we'll get per drawing is 0.23333 + >> 0.2333 = 0.4, and similarly for "2" the expected number 0.7666, and >> for >> "3" 0.7. As you can see, the proportions are off: Item 2 is NOT twice >> common than item 1 as we originally desired (we asked for probabilities >> 0.2, 0.4, 0.4 for the individual items!). >> >> >> -- >> Nadav Har'El >> n...@scylladb.com >> -- next part -- >> An HTML attachment was scrubbed... >> URL: <https://mail.scipy.org/pipermail/numpy-discussion/attachmen >> ts/20170117/d1f0a1db/attachment-0001.html> >> >> -- >> >> Subject: Digest Footer >> >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >> -- >> >> End of NumPy-Discussion Digest, Vol 124, Issue 24 >> * >> > > > > -- > -- > NOTICE: Dlgs 196/2003 this e-mail and any attachments thereto may contain > confidential information and are intended for the sole use of the > recipient(s) named above. If you are not the intended recipient of this > message you are hereby notified that any dissemination or copying of this > message is strictly prohibited. If you have received this e-mail in error, > please notify the sender either by telephone or by e-mail and delete the > material from any computer. Thank you. > -- > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Question about numpy.random.choice with probabilties
Hi, I'm looking for a way to find a random sample of C different items out of N items, with a some desired probabilty Pi for each item i. I saw that numpy has a function that supposedly does this, numpy.random.choice (with replace=False and a probabilities array), but looking at the algorithm actually implemented, I am wondering in what sense are the probabilities Pi actually obeyed... To me, the code doesn't seem to be doing the right thing... Let me explain: Consider a simple numerical example: We have 3 items, and need to pick 2 different ones randomly. Let's assume the desired probabilities for item 1, 2 and 3 are: 0.2, 0.4 and 0.4. Working out the equations there is exactly one solution here: The random outcome of numpy.random.choice in this case should be [1,2] at probability 0.2, [1,3] at probabilty 0.2, and [2,3] at probability 0.6. That is indeed a solution for the desired probabilities because it yields item 1 in [1,2]+[1,3] = 0.2 + 0.2 = 2*P1 of the trials, item 2 in [1,2]+[2,3] = 0.2+0.6 = 0.8 = 2*P2, etc. However, the algorithm in numpy.random.choice's replace=False generates, if I understand correctly, different probabilities for the outcomes: I believe in this case it generates [1,2] at probability 0.2, [1,3] also 0.2333, and [2,3] at probability 0.5. My question is how does this result fit the desired probabilities? If we get [1,2] at probability 0.2 and [1,3] at probability 0.2333, then the expect number of "1" results we'll get per drawing is 0.2 + 0.2333 = 0.4, and similarly for "2" the expected number 0.7666, and for "3" 0.7. As you can see, the proportions are off: Item 2 is NOT twice common than item 1 as we originally desired (we asked for probabilities 0.2, 0.4, 0.4 for the individual items!). -- Nadav Har'El n...@scylladb.com ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about numpy.random.choice with probabilties
On Wed, Jan 18, 2017 at 4:30 PM,wrote: > > > Having more sampling schemes would be useful, but it's not possible to >> implement sampling schemes with impossible properties. >> >> > > BTW: sampling 3 out of 3 without replacement is even worse > > No matter what sampling scheme and what selection probabilities we use, we > always have every element with probability 1 in the sample. > I agree. The random-sample function of the type I envisioned will be able to reproduce the desired probabilities in some cases (like the example I gave) but not in others. Because doing this correctly involves a set of n linear equations in comb(n,k) variables, it can have no solution, or many solutions, depending on the n and k, and the desired probabilities. A function of this sort could return an error if it can't achieve the desired probabilities. But in many cases (the 0.2, 0.4, 0.4 example I gave was just something random I tried) there will be a way to achieve exactly the desired distribution. I guess I'll need to write this new function myself :-) Because my use case definitely requires that the output of the random items produced matches the required probabilities (when possible). Thanks, Nadav. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion