Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Nathaniel Smith
On Tue, May 17, 2016 at 10:41 AM, Robert Kern  wrote:
> On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith  wrote:
>>
>> On May 17, 2016 1:50 AM, "Robert Kern"  wrote:
>> >
>> [...]
>> > What you want is a function that returns many RandomState objects that
>> > are hopefully spread around the MT19937 space enough that they are
>> > essentially independent (in the absence of true jumpahead). The better
>> > implementation of such a function would look something like this:
>> >
>> > def spread_out_prngs(n, root_prng=None):
>> >     if root_prng is None:
>> >         root_prng = np.random
>> >     elif not isinstance(root_prng, np.random.RandomState):
>> >         root_prng = np.random.RandomState(root_prng)
>> >     sprouted_prngs = []
>> >     for i in range(n):
>> >         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>> >         sprouted_prngs.append(np.random.RandomState(seed_array))
>> >     return sprouted_prngs
>>
>> Maybe a nice way to encapsulate this in the RandomState interface would be
>> a method RandomState.random_state() that generates and returns a new child
>> RandomState.
>
> I disagree. This is a workaround in the absence of proper jumpahead or
> guaranteed-independent streams. I would not encourage it.
>
>> > Internally, this generates seed arrays of about the size of the MT19937
>> > state to make sure that you can access more of the state space. That will
>> > at least make the chance of collision tiny. And it can be easily rewritten
>> > to take advantage of one of the newer PRNGs that have true independent
>> > streams:
>> >
>> >   https://github.com/bashtage/ng-numpy-randomstate
>>
>> ... But unfortunately I'm not sure how to make my interface suggestion
>> above work on top of one of these RNGs, because for RandomState.random_state
>> you really want a tree of independent RNGs and the fancy new PRNGs only
>> provide a single flat namespace :-/. And even more annoyingly, the tree API
>> is actually a nicer API, because with a flat namespace you have to know up
>> front about all possible RNGs your code will use, which is an unfortunate
>> global coupling that makes it difficult to compose programs out of
>> independent pieces, while the RandomState.random_state approach composes
>> beautifully. Maybe there's some clever way to allocate a 64-bit namespace to
>> make it look tree-like? I'm not sure 64 bits is really enough...
>
> MT19937 doesn't have a "tree" any more than the others. It's the same flat
> state space. You are just getting the illusion of a tree by hoping that you
> never collide. You ought to think about precisely the same global coupling
> issues with MT19937 as you do with guaranteed-independent streams.
> Hope-and-prayer isn't really a substitute for properly engineering your
> problem. It's just a moral hazard to promote this method to the main API.

Nonsense.

If your definition of "hope and prayer" includes assuming that we
won't encounter a random collision in a 2**19937 state space, then
literally all engineering is hope-and-prayer. A collision could
happen, but if it does it's overwhelmingly more likely to happen
because of a flaw in the mathematical analysis, or a bug in the
implementation, or because random quantum fluctuations caused you and
your program to suddenly be transported to a parallel world where 1 +
1 = 1, than that you just got unlucky with your random state. And all
of these hazards apply equally to both MT19937 and more modern PRNGs.

...anyway, the real reason I'm a bit grumpy is because there are solid
engineering reasons why users *want* this API, so whether or not it
turns out to be possible I think we should at least be allowed to have
a discussion about whether there's some way to give it to them. It's
not even 100% out of the question that we conclude that existing PRNGs
are buggy because they don't take this use case into account -- it
would be far from the first time that numpy found itself going beyond
the limits of older numerical tools that weren't designed to build the
kind of large composable systems that numpy gets used for.

MT19937's state space is large enough that you could explicitly encode
a "tree seed" into it, even if you don't trust the laws of probability
-- e.g., you start with a RandomState with id [], then its children
have id [0], [1], [2], ..., and their children have ids [0, 0], [0,
1], ..., [1, 0], ..., and you write these into the state (probably
after sending them through some bit-mixing permutation), to guarantee
non-collision.
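
For illustration, a minimal sketch of that encoding (a hypothetical helper;
path entries are assumed to fit in uint32, and a real version would still
need to check that the seeding routine itself keeps distinct arrays
distinct):

import numpy as np

def tree_seeded_state(path):
    # Encode the tree id injectively: length first, then the entries, so
    # [] -> [0], [0] -> [1, 0], [0, 1] -> [2, 0, 1] can never clash. A real
    # implementation would also apply a bit-mixing permutation here.
    seed_array = np.array([len(path)] + list(path), dtype=np.uint32)
    return np.random.RandomState(seed_array)

root = tree_seeded_state([])            # id []
child = tree_seeded_state([1])          # id [1]
grandchild = tree_seeded_state([1, 0])  # id [1, 0]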

Or if you do trust the laws of probability, then the
randomly-sample-a-PRNG approach is not 100% out of the question even
for more modern PRNGs. For example, let's take a PRNG with a 64-bit
stream id and a 64-bit state space per stream. Suppose that we know
that in our application each PRNG will be used to draw 2**48 64-bit
samples (= 2 pebibytes of random data), and that we will use 2**32
PRNGs (= total of 8 yobibytes of random data). If w
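
For figures like these, the standard sanity check is the birthday bound; a
quick sketch under the stated assumptions (2**32 streams drawn uniformly
from a 64-bit id space):

import math

k = 2**32   # number of streams in the scenario above
N = 2**64   # size of the 64-bit stream-id space
# Birthday bound: P(at least one collision) ~= 1 - exp(-k*(k-1)/(2*N))
print(1 - math.exp(-k * (k - 1) / (2 * N)))   # ~0.39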

Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Sturla Molden
Matěj Týč  wrote:

> Does it mean
> that if you pass the numpy array to the child process using Queue, no
> significant amount of data will flow through it? 

This is what my shared memory arrays do.

> Or I shouldn't pass it
> using Queue at all and just rely on inheritance? 

This is what David Baddeley's shared memory arrays do.

> Finally, I assume that
> passing it as an argument to the Process class is the worst option,
> because it will be pickled and unpickled.
 
My shared memory arrays only pickle the metadata, and can be used in this
way.

> Or maybe you refer to modules such as joblib that use this functionality
> and expose only a nice interface?

Joblib creates "shared memory" by memory-mapping a temporary file, which is
backed by RAM on Linux (tmpfs). It is backed by a physical file on disk on
Mac and Windows. In this respect, joblib is much better on Linux than on
Mac or Windows.
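
A bare-bones sketch of that approach (not joblib's actual code; names and
sizes are made up), using numpy.memmap on a temporary file as the shared
backing store:

import os
import tempfile
from multiprocessing import Process

import numpy as np

def worker(path, shape):
    # Re-open the same backing file; the OS maps the same pages, so the
    # array data never travels through a pipe or pickle.
    shared = np.memmap(path, dtype=np.float64, mode="r+", shape=shape)
    shared *= 2  # writes are visible to the parent, too

if __name__ == "__main__":
    tmp = tempfile.NamedTemporaryFile(delete=False)
    a = np.memmap(tmp.name, dtype=np.float64, mode="w+", shape=(1000,))
    a[:] = np.arange(1000.0)
    a.flush()
    p = Process(target=worker, args=(tmp.name, a.shape))
    p.start(); p.join()
    print(a[:3])  # [0. 2. 4.] -- the child's writes came back through the map
    os.unlink(tmp.name)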


> And finally, COW means that returning large arrays still involves data
> moving between processes, whereas the shm approach has the workaround
> that you can preallocate the result array in the parent process, and
> the worker process can write to it.

My shared memory arrays need no workaround for this. They also allow shared
memory arrays to be returned to the parent process. No preallocation is
needed.


Sturla



Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Matěj Týč
On 17.5.2016 14:13, Sturla Molden wrote:

> Matěj Týč  wrote:
>
>>  - Parallel processing of HUGE data, and
> This is mainly a Windows problem, as copy-on-write fork() will solve this
> on any other platform. ...
That sounds interesting, could you elaborate on it a bit? Does it mean
that if you pass the numpy array to the child process using Queue, no
significant amount of data will flow through it? Or I shouldn't pass it
using Queue at all and just rely on inheritance? Finally, I assume that
passing it as an argument to the Process class is the worst option,
because it will be pickled and unpickled.

Or maybe you refer to modules such as joblib that use this functionality
and expose only a nice interface?
And finally, COW means that returning large arrays still involves data
moving between processes, whereas the shm approach has the workaround
that you can preallocate the result array in the parent process, and
the worker process can write to it.
> What this means is that shared memory is seldom useful for sharing huge
> data, even on Windows. It is only useful for this on Unix/Linux, where base
> addresses can stay the same. But on non-Windows platforms, the COW will in
> 99.99% of the cases be sufficient, thus making shared memory superfluous
> anyway. We don't need shared memory to scatter large data on Linux, only
> fork.
I am actually quite comfortable with sharing numpy arrays only. It is a
nice format for sharing large amounts of numbers, which is what I want
and what many modules accept as input (e.g. the "shapely" module).



Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Robert Kern
On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith  wrote:
>
> On May 17, 2016 1:50 AM, "Robert Kern"  wrote:
> >
> [...]
> > What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
> >
> > def spread_out_prngs(n, root_prng=None):
> >     if root_prng is None:
> >         root_prng = np.random
> >     elif not isinstance(root_prng, np.random.RandomState):
> >         root_prng = np.random.RandomState(root_prng)
> >     sprouted_prngs = []
> >     for i in range(n):
> >         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
> >         sprouted_prngs.append(np.random.RandomState(seed_array))
> >     return sprouted_prngs
>
> Maybe a nice way to encapsulate this in the RandomState interface would
be a method RandomState.random_state() that generates and returns a new
child RandomState.

I disagree. This is a workaround in the absence of proper jumpahead or
guaranteed-independent streams. I would not encourage it.

> > Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:
> >
> >   https://github.com/bashtage/ng-numpy-randomstate
>
> ... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state you really want a tree of independent RNGs and the
fancy new PRNGs only provide a single flat namespace :-/. And even more
annoyingly, the tree API is actually a nicer API, because with a flat
namespace you have to know up front about all possible RNGs your code will
use, which is an unfortunate global coupling that makes it difficult to
compose programs out of independent pieces, while the
RandomState.random_state approach composes beautifully. Maybe there's some
clever way to allocate a 64-bit namespace to make it look tree-like? I'm
not sure 64 bits is really enough...

MT19937 doesn't have a "tree" any more than the others. It's the same flat
state space. You are just getting the illusion of a tree by hoping that you
never collide. You ought to think about precisely the same global coupling
issues with MT19937 as you do with guaranteed-independent streams.
Hope-and-prayer isn't really a substitute for properly engineering your
problem. It's just a moral hazard to promote this method to the main API.

--
Robert Kern


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Nathaniel Smith
On May 17, 2016 1:50 AM, "Robert Kern"  wrote:
>
[...]
> What you want is a function that returns many RandomState objects that
are hopefully spread around the MT19937 space enough that they are
essentially independent (in the absence of true jumpahead). The better
implementation of such a function would look something like this:
>
> def spread_out_prngs(n, root_prng=None):
>     if root_prng is None:
>         root_prng = np.random
>     elif not isinstance(root_prng, np.random.RandomState):
>         root_prng = np.random.RandomState(root_prng)
>     sprouted_prngs = []
>     for i in range(n):
>         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>         sprouted_prngs.append(np.random.RandomState(seed_array))
>     return sprouted_prngs

Maybe a nice way to encapsulate this in the RandomState interface would be
a method RandomState.random_state() that generates and returns a new child
RandomState.

> Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:
>
>   https://github.com/bashtage/ng-numpy-randomstate

... But unfortunately I'm not sure how to make my interface suggestion
above work on top of one of these RNGs, because for
RandomState.random_state you really want a tree of independent RNGs and the
fancy new PRNGs only provide a single flat namespace :-/. And even more
annoyingly, the tree API is actually a nicer API, because with a flat
namespace you have to know up front about all possible RNGs your code will
use, which is an unfortunate global coupling that makes it difficult to
compose programs out of independent pieces, while the
RandomState.random_state approach composes beautifully. Maybe there's some
clever way to allocate a 64-bit namespace to make it look tree-like? I'm
not sure 64 bits is really enough...

-n


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Eric Moore
On Tue, May 17, 2016 at 9:40 AM, Sturla Molden 
wrote:

> Stephan Hoyer  wrote:
> > I have recently encountered several use cases for randomly generated
> > random number seeds:
> >
> > 1. When writing a library of stochastic functions that take a seed as an
> > input argument, and some of these functions call multiple other such
> > stochastic functions. Dask is one such example [1].
> >
> > 2. When a library needs to produce results that are reproducible after
> > calling numpy.random.seed, but that do not want to use the functions in
> > numpy.random directly. This came up recently in a pandas pull request
> [2],
> > because we want to allow using RandomState objects as an alternative to
> > global state in numpy.random. A major advantage of this approach is that
> it
> > provides an obvious alternative to reusing the private
> numpy.random._mtrand
> > [3].
>
>
> What about making numpy.random a finite state machine, and keeping a stack
> of RandomState seeds? That is, something similar to what OpenGL does for
> its matrices? Then we get two functions, numpy.random.push_seed and
> numpy.random.pop_seed.
>

I don't like the idea of adding this kind of internal state. Having it
built into the module means that it is shared by all callers: libraries,
user code, etc. That's not the right choice when a stack of seeds could be
easily built around the RandomState object, as sketched below, if that is
really what someone needs.
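
For concreteness, a minimal sketch of such a user-side stack (a hypothetical
SeedStack class, not a proposed numpy API):

import numpy as np

class SeedStack:
    """Push/pop seed semantics kept outside numpy.random's global state."""

    def __init__(self, seed=None):
        self._stack = [np.random.RandomState(seed)]

    def push_seed(self, seed):
        self._stack.append(np.random.RandomState(seed))

    def pop_seed(self):
        if len(self._stack) == 1:
            raise IndexError("cannot pop the base RandomState")
        self._stack.pop()

    @property
    def current(self):
        return self._stack[-1]

rng = SeedStack(12345)
rng.push_seed(0)
x = rng.current.uniform(size=3)  # drawn from the seed-0 state
rng.pop_seed()
y = rng.current.uniform(size=3)  # back on the base state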

Eric


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Robert Kern
On Tue, May 17, 2016 at 2:40 PM, Sturla Molden 
wrote:
>
> Stephan Hoyer  wrote:
> > I have recently encountered several use cases for randomly generated
> > random number seeds:
> >
> > 1. When writing a library of stochastic functions that take a seed as an
> > input argument, and some of these functions call multiple other such
> > stochastic functions. Dask is one such example [1].
> >
> > 2. When a library needs to produce results that are reproducible after
> > calling numpy.random.seed, but that do not want to use the functions in
> > numpy.random directly. This came up recently in a pandas pull request
[2],
> > because we want to allow using RandomState objects as an alternative to
> > global state in numpy.random. A major advantage of this approach is
that it
> > provides an obvious alternative to reusing the private
numpy.random._mtrand
> > [3].
>
> What about making numpy.random a finite state machine, and keeping a stack
> of RandomState seeds? That is, something similar to what OpenGL does for
> its matrices? Then we get two functions, numpy.random.push_seed and
> numpy.random.pop_seed.

I don't think that addresses the issues brought up here. It's just more
global state to worry about.

--
Robert Kern


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Sturla Molden
Stephan Hoyer  wrote:
> I have recently encountered several use cases for randomly generated random
> number seeds:
> 
> 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
> 
> 2. When a library needs to produce results that are reproducible after
> calling numpy.random.seed, but that do not want to use the functions in
> numpy.random directly. This came up recently in a pandas pull request [2],
> because we want to allow using RandomState objects as an alternative to
> global state in numpy.random. A major advantage of this approach is that it
> provides an obvious alternative to reusing the private numpy.random._mtrand
> [3].


What about making numpy.random a finite state machine, and keeping a stack
of RandomState seeds? That is, something similar to what OpenGL does for
its matrices? Then we get two functions, numpy.random.push_seed and
numpy.random.pop_seed.


Sturla



Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread josef.pktd
On Tue, May 17, 2016 at 4:49 AM, Robert Kern  wrote:

> On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer  wrote:
> >
> > On Tue, May 17, 2016 at 12:18 AM, Robert Kern 
> wrote:
> >>
> >> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer 
> wrote:
> >> > 1. When writing a library of stochastic functions that take a seed as
> an input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
> >>
> >> Can you clarify the use case here? I don't really know what you are
> doing here, but I'm pretty sure this is not the right approach.
> >
> > Here's a contrived example. Suppose I've written a simulator for cars
> that consists of a number of loosely connected components (e.g., an engine,
> brakes, etc.). The behavior of each component of our simulator is
> stochastic, but we want everything to be fully reproducible, so we need to
> use seeds or RandomState objects.
> >
> > We might write our simulate_car function like the following:
> >
> > def simulate_car(engine_config, brakes_config, seed=None):
> >     rs = np.random.RandomState(seed)
> >     engine = simulate_engine(engine_config, seed=rs.random_seed())
> >     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
> >     ...
> >
> > The problem with passing the same RandomState object (either explicitly
> or dropping the seed argument entirely and using the global state) to both
> simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
> change what I do inside simulate_engine, it also affects the brakes.
>
> That's a little too contrived, IMO. In most such simulations, the
> different components interact with each other in the normal course of the
> simulation; that's why they are both joined together in the same simulation
> instead of being two separate runs. Unless the components are being run
> across a process or thread boundary (a la dask below) where true
> nondeterminism comes into play, I don't think you want these
> semi-independent streams. This seems to be the advice du jour from the
> agent-based modeling community.
>


A similar use case where I had to switch to using several RandomStates:

In a Monte Carlo experiment with increasing sample size, I want two random
variables, x, y, to have the same draws in the common initial
observations.

If I draw x and y sequentially, and then increase the number of
observations for the simulation, then it completely changes the draws for
the second variable if they use a common RandomState.

With separate random states, increasing from 1000 to 1200 observations
leaves the first 1000 draws unchanged.
(This reduces the Monte Carlo noise, for example when calculating the power
of a hypothesis test as a function of the sample size.)
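
A short sketch of the effect (hypothetical sizes; normal draws standing in
for the two variables):

import numpy as np

# One RandomState per variable: growing the sample leaves early draws fixed.
rs_x, rs_y = np.random.RandomState(1), np.random.RandomState(2)
x_small, y_small = rs_x.normal(size=1000), rs_y.normal(size=1000)

rs_x, rs_y = np.random.RandomState(1), np.random.RandomState(2)
x_big, y_big = rs_x.normal(size=1200), rs_y.normal(size=1200)

assert (x_small == x_big[:1000]).all()
assert (y_small == y_big[:1000]).all()  # fails with a single shared state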

Josef


>
>
> > The dask use case is actually pretty different -- the intent is to
> create many random numbers in parallel using multiple threads or processes
> (possibly in a distributed fashion). I know that skipping ahead is the
> standard way to get independent number streams for parallel sampling, but
> that isn't exposed in numpy.random, and setting distinct seeds seems like a
> reasonable alternative for scientific computing use cases.
>
> Forget about integer seeds. Those are for human convenience. If you're not
> jotting them down in your lab notebook in pen, you don't want an integer
> seed.
>
> What you want is a function that returns many RandomState objects that are
> hopefully spread around the MT19937 space enough that they are essentially
> independent (in the absence of true jumpahead). The better implementation
> of such a function would look something like this:
>
> def spread_out_prngs(n, root_prng=None):
>     if root_prng is None:
>         root_prng = np.random
>     elif not isinstance(root_prng, np.random.RandomState):
>         root_prng = np.random.RandomState(root_prng)
>     sprouted_prngs = []
>     for i in range(n):
>         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>         sprouted_prngs.append(np.random.RandomState(seed_array))
>     return sprouted_prngs
>
> Internally, this generates seed arrays of about the size of the MT19937
> state to make sure that you can access more of the state space. That will
> at least make the chance of collision tiny. And it can be easily rewritten
> to take advantage of one of the newer PRNGs that have true independent
> streams:
>
>   https://github.com/bashtage/ng-numpy-randomstate
>
> --
> Robert Kern
>


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Sturla Molden
Matěj Týč  wrote:

>  - Parallel processing of HUGE data, and

This is mainly a Windows problem, as copy-on-write fork() will solve this
on any other platform. I am more in favor of asking Microsoft to fix their
broken OS. 

Also observe that the usefulness of shared memory is very limited on
Windows, as we in practice never get the same base address in a spawned
process. This prevents sharing data structures with pointers and Python
objects. Anything more complex than an array cannot be shared.

What this means is that shared memory is seldom useful for sharing huge
data, even on Windows. It is only useful for this on Unix/Linux, where base
addresses can stay the same. But on non-Windows platforms, the COW will in
99.99% of the cases be sufficient, thus making shared memory superfluous
anyway. We don't need shared memory to scatter large data on Linux, only
fork.
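
A minimal sketch of that point (hypothetical sizes; the "fork" start method
does not exist on Windows):

import multiprocessing as mp

import numpy as np

def worker(i):
    # `big` was inherited copy-on-write via fork(): the child reads the
    # parent's pages directly; nothing is pickled or sent over a pipe.
    return float(big[i * 1000])

if __name__ == "__main__":
    big = np.arange(int(1e7), dtype=np.float64)  # ~80 MB in the parent
    ctx = mp.get_context("fork")  # Unix-only
    with ctx.Pool(4) as pool:
        print(pool.map(worker, range(8)))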

As I see it, shared memory is mostly useful as a means to construct an
inter-process communication (IPC) protocol. 

Sturla



Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Matěj Týč
On 11.5.2016 10:29, Sturla Molden wrote:
> I did some work on this some years ago. ...
>
I am sorry, I have missed this discussion when it started.

There are two cases where I felt that I had to use this functionality:

 - Parallel processing of HUGE data, and

 - using parallel processing in an application that had plug-ins which
operated on one shared array (that was updated every now and then - it
was a producer-consumer pattern thing). Once everything got set up, it
worked like a charm.

The thing I especially like about the proposed module is the lack of
external dependencies, plus it works if one knows how to use it.

The bad thing about it is its fragility - I admit that using it as-is
is not particularly intuitive. Unlike Sturla, I think that this is not a
dead end, but it indeed feels clumsy. However, I dislike the necessity
of writing Cython or C to get true multithreading for reasons I have
mentioned - what if you want to run high-level Python functions in parallel?

So, what I would really like to see is some kind of numpy documentation
on how to approach parallel computing with numpy arrays (depending on
what kind of task one wants to achieve). Maybe just using the queue is
good enough, or there are those third-party modules with known
limitations? Plenty of people start off with numpy, so some kind of
overview should be part of numpy docs.




Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Robert Kern
On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer  wrote:
>
> On Tue, May 17, 2016 at 12:18 AM, Robert Kern 
wrote:
>>
>> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer  wrote:
>> > 1. When writing a library of stochastic functions that take a seed as
an input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].
>>
>> Can you clarify the use case here? I don't really know what you are
doing here, but I'm pretty sure this is not the right approach.
>
> Here's a contrived example. Suppose I've written a simulator for cars
that consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.
>
> We might write our simulate_car function like the following:
>
> def simulate_car(engine_config, brakes_config, seed=None):
>     rs = np.random.RandomState(seed)
>     engine = simulate_engine(engine_config, seed=rs.random_seed())
>     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
>     ...
>
> The problem with passing the same RandomState object (either explicitly
or dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.

That's a little too contrived, IMO. In most such simulations, the different
components interact with each other in the normal course of the simulation;
that's why they are both joined together in the same simulation instead of
being two separate runs. Unless the components are being run across a
process or thread boundary (a la dask below) where true nondeterminism
comes into play, I don't think you want these semi-independent
streams. This seems to be the advice du jour from the agent-based modeling
community.

> The dask use case is actually pretty different -- the intent is to create
many random numbers in parallel using multiple threads or processes
(possibly in a distributed fashion). I know that skipping ahead is the
standard way to get independent number streams for parallel sampling, but
that isn't exposed in numpy.random, and setting distinct seeds seems like a
reasonable alternative for scientific computing use cases.

Forget about integer seeds. Those are for human convenience. If you're not
jotting them down in your lab notebook in pen, you don't want an integer
seed.

What you want is a function that returns many RandomState objects that are
hopefully spread around the MT19937 space enough that they are essentially
independent (in the absence of true jumpahead). The better implementation
of such a function would look something like this:

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

Internally, this generates seed arrays of about the size of the MT19937
state to make sure that you can access more of the state space. That will
at least make the chance of collision tiny. And it can be easily rewritten
to take advantage of one of the newer PRNGs that have true independent
streams:

  https://github.com/bashtage/ng-numpy-randomstate
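
Hypothetical usage of the function above -- the whole family of streams is
reproducible from the single root seed:

import numpy as np  # spread_out_prngs from above assumes this alias

prngs = spread_out_prngs(3, root_prng=12345)
draws = [prng.standard_normal(4) for prng in prngs]

# Same root seed => the same family of streams, stream by stream.
prngs2 = spread_out_prngs(3, root_prng=12345)
draws2 = [prng.standard_normal(4) for prng in prngs2]
assert all((d1 == d2).all() for d1, d2 in zip(draws, draws2))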

--
Robert Kern


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Stephan Hoyer
On Tue, May 17, 2016 at 12:18 AM, Robert Kern  wrote:

> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer  wrote:
> > 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
>
> Can you clarify the use case here? I don't really know what you are doing
> here, but I'm pretty sure this is not the right approach.
>

Here's a contrived example. Suppose I've written a simulator for cars that
consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.

We might write our simulate_car function like the following:

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...

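Note that random_seed() above is the method this proposal would add, not an
existing numpy API. A rough stand-in with the intended semantics -- draw a
fresh integer seed from the parent state -- might look like:

import numpy as np

def random_seed(rs):
    # hypothetical stand-in for the proposed RandomState.random_seed()
    return rs.randint(1 << 32)  # dtype=np.uint32 under numpy 1.11

rs = np.random.RandomState(42)
engine_seed = random_seed(rs)  # i.e. seed=random_seed(rs) in the sketch above
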
The problem with passing the same RandomState object (either explicitly or
dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.

The dask use case is actually pretty different -- the intent is to create
many random numbers in parallel using multiple threads or processes
(possibly in a distributed fashion). I know that skipping ahead is the
standard way to get independent number streams for parallel sampling, but
that isn't exposed in numpy.random, and setting distinct seeds seems like a
reasonable alternative for scientific computing use cases.


> It's only pseudo-private. This is an authorized use of it.
>
> However, for this case, I usually just pass around the numpy.random
> module itself and let duck-typing take care of the rest.
>

I like the duck-typing approach. That's very elegant.

If this is an authorized use of the global RandomState object, let's
document it! Otherwise cautious library maintainers like myself will
discourage using it :).


> > [3] On a side note, if there's no longer a good reason to keep this
> object private, perhaps we should expose it in our public API. It would
> certainly be useful -- scikit-learn is already using it (see links in the
> pandas PR above).
>
> Adding a public get_global_random_state() function might be in order.
>

Yes, possibly.


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Robert Kern
On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer  wrote:
>
> I have recently encountered several use cases for randomly generated
random number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].

Can you clarify the use case here? I don't really know what you are doing
here, but I'm pretty sure this is not the right approach.

> 2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but that do not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private numpy.random._mtrand
[3].

It's only pseudo-private. This is an authorized use of it.

However, for this case, I usually just pass around the numpy.random
module itself and let duck-typing take care of the rest.
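
A tiny sketch of that pattern (a hypothetical function; the module and a
RandomState instance expose the same drawing methods):

import numpy as np

def noisy_measurement(n, random_state=np.random):
    # Accepts the numpy.random module itself or any RandomState instance.
    return random_state.normal(size=n)

noisy_measurement(5)                               # uses the global state
noisy_measurement(5, np.random.RandomState(1234))  # isolated and reproducible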

> [3] On a side note, if there's no longer a good reason to keep this
object private, perhaps we should expose it in our public API. It would
certainly be useful -- scikit-learn is already using it (see links in the
pandas PR above).

Adding a public get_global_random_state() function might be in order.
Originally, I wanted there to be *some* barrier to entry, but just grabbing
it to use as a default RandomState object is definitely an intended use of
it. It's not going to disappear.

--
Robert Kern