Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 10:41 AM, Robert Kern wrote:
> On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith wrote:
>> On May 17, 2016 1:50 AM, "Robert Kern" wrote:
>> > [...]
>> > What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:
>> >
>> > def spread_out_prngs(n, root_prng=None):
>> >     if root_prng is None:
>> >         root_prng = np.random
>> >     elif not isinstance(root_prng, np.random.RandomState):
>> >         root_prng = np.random.RandomState(root_prng)
>> >     sprouted_prngs = []
>> >     for i in range(n):
>> >         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>> >         sprouted_prngs.append(np.random.RandomState(seed_array))
>> >     return sprouted_prngs
>>
>> Maybe a nice way to encapsulate this in the RandomState interface would be a method RandomState.random_state() that generates and returns a new child RandomState.
>
> I disagree. This is a workaround in the absence of proper jumpahead or guaranteed-independent streams. I would not encourage it.
>
>> > Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:
>> >
>> > https://github.com/bashtage/ng-numpy-randomstate
>>
>> ... But unfortunately I'm not sure how to make my interface suggestion above work on top of one of these RNGs, because for RandomState.random_state you really want a tree of independent RNGs and the fancy new PRNGs only provide a single flat namespace :-/.
>> And even more annoyingly, the tree API is actually a nicer API, because with a flat namespace you have to know up front about all possible RNGs your code will use, which is an unfortunate global coupling that makes it difficult to compose programs out of independent pieces, while the RandomState.random_state approach composes beautifully. Maybe there's some clever way to allocate a 64-bit namespace to make it look tree-like? I'm not sure 64 bits is really enough...
>
> MT19937 doesn't have a "tree" any more than the others. It's the same flat state space. You are just getting the illusion of a tree by hoping that you never collide. You ought to think about precisely the same global coupling issues with MT19937 as you do with guaranteed-independent streams. Hope-and-prayer isn't really a substitute for properly engineering your problem. It's just a moral hazard to promote this method to the main API.

Nonsense. If your definition of "hope and prayer" includes assuming that we won't encounter a random collision in a 2**19937 state space, then literally all engineering is hope-and-prayer. A collision could happen, but if it does it's overwhelmingly more likely to happen because of a flaw in the mathematical analysis, or a bug in the implementation, or because random quantum fluctuations caused you and your program to suddenly be transported to a parallel world where 1 + 1 = 1, than that you just got unlucky with your random state. And all of these hazards apply equally to both MT19937 and more modern PRNGs.

...anyway, the real reason I'm a bit grumpy is because there are solid engineering reasons why users *want* this API, so whether or not it turns out to be possible I think we should at least be allowed to have a discussion about whether there's some way to give it to them.
It's not even 100% out of the question that we conclude that existing PRNGs are buggy because they don't take this use case into account -- it would be far from the first time that numpy found itself going beyond the limits of older numerical tools that weren't designed to build the kind of large composable systems that numpy gets used for.

MT19937's state space is large enough that you could explicitly encode a "tree seed" into it, even if you don't trust the laws of probability -- e.g., you start with a RandomState with id [], then its children have id [0], [1], [2], ..., and their children have ids [0, 0], [0, 1], ..., [1, 0], ..., and you write these into the state (probably after sending them through some bit-mixing permutation), to guarantee non-collision.

Or if you do trust the laws of probability, then the randomly-sample-a-PRNG approach is not 100% out of the question even for more modern PRNGs. For example, let's take a PRNG with a 64-bit stream id and a 64-bit state space per stream. Suppose that we know that in our application each PRNG will be used to draw 2**48 64-bit samples (= 2 pebibytes of random data), and that we will use 2**32 PRNGs (= total of 8 yobibytes of random data). If w
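Nathaniel's "tree seed" idea can be made concrete. The sketch below is purely illustrative and not an established scheme: it derives a RandomState seed array by hashing the child's path of integers, with SHA-256 standing in for the "bit-mixing permutation"; the function name and root label are invented for this example. Distinct paths produce distinct hash inputs by construction, so siblings and cousins can never collide on their seed material.

```python
import hashlib

import numpy as np

def prng_for_path(path, root_seed=b"experiment-1"):
    """Deterministically derive a RandomState from a tree path like [0, 1].

    The path is encoded unambiguously ("/0/1"), so e.g. [1, 2] and [12]
    hash differently, then fed through SHA-256 as a bit mixer.
    """
    h = hashlib.sha256(root_seed)
    for index in path:
        h.update(b"/%d" % index)
    # Interpret the 256-bit digest as eight 32-bit words and use it as a
    # seed array (RandomState accepts arrays of uint32 words).
    seed = np.frombuffer(h.digest(), dtype=np.uint32).copy()
    return np.random.RandomState(seed)

child = prng_for_path([0, 1])
other = prng_for_path([1, 0])
```

The derivation is deterministic, so `prng_for_path([0, 1])` always yields the same stream, while `[0, 1]` and `[1, 0]` yield unrelated ones.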
Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)
Matěj Týč wrote:
> Does it mean that if you pass the numpy array to the child process using Queue, no significant amount of data will flow through it?

This is what my shared memory arrays do.

> Or I shouldn't pass it using Queue at all and just rely on inheritance?

This is what David Baddeley's shared memory arrays do.

> Finally, I assume that passing it as an argument to the Process class is the worst option, because it will be pickled and unpickled.

My shared memory arrays only pickle the metadata, and can be used in this way.

> Or maybe you refer to modules s.a. joblib that use this functionality and expose only a nice interface?

Joblib creates "shared memory" by memory mapping a temporary file, which is backed by RAM on Linux (tmpfs). It is backed by a physical file on disk on Mac and Windows. In this respect, joblib is much better on Linux than on Mac or Windows.

> And finally, cow means that returning large arrays still involves data moving between processes, whereas the shm approach has the workaround that you can preallocate the result array by the parent process, where the worker process can write to.

My shared memory arrays need no workaround for this. They also allow shared memory arrays to be returned to the parent process. No preallocation is needed.

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
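The "pickle only the metadata" pattern Sturla describes can be sketched today with the standard library's multiprocessing.shared_memory (Python 3.8+, so postdating this thread). For brevity the sketch attaches within a single process; in real use the small `meta` tuple is what would travel through a Queue to the child, not the array data.

```python
import numpy as np
from multiprocessing import shared_memory

# "Parent" side: allocate a shared block and view it as a numpy array.
shm = shared_memory.SharedMemory(create=True, size=8 * 1000)
parent_view = np.ndarray((1000,), dtype=np.float64, buffer=shm.buf)
parent_view[:] = 42.0

# Only this small tuple needs to cross the process boundary (e.g. via a
# Queue): the segment name plus enough metadata to rebuild the view.
meta = (shm.name, (1000,), np.float64)

# "Child" side: reattach by name and see the very same memory.
name, shape, dtype = meta
child_shm = shared_memory.SharedMemory(name=name)
child_view = np.ndarray(shape, dtype=dtype, buffer=child_shm.buf)
first = float(child_view[0])

# The creator is responsible for unlink(); every attacher calls close().
child_shm.close()
shm.close()
shm.unlink()
```

Note the explicit lifetime management: unlike COW fork() pages, a named shared segment outlives the process unless someone calls `unlink()`.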
Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)
On 17.5.2016 14:13, Sturla Molden wrote:
> Matěj Týč wrote:
>> - Parallel processing of HUGE data, and
> This is mainly a Windows problem, as copy-on-write fork() will solve this on any other platform. ...

That sounds interesting, could you elaborate on it a bit? Does it mean that if you pass the numpy array to the child process using Queue, no significant amount of data will flow through it? Or I shouldn't pass it using Queue at all and just rely on inheritance? Finally, I assume that passing it as an argument to the Process class is the worst option, because it will be pickled and unpickled. Or maybe you refer to modules s.a. joblib that use this functionality and expose only a nice interface?

And finally, cow means that returning large arrays still involves data moving between processes, whereas the shm approach has the workaround that you can preallocate the result array by the parent process, where the worker process can write to.

> What this means is that shared memory is seldom useful for sharing huge data, even on Windows. It is only useful for this on Unix/Linux, where base addresses can stay the same. But on non-Windows platforms, the COW will in 99.99% of the cases be sufficient, thus making shared memory superfluous anyway. We don't need shared memory to scatter large data on Linux, only fork.

I am actually quite comfortable with sharing numpy arrays only. It is a nice format for sharing large amounts of numbers, which is what I want and what many modules accept as input (e.g. the "shapely" module).
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 6:24 PM, Nathaniel Smith wrote:
> On May 17, 2016 1:50 AM, "Robert Kern" wrote:
> > [...]
> > What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:
> >
> > def spread_out_prngs(n, root_prng=None):
> >     if root_prng is None:
> >         root_prng = np.random
> >     elif not isinstance(root_prng, np.random.RandomState):
> >         root_prng = np.random.RandomState(root_prng)
> >     sprouted_prngs = []
> >     for i in range(n):
> >         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
> >         sprouted_prngs.append(np.random.RandomState(seed_array))
> >     return sprouted_prngs
>
> Maybe a nice way to encapsulate this in the RandomState interface would be a method RandomState.random_state() that generates and returns a new child RandomState.

I disagree. This is a workaround in the absence of proper jumpahead or guaranteed-independent streams. I would not encourage it.

> > Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:
> >
> > https://github.com/bashtage/ng-numpy-randomstate
>
> ... But unfortunately I'm not sure how to make my interface suggestion above work on top of one of these RNGs, because for RandomState.random_state you really want a tree of independent RNGs and the fancy new PRNGs only provide a single flat namespace :-/.
> And even more annoyingly, the tree API is actually a nicer API, because with a flat namespace you have to know up front about all possible RNGs your code will use, which is an unfortunate global coupling that makes it difficult to compose programs out of independent pieces, while the RandomState.random_state approach composes beautifully. Maybe there's some clever way to allocate a 64-bit namespace to make it look tree-like? I'm not sure 64 bits is really enough...

MT19937 doesn't have a "tree" any more than the others. It's the same flat state space. You are just getting the illusion of a tree by hoping that you never collide. You ought to think about precisely the same global coupling issues with MT19937 as you do with guaranteed-independent streams. Hope-and-prayer isn't really a substitute for properly engineering your problem. It's just a moral hazard to promote this method to the main API.

-- Robert Kern
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On May 17, 2016 1:50 AM, "Robert Kern" wrote:
> [...]
> What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:
>
> def spread_out_prngs(n, root_prng=None):
>     if root_prng is None:
>         root_prng = np.random
>     elif not isinstance(root_prng, np.random.RandomState):
>         root_prng = np.random.RandomState(root_prng)
>     sprouted_prngs = []
>     for i in range(n):
>         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>         sprouted_prngs.append(np.random.RandomState(seed_array))
>     return sprouted_prngs

Maybe a nice way to encapsulate this in the RandomState interface would be a method RandomState.random_state() that generates and returns a new child RandomState.

> Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:
>
> https://github.com/bashtage/ng-numpy-randomstate

... But unfortunately I'm not sure how to make my interface suggestion above work on top of one of these RNGs, because for RandomState.random_state you really want a tree of independent RNGs and the fancy new PRNGs only provide a single flat namespace :-/. And even more annoyingly, the tree API is actually a nicer API, because with a flat namespace you have to know up front about all possible RNGs your code will use, which is an unfortunate global coupling that makes it difficult to compose programs out of independent pieces, while the RandomState.random_state approach composes beautifully. Maybe there's some clever way to allocate a 64-bit namespace to make it look tree-like? I'm not sure 64 bits is really enough...
-n
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 9:40 AM, Sturla Molden wrote:
> Stephan Hoyer wrote:
>> I have recently encountered several use cases for randomly generating random number seeds:
>>
>> 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>>
>> 2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].
>
> What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed.

I don't like the idea of adding this kind of internal state. Having it built into the module means that it is shared by all callers: libraries, user code, etc. That's not the right choice when a stack of seeds could be easily built around the RandomState object if that is really what someone needs.

Eric
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 2:40 PM, Sturla Molden wrote:
> Stephan Hoyer wrote:
>> I have recently encountered several use cases for randomly generating random number seeds:
>>
>> 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>>
>> 2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].
>
> What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed.

I don't think that addresses the issues brought up here. It's just more global state to worry about.

-- Robert Kern
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
Stephan Hoyer wrote:
> I have recently encountered several use cases for randomly generating random number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>
> 2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].

What about making numpy.random a finite state machine, and keeping a stack of RandomState seeds? That is, something similar to what OpenGL does for its matrices? Then we get two functions, numpy.random.push_seed and numpy.random.pop_seed.

Sturla
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 4:49 AM, Robert Kern wrote:
> On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer wrote:
>> On Tue, May 17, 2016 at 12:18 AM, Robert Kern wrote:
>>> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer wrote:
>>> > 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>>>
>>> Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.
>>
>> Here's a contrived example. Suppose I've written a simulator for cars that consists of a number of loosely connected components (e.g., an engine, brakes, etc.). The behavior of each component of our simulator is stochastic, but we want everything to be fully reproducible, so we need to use seeds or RandomState objects.
>>
>> We might write our simulate_car function like the following:
>>
>> def simulate_car(engine_config, brakes_config, seed=None):
>>     rs = np.random.RandomState(seed)
>>     engine = simulate_engine(engine_config, seed=rs.random_seed())
>>     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
>>     ...
>>
>> The problem with passing the same RandomState object (either explicitly or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.
>
> That's a little too contrived, IMO. In most such simulations, the different components interact with each other in the normal course of the simulation; that's why they are both joined together in the same simulation instead of being two separate runs.
> Unless the components are being run across a process or thread boundary (a la dask below) where true nondeterminism comes into play, I don't think you want these semi-independent streams. This seems to be the advice du jour from the agent-based modeling community.

A similar use case where I had to switch to using several RandomStates: in a Monte Carlo experiment with increasing sample size, I want two random variables, x and y, to have the same draws in the common initial observations. If I draw x and y sequentially and then increase the number of observations for the simulation, it completely changes the draws for the second variable if they use a common RandomState. With separate random states, increasing from 1000 to 1200 observations leaves the first 1000 draws unchanged. (This reduces the Monte Carlo noise, for example when calculating the power of a hypothesis test as a function of the sample size.)

Josef

>> The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.
>
> Forget about integer seeds. Those are for human convenience. If you're not jotting them down in your lab notebook in pen, you don't want an integer seed.
>
> What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead).
> The better implementation of such a function would look something like this:
>
> def spread_out_prngs(n, root_prng=None):
>     if root_prng is None:
>         root_prng = np.random
>     elif not isinstance(root_prng, np.random.RandomState):
>         root_prng = np.random.RandomState(root_prng)
>     sprouted_prngs = []
>     for i in range(n):
>         seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
>         sprouted_prngs.append(np.random.RandomState(seed_array))
>     return sprouted_prngs
>
> Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:
>
> https://github.com/bashtage/ng-numpy-randomstate
>
> -- Robert Kern
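Josef's point about separate states can be made concrete. The helper names below are invented for illustration: with one RandomState per variable, growing the sample from 1000 to 1200 observations leaves the common first 1000 draws of both variables untouched, whereas with a single shared state the extra draws for x shift everything y sees.

```python
import numpy as np

def draw_separate(n, seed_x=1, seed_y=2):
    # One RandomState per variable: each has its own stream.
    rs_x = np.random.RandomState(seed_x)
    rs_y = np.random.RandomState(seed_y)
    return rs_x.standard_normal(n), rs_y.standard_normal(n)

def draw_shared(n, seed=1):
    # A single RandomState: y's draws start wherever x's stopped.
    rs = np.random.RandomState(seed)
    return rs.standard_normal(n), rs.standard_normal(n)

# Separate states: growing the sample leaves the common draws intact.
x_small, y_small = draw_separate(1000)
x_big, y_big = draw_separate(1200)
stable = np.array_equal(y_small, y_big[:1000])  # True

# Shared state: growing x's sample changes every draw y sees.
xs, ys = draw_shared(1000)
xb, yb = draw_shared(1200)
```

This is exactly the variance-reduction trick for power curves: the n=1000 and n=1200 experiments share their first 1000 observations, so differences between them reflect sample size rather than fresh noise.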
Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)
Matěj Týč wrote:
> - Parallel processing of HUGE data, and

This is mainly a Windows problem, as copy-on-write fork() will solve this on any other platform. I am more in favor of asking Microsoft to fix their broken OS.

Also observe that the usefulness of shared memory is very limited on Windows, as in practice we never get the same base address in a spawned process. This prevents sharing data structures with pointers and Python objects. Anything more complex than an array cannot be shared.

What this means is that shared memory is seldom useful for sharing huge data, even on Windows. It is only useful for this on Unix/Linux, where base addresses can stay the same. But on non-Windows platforms, the COW will in 99.99% of the cases be sufficient, thus making shared memory superfluous anyway. We don't need shared memory to scatter large data on Linux, only fork.

As I see it, shared memory is mostly useful as a means to construct an inter-process communication (IPC) protocol.

Sturla
Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)
On 11.5.2016 10:29, Sturla Molden wrote:
> I did some work on this some years ago. ...

I am sorry, I have missed this discussion when it started. There are two cases when I had the feeling that I had to use this functionality:

- Parallel processing of HUGE data, and
- using parallel processing in an application that had plug-ins which operated on one shared array (that was updated every now and then - it was a producer-consumer pattern thing). As everything got set up, it worked like a charm.

The thing I especially like about the proposed module is the lack of external dependencies + it works if one knows how to use it. The bad thing about it is its fragility - I admit that using it as it is is not particularly intuitive. Unlike Sturla, I think that this is not a dead end, but it indeed feels clumsy.

However, I dislike the necessity of writing Cython or C to get true multithreading for reasons I have mentioned - what if you want to run high-level Python functions in parallel? So, what I would really like to see is some kind of numpy documentation on how to approach parallel computing with numpy arrays (depending on what kind of task one wants to achieve). Maybe just using the queue is good enough, or there are those 3rd-party modules with known limitations? Plenty of people start off with numpy, so some kind of overview should be part of numpy docs.
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer wrote:
> On Tue, May 17, 2016 at 12:18 AM, Robert Kern wrote:
>> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer wrote:
>> > 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>>
>> Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.
>
> Here's a contrived example. Suppose I've written a simulator for cars that consists of a number of loosely connected components (e.g., an engine, brakes, etc.). The behavior of each component of our simulator is stochastic, but we want everything to be fully reproducible, so we need to use seeds or RandomState objects.
>
> We might write our simulate_car function like the following:
>
> def simulate_car(engine_config, brakes_config, seed=None):
>     rs = np.random.RandomState(seed)
>     engine = simulate_engine(engine_config, seed=rs.random_seed())
>     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
>     ...
>
> The problem with passing the same RandomState object (either explicitly or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.

That's a little too contrived, IMO. In most such simulations, the different components interact with each other in the normal course of the simulation; that's why they are both joined together in the same simulation instead of being two separate runs. Unless the components are being run across a process or thread boundary (a la dask below) where true nondeterminism comes into play, I don't think you want these semi-independent streams. This seems to be the advice du jour from the agent-based modeling community.
> The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.

Forget about integer seeds. Those are for human convenience. If you're not jotting them down in your lab notebook in pen, you don't want an integer seed.

What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        seed_array = root_prng.randint(1<<32, size=624)  # dtype=np.uint32 under 1.11
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you can access more of the state space. That will at least make the chance of collision tiny. And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:

https://github.com/bashtage/ng-numpy-randomstate

-- Robert Kern
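Assembled into a self-contained script, Robert's function runs as written; the root seed 42 below is an arbitrary choice for the demo, and `randint(1<<32, ...)` assumes a 64-bit default integer dtype (on numpy 1.11+ one can pass `dtype=np.uint32` explicitly, per the comment).

```python
import numpy as np

def spread_out_prngs(n, root_prng=None):
    """Sprout n RandomStates with full-width (624-word) seed arrays."""
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        # 624 x 32 bits: roughly the whole MT19937 state, so the children
        # are spread across far more of the state space than an integer
        # seed could reach.
        seed_array = root_prng.randint(1 << 32, size=624)
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

streams = spread_out_prngs(4, root_prng=42)
```

Because the root is seeded, the whole family of child streams is reproducible from the single integer 42, while each child still gets a full-entropy seed array.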
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 12:18 AM, Robert Kern wrote:
> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer wrote:
> > 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>
> Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.

Here's a contrived example. Suppose I've written a simulator for cars that consists of a number of loosely connected components (e.g., an engine, brakes, etc.). The behavior of each component of our simulator is stochastic, but we want everything to be fully reproducible, so we need to use seeds or RandomState objects.

We might write our simulate_car function like the following:

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...

The problem with passing the same RandomState object (either explicitly or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.

The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.

> It's only pseudo-private. This is an authorized use of it.
>
> However, for this case, I usually just pass around the numpy.random module itself and let duck-typing take care of the rest.

I like the duck-typing approach.
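Since RandomState has no random_seed() method (it is the method being proposed), a present-day approximation of Stephan's sketch draws a fresh integer seed from the parent state for each component. The component bodies below are invented purely to make the sketch runnable.

```python
import numpy as np

def simulate_engine(engine_config, seed=None):
    # Hypothetical component: one draw from its own private stream.
    rs = np.random.RandomState(seed)
    return engine_config * rs.standard_normal()

def simulate_brakes(brakes_config, seed=None):
    # Hypothetical component: likewise isolated from the engine's stream.
    rs = np.random.RandomState(seed)
    return brakes_config * rs.standard_normal()

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    # Each component gets its own derived seed, so changing how many
    # numbers simulate_engine draws cannot perturb the brakes.
    engine = simulate_engine(engine_config, seed=rs.randint(2**32))
    brakes = simulate_brakes(brakes_config, seed=rs.randint(2**32))
    return engine, brakes

run1 = simulate_car(1.0, 2.0, seed=0)
run2 = simulate_car(1.0, 2.0, seed=0)
```

Note this inherits the collision caveat from elsewhere in the thread: deriving 32-bit child seeds is fine for a handful of components, but for large parallel fan-out the full-seed-array approach covers far more of the state space.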
That's very elegant. If this is an authorized use of the global RandomState object, let's document it! Otherwise cautious library maintainers like myself will discourage using it :).

> > [3] On a side note, if there's no longer a good reason to keep this object private, perhaps we should expose it in our public API. It would certainly be useful -- scikit-learn is already using it (see links in the pandas PR above).
>
> Adding a public get_global_random_state() function might be in order.

Yes, possibly.
Re: [Numpy-discussion] Proposal: numpy.random.random_seed
On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer wrote:
> I have recently encountered several use cases for randomly generating random number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].

Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.

> 2. When a library needs to produce results that are reproducible after calling numpy.random.seed, but that does not want to use the functions in numpy.random directly. This came up recently in a pandas pull request [2], because we want to allow using RandomState objects as an alternative to global state in numpy.random. A major advantage of this approach is that it provides an obvious alternative to reusing the private numpy.random._mtrand [3].

It's only pseudo-private. This is an authorized use of it.

However, for this case, I usually just pass around the numpy.random module itself and let duck-typing take care of the rest.

> [3] On a side note, if there's no longer a good reason to keep this object private, perhaps we should expose it in our public API. It would certainly be useful -- scikit-learn is already using it (see links in the pandas PR above).

Adding a public get_global_random_state() function might be in order. Originally, I wanted there to be *some* barrier to entry, but just grabbing it to use as a default RandomState object is definitely an intended use of it. It's not going to disappear.

-- Robert Kern
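The duck-typing pattern described here is essentially what scikit-learn ships as sklearn.utils.check_random_state; a minimal sketch of the same idea, written out here for illustration, accepts None, an integer seed, the numpy.random module, or a RandomState, and returns something with the RandomState method surface:

```python
import numpy as np

def check_random_state(seed):
    """Normalize a seed argument to a RandomState-like object."""
    if seed is None:
        # The module itself duck-types as a RandomState (rand, randint, ...)
        # and uses the pseudo-private global state under the hood.
        return np.random
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % (seed,))

rs = check_random_state(123)
same = check_random_state(np.random.RandomState(123))
```

Library functions then take `seed=None` and call `check_random_state(seed)` once at the top, so callers can choose between the global stream, a reproducible integer seed, or full control via their own RandomState.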