The Python workers used for each stage may be different, so this may not work as expected.
You can create a Random object, set its seed, and use it to do the shuffle():

    r = random.Random()
    r.seed(my_seed)

    def f(x):
        r.shuffle(x)
        return x

    rdd.map(f)

On Thu, May 14, 2015 at 6:21 AM, Charles Hayden <[email protected]> wrote:

> Thanks for the reply.
>
> I have not tried it out (I will today and report on my results), but I think
> what I need to do is to call mapPartitions and pass it a function that sets
> the seed. I was planning to pass the seed value in the closure.
>
> Something like:
>
>     my_seed = 42
>     def f(iterator):
>         random.seed(my_seed)
>         for x in iterator:
>             yield x
>     rdd.mapPartitions(f)
>
> ________________________________
> From: ayan guha <[email protected]>
> Sent: Thursday, May 14, 2015 2:29 AM
> To: Charles Hayden
> Cc: user
> Subject: Re: how to set random seed
>
> Sorry for the late reply.
>
> Here is what I was thinking:
>
>     import random as r
>
>     def main():
>         # ... get SparkContext
>         # Just for fun, let's assume the seed is an id
>         filename = "bin.dat"
>         seed = id(filename)
>         # broadcast it
>         br = sc.broadcast(seed)
>
>         # set up a dummy list
>         lst = []
>         for i in range(4):
>             x = []
>             for j in range(4):
>                 x.append(j)
>             lst.append(x)
>         print lst
>
>         base = sc.parallelize(lst)
>         print base.map(randomize).collect()
>
> randomize looks like:
>
>     def randomize(lst):
>         local_seed = br.value
>         r.seed(local_seed)
>         r.shuffle(lst)
>         return lst
>
> Let me know if this helps...
>
> On Wed, May 13, 2015 at 11:41 PM, Charles Hayden <[email protected]> wrote:
>>
>> Can you elaborate? Broadcast will distribute the seed, which is only one
>> number. But what construct do I use to "plant" the seed (call
>> random.seed()) once on each worker?
>>
>> ________________________________
>> From: ayan guha <[email protected]>
>> Sent: Tuesday, May 12, 2015 11:17 PM
>> To: Charles Hayden
>> Cc: user
>> Subject: Re: how to set random seed
>>
>> Easiest way is to broadcast it.
>>
>> On 13 May 2015 10:40, "Charles Hayden" <[email protected]> wrote:
>>>
>>> In pySpark, I am writing a map with a lambda that calls random.shuffle.
>>> For testing, I want to be able to give it a seed, so that successive runs
>>> will produce the same shuffle.
>>> I am looking for a way to set this same random seed once on each worker.
>>> Is there any simple way to do it?
>
> --
> Best Regards,
> Ayan Guha
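Putting the thread's suggestions together, here is a minimal self-contained sketch (not code from the thread itself) of the broadcast-plus-mapPartitions approach: broadcast the seed, seed the RNG once per partition, and shuffle each row. The names and values (my_seed, shuffle_rows, the dummy data, the app name) are illustrative assumptions.

    # Sketch only: seeds Python's RNG once per partition so that repeated
    # runs shuffle each partition's rows the same way.
    import random

    from pyspark import SparkContext

    sc = SparkContext(appName="seeded-shuffle-example")

    my_seed = 42                      # hypothetical test seed
    seed_bc = sc.broadcast(my_seed)   # ship the seed to every executor

    def shuffle_rows(iterator):
        # Runs once per partition: seed the module-level RNG, then shuffle
        # each row (a list) in place and pass it through.
        random.seed(seed_bc.value)
        for row in iterator:
            random.shuffle(row)
            yield row

    data = [[0, 1, 2, 3] for _ in range(4)]   # dummy data, as in the thread
    rdd = sc.parallelize(data, numSlices=2)

    print(rdd.mapPartitions(shuffle_rows).collect())
    sc.stop()

Note that the result is only reproducible across runs if the data is partitioned the same way each time; since, as noted at the top of the thread, the Python workers used for each stage can differ, seeding per partition (inside the task) is more reliable than trying to seed each worker once.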
