Thanks for the reply.
I have not tried it out (I will today and report on my results), but I think
what I need to do is to call mapPartitions and pass it a function that sets the
seed. I was planning to pass the seed value in the closure.
Something like:
import random

my_seed = 42

def f(iterator):
    # runs once per partition, so the seed gets set once on each worker
    random.seed(my_seed)
    yield my_seed

rdd.mapPartitions(f)
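For the shuffle case, the same idea would look roughly like this (a rough sketch, still
untested; the function name is just a placeholder and it assumes the RDD elements are lists):

import random

my_seed = 42

def seed_and_shuffle(iterator):
    random.seed(my_seed)        # seed once per partition
    for lst in iterator:
        random.shuffle(lst)     # every shuffle in this partition is now reproducible
        yield lst

rdd.mapPartitions(seed_and_shuffle).collect()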
________________________________
From: ayan guha <[email protected]>
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed
Sorry for the late reply.
Here is what I was thinking....
import random as r
from pyspark import SparkContext

def main():
    global br                        # so randomize() below can see the broadcast
    sc = SparkContext("local")       # get a SparkContext (local mode for this example)
    # Just for fun, let's assume the seed is an id
    filename = "bin.dat"
    seed = id(filename)
    # broadcast it
    br = sc.broadcast(seed)
    # set up a dummy list of lists
    lst = []
    for i in range(4):
        x = []
        for j in range(4):
            x.append(j)
        lst.append(x)
    print(lst)
    base = sc.parallelize(lst)
    print(base.map(randomize).collect())
randomize() looks like:

def randomize(lst):
    local_seed = br.value        # read the broadcast seed on the worker
    r.seed(local_seed)
    r.shuffle(lst)
    return lst
Let me know if this helps...
On Wed, May 13, 2015 at 11:41 PM, Charles Hayden
<[email protected]> wrote:
Can you elaborate? Broadcast will distribute the seed, which is only one
number. But what construct do I use to "plant" the seed (call random.seed())
once on each worker?
________________________________
From: ayan guha <[email protected]>
Sent: Tuesday, May 12, 2015 11:17 PM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed
Easiest way is to broadcast it.
On 13 May 2015 10:40, "Charles Hayden"
<[email protected]> wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will
produce the same shuffle.
I am looking for a way to set this same random seed once on each worker. Is
there any simple way to do it?
--
Best Regards,
Ayan Guha