Why broadcast this list at all? You should use an RDD or DataFrame instead. For
example, an RDD has a sample() method that returns a random sample of its elements.
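
A rough sketch of one way to do this without a broadcast, assuming the items are already in an RDD of floats (for instance read from a file with sc.textFile) and that users are identified by the integers 0..n_users-1. The function names, the per-user seeding scheme and the parameter k (items per user) are made up for illustration and are not from the original question. Each user draws k random item indices on the workers, and a join on the index fetches the values.

```python
import random


def sample_indices(user_id, n_items, k, seed=0):
    """Pick k distinct item indices for one user.

    The seed is derived from user_id so the draw is reproducible per user
    (hypothetical seeding scheme, chosen only for this sketch).
    """
    rng = random.Random(seed * 1000003 + user_id)
    return rng.sample(range(n_items), k)


def sample_items_per_user(items_rdd, users_rdd, k, seed=0):
    """Return an RDD of (user_id, [k sampled item values]).

    items_rdd: RDD of floats; users_rdd: RDD of integer user ids.
    Nothing is broadcast: items stay distributed and are joined by index.
    """
    n_items = items_rdd.count()
    # zipWithIndex yields (value, index); flip to (index, value) for the join.
    items_by_index = items_rdd.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    # One (index, user_id) record for every item a user wants.
    wanted = users_rdd.flatMap(
        lambda u: [(i, u) for i in sample_indices(u, n_items, k, seed)]
    )
    # join gives (index, (user_id, value)); keep (user_id, value) and regroup.
    return (
        wanted.join(items_by_index)
        .map(lambda kv: (kv[1][0], kv[1][1]))
        .groupByKey()
        .mapValues(list)
    )
```

With an existing SparkContext sc, this could be called as sample_items_per_user(sc.parallelize(items), sc.parallelize(range(n_users)), k=10). The join shuffles n_users * k records, so it is only a starting point. If a single shared random subset is enough, items_rdd.sample(False, fraction, seed) on its own avoids the join entirely.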
On 11 April 2018 at 22:34, surender kumar <skiit...@yahoo.co.uk.invalid> wrote:
> I'm using PySpark.
> I have a list of 1 million items (all float values) and 1 million users. For
> each user I want to randomly sample some items from the item list.
> Broadcasting the item list results in an OutOfMemory error on the driver;
> I tried setting driver memory up to 10G. I also tried persisting this array to
> disk, but I can't figure out a way to read it on the workers.
> Any suggestions would be appreciated.