[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-4148.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.1.1

Issue resolved by pull request 3104
[https://github.com/apache/spark/pull/3104]

> PySpark's sample uses the same seed for all partitions
> ------------------------------------------------------
>
>                 Key: SPARK-4148
>                 URL: https://issues.apache.org/jira/browse/SPARK-4148
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.1.1
>
>
> The current way of seed distribution makes the random sequences from
> partition i and i+1 offset by 1.
> {code}
> In [14]: import random
> In [15]: r1 = random.Random(10)
> In [16]: r1.randint(0, 1)
> Out[16]: 1
> In [17]: r1.random()
> Out[17]: 0.4288890546751146
> In [18]: r1.random()
> Out[18]: 0.5780913011344704
> In [19]: r2 = random.Random(10)
> In [20]: r2.randint(0, 1)
> Out[20]: 1
> In [21]: r2.randint(0, 1)
> Out[21]: 0
> In [22]: r2.random()
> Out[22]: 0.5780913011344704
> {code}
> So the second value from partition 1 is the same as the first value from
> partition 2.
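
The overlap described in the issue can be reproduced outside Spark with a small sketch. The helper names below (`rng_by_skipping`, `rng_by_mixing`, `draws`) are hypothetical and are not PySpark APIs. The first helper mimics the reported behaviour by seeding every partition identically and discarding one draw per partition index; it uses `random()` rather than `randint()` for the discarded draws so that the shift is exactly one value on any Python version. The second helper shows one possible remedy, deriving a distinct seed from the (seed, partition) pair with a hash. This is an assumption for illustration and not necessarily what pull request 3104 does.

```python
import hashlib
import random


def rng_by_skipping(seed, partition):
    """Hypothetical helper: same seed for every partition, then discard
    `partition` draws. This is the pattern that produces shifted streams."""
    rng = random.Random(seed)
    for _ in range(partition):
        rng.random()
    return rng


def rng_by_mixing(seed, partition):
    """Hypothetical helper: derive a per-partition seed by hashing the
    (seed, partition) pair, so neighbouring partitions start far apart."""
    digest = hashlib.sha256(f"{seed}:{partition}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def draws(rng, n=5):
    """Return the next n uniform values from rng."""
    return [rng.random() for _ in range(n)]


skip1 = draws(rng_by_skipping(10, 1))
skip2 = draws(rng_by_skipping(10, 2))
# Partition 2's stream is partition 1's stream shifted by one draw.
overlap_skipping = skip1[1:] == skip2[:-1]

mix1 = draws(rng_by_mixing(10, 1))
mix2 = draws(rng_by_mixing(10, 2))
# With hashed seeds the shifted comparison no longer lines up.
overlap_mixing = mix1[1:] == mix2[:-1]

print("skipping overlaps:", overlap_skipping)
print("mixing overlaps:", overlap_mixing)
```

With the skipping approach, a sampler that draws one value per record would make nearly the same accept/reject decisions in neighbouring partitions, just shifted by one record. Hashing the pair is only one of several ways to decorrelate the streams.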