[jira] [Commented] (SPARK-12662) Add document to randomSplit to explain the sampling depends on the ordering of the rows in a partition
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085007#comment-15085007 ]

Reynold Xin commented on SPARK-12662:
-------------------------------------

Yeah, [~yhuai] and I talked offline and thought just adding a local sort would be a better solution. It would make performance worse, but at least it guarantees correctness.

> Add document to randomSplit to explain the sampling depends on the ordering
> of the rows in a partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12662
>                 URL: https://issues.apache.org/jira/browse/SPARK-12662
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, SQL
>            Reporter: Yin Huai
>            Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following
> code produces overlapping rows in the two DataFrames returned by randomSplit.
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID: Int)
> sqlContext.createDataFrame(x.map(R(_))).write.format("json").saveAsTable("test")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")}} does not guarantee
> the ordering of rows within a partition. It would be good to document this in
> the API doc. For intersectDF to contain 0 rows, the df needs to have a fixed
> row ordering within each partition.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
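The order dependence discussed in this thread can be sketched outside Spark with a plain seeded Bernoulli sampler. This is a simplified model of randomSplit's per-row sampling, not Spark's actual implementation, and all names below are illustrative: feeding the same rows to the same seed in a different order changes which rows land in each split, while sorting first makes the result order-independent, which is the "local sort" idea.

```scala
import scala.util.Random

// Simplified model of randomSplit's per-row Bernoulli sampling (not Spark's
// actual sampler): each row gets one seeded RNG draw, in partition order,
// and lands in split "a" when the draw is below the first weight.
def split(rows: Seq[Int], frac: Double, seed: Long): (Set[Int], Set[Int]) = {
  val rng = new Random(seed)
  val tagged = rows.map(r => (r, rng.nextDouble() < frac)) // one draw per row, in order
  (tagged.collect { case (r, true) => r }.toSet,
   tagged.collect { case (r, false) => r }.toSet)
}

val rows     = (1 to 210).toSeq
val shuffled = new Random(42L).shuffle(rows) // same rows, different order

// Same seed, different in-partition order: the RNG draws hit different rows,
// so two runs over the differently ordered data generally produce different
// (and cross-overlapping) splits.
val (a1, b1) = split(rows, 0.333, 1234L)
val (a2, b2) = split(shuffled, 0.333, 1234L)

// Sorting first fixes the in-partition order, so the split no longer depends
// on how the rows arrived.
val (s1, _) = split(rows.sorted, 0.333, 1234L)
val (s2, _) = split(shuffled.sorted, 0.333, 1234L)
assert(s1 == s2)
```

Within one run the two splits are of course disjoint; the hazard in the report is across runs, e.g. when the distinct aggregation reorders rows between the jobs that materialize a and b. On the Spark side, a user-level guard would be an explicit sort before calling randomSplit; the proposal here is to build that local sort into randomSplit itself.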
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085005#comment-15085005 ]

Brian Pasley commented on SPARK-12662:
--------------------------------------

Users of randomSplit probably don't realize that the disjointness of the returned sets depends on sorted data. randomSplit is used in ML pipelines to split training/validation/test sets, which is a common operation that does not assume sorted data in general, e.g.:

http://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

If a user misses the documentation, they may end up with overlapping train/test sets without realizing it. Can we add a local sort operator, or warn the user when there is overlap?
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084109#comment-15084109 ]

Reynold Xin commented on SPARK-12662:
-------------------------------------

Seems like that should be the user's choice? We can improve the documentation.
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084106#comment-15084106 ]

Yin Huai commented on SPARK-12662:
----------------------------------

Another option is to always add a local sort operator to make sure the row ordering is deterministic. [~davies] [~rxin] What do you think?
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084429#comment-15084429 ]

Yin Huai commented on SPARK-12662:
----------------------------------

OK. Let's use this JIRA to track the work of adding the documentation.