[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990795#comment-14990795 ]

Reynold Xin commented on SPARK-11303:
-------------------------------------

This made it into 1.5.2.

> sample (without replacement) + filter returns wrong results in DataFrame
> ------------------------------------------------------------------------
>
>                 Key: SPARK-11303
>                 URL: https://issues.apache.org/jira/browse/SPARK-11303
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: pyspark local mode, linux.
>            Reporter: Yuval Tanny
>            Assignee: Yanbo Liang
>             Fix For: 1.5.2, 1.6.0
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent
> results when the sampled DataFrame is not cached. This bug doesn't appear in
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978406#comment-14978406 ]

Michael Armbrust commented on SPARK-11303:
------------------------------------------

I picked it into branch-1.5, but I'm not sure if it made the cutoff. [~rxin]?
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975918#comment-14975918 ]

Apache Spark commented on SPARK-11303:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9294
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976317#comment-14976317 ]

Yuval Tanny commented on SPARK-11303:
-------------------------------------

Is the fix going to be merged into branch-1.5 (and released in 1.5.2)? Thanks!
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973993#comment-14973993 ]

Yanbo Liang commented on SPARK-11303:
-------------------------------------

It looks like this bug is caused by a mutable-row copy problem similar to SPARK-4963. But even after adding *copy* to *sample*, the issue is still not resolved. I found that *map(_copy())* was removed by https://github.com/apache/spark/pull/8040/files. [~rxin], could you tell us the motivation for removing *map(_copy())* in that PR?
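The mutable-row reuse pitfall described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark's actual internals: *MutableRow* and *scan()* are hypothetical stand-ins for an operator that recycles one row buffer per partition, and a deterministic "every 10th row" sampler replaces random sampling so the output is reproducible.

```python
# Hypothetical sketch of the mutable-row reuse bug: not Spark's actual
# internals. A consumer that retains references to a reused row buffer
# without copying sees only the last value scanned.
import copy

class MutableRow:
    def __init__(self):
        self.t = None

def scan(values):
    # Simulates an operator that reuses a single mutable row.
    row = MutableRow()
    for v in values:
        row.t = v      # overwrite the shared buffer in place
        yield row      # the consumer must copy if it keeps a reference

def sample_no_copy(rows):
    # Keeps references to the shared buffer -- the bug.
    return [r for i, r in enumerate(rows) if i % 10 == 0]

def sample_with_copy(rows):
    # Defensive copy at sampling time -- the fix.
    return [copy.copy(r) for i, r in enumerate(rows) if i % 10 == 0]

data = [1] * 50 + [2] * 50   # same data as the reproduction above

bad = sample_no_copy(scan(data))
good = sample_with_copy(scan(data))

# Every retained "row" in `bad` aliases one buffer holding the last
# value scanned, so a filter on t = 1 would wrongly count zero rows.
print([r.t for r in bad])    # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
print([r.t for r in good])   # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

This also suggests why *cache()* changes the observed counts in the reproduction: materializing the sampled rows fixes their contents at one point in time, whereas each uncached scan re-reads the shared buffers.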
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973896#comment-14973896 ]

Yanbo Liang commented on SPARK-11303:
-------------------------------------

I think the cause of this bug is the same as SPARK-4963; I will send a PR to resolve it.