Yuval Tanny created SPARK-11303:
-----------------------------------

             Summary: sample (without replacement) + filter returns wrong results in DataFrame
                 Key: SPARK-11303
                 URL: https://issues.apache.org/jira/browse/SPARK-11303
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.1
         Environment: pyspark local mode, linux.
            Reporter: Yuval Tanny

When sampling and then filtering a DataFrame from Python, we get inconsistent results unless the sampled DataFrame is cached. Without caching, the counts for 't = 1' and 't != 1' (7 and 8) add up to 15, although the sample contains only 14 rows. After caching, the counts are consistent (7 + 7 = 14). This bug doesn't appear in Spark 1.4.1.

    d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
    d_sampled = d.sample(False, 0.1, 1)

    # Without caching the sampled DataFrame
    print d_sampled.count()
    print d_sampled.filter('t = 1').count()
    print d_sampled.filter('t != 1').count()

    # After caching the sampled DataFrame
    d_sampled.cache()
    print d_sampled.count()
    print d_sampled.filter('t = 1').count()
    print d_sampled.filter('t != 1').count()

output:

    14
    7
    8
    14
    7
    7

Thanks!
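For reference, the invariant that the reproduction relies on can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: `bernoulli_sample` is a hypothetical helper, not a Spark API, and its seeded sampling will not reproduce the row counts above. The second half shows one way the invariant could break, namely if the filter were applied before the sampler instead of after it. The report itself does not establish that this is the cause.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Hypothetical stand-in for sampling without replacement:
    keep each row with probability `fraction`, using a seeded generator."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]


rows = [1] * 50 + [2] * 50

# Expected semantics: sample once, then filter the fixed sample.
sampled = bernoulli_sample(rows, 0.1, 1)
ones = [r for r in sampled if r == 1]
others = [r for r in sampled if r != 1]

# Complementary filters over the same sample must partition it.
assert len(ones) + len(others) == len(sampled)

# Assumed failure mode (illustration only): filter first, then sample.
# Each query now samples a different input, so the two counts need not
# add up to the size of the unfiltered sample.
ones_first = bernoulli_sample([r for r in rows if r == 1], 0.1, 1)
others_first = bernoulli_sample([r for r in rows if r != 1], 0.1, 1)

print("sample then filter:", len(sampled), len(ones), len(others))
print("filter then sample:", len(ones_first), len(others_first))
```

A check of the form `count(t = 1) + count(t != 1) == count()` on the sampled DataFrame would flag the inconsistency reported above (7 + 8 != 14).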