[jira] [Commented] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715474#comment-15715474 ]

Apache Spark commented on SPARK-16589:
--------------------------------------

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/16121

> Chained cartesian produces incorrect number of records
> ------------------------------------------------------
>
>                 Key: SPARK-16589
>                 URL: https://issues.apache.org/jira/browse/SPARK-16589
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>            Reporter: Maciej Szymkiewicz
>              Labels: correctness
>
> Chaining cartesian calls in PySpark results in a number of records lower
> than expected. It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> rdd.cartesian(rdd).cartesian(rdd).distinct().count()
> ## 251
> {code}
> It looks like it is related to serialization. If we reserialize after the
> initial cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 1)).cartesian(rdd).count()
> ## 1000
> {code}
> or insert an identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> ## 1000
> {code}
> it yields correct results.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
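As a sanity check on the counts quoted in the description, the expected size of the chained Cartesian product can be computed without Spark using `itertools.product` as a reference implementation (this is only an illustration of the expected result, not the PySpark code path):

```python
from itertools import product

data = list(range(10))

# A chained Cartesian product of a 10-element dataset with itself twice
# should contain 10 * 10 * 10 = 1000 records; PySpark instead reported 355.
expected = len(list(product(product(data, data), data)))
print(expected)  # 1000
```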
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545239#comment-15545239 ]

Maciej Szymkiewicz commented on SPARK-16589:
--------------------------------------------

Not actively, so if you want to give it a shot, go ahead. I investigated a bit deeper and tried to fix it closer to the source, but that ended in a tangle of special cases, which makes me think we should never expose data requiring `CartesianDeserializer` directly (there is also a SPARK-16589).
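The exact mechanics live in PySpark's `CartesianDeserializer`, but the general failure mode being discussed — a reader that undercounts because it assumes a different batching than the writer actually used — can be sketched in plain Python. This is an illustration of the class of bug, not the actual Spark code:

```python
import io
import pickle

records = list(range(10))

# Writer side: batch the records, one pickle frame per batch of 4.
buf = io.BytesIO()
for i in range(0, len(records), 4):
    pickle.dump(records[i:i + 4], buf)

# Broken reader: assumes one record per pickle frame, so it counts
# frames (3) instead of records (10) -- analogous to counting batched
# cartesian output as if it were unbatched.
buf.seek(0)
wrong_count = 0
try:
    while True:
        pickle.load(buf)
        wrong_count += 1
except EOFError:
    pass

print(wrong_count)  # 3, not 10
```

Reserializing with `BatchedSerializer(PickleSerializer(), 1)`, as in the workaround above, sidesteps this mismatch by forcing writer and reader back onto the same framing.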
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543349#comment-15543349 ]

holdenk commented on SPARK-16589:
---------------------------------

Is this something you are still actively investigating or working on?
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391108#comment-15391108 ]

Maciej Szymkiewicz commented on SPARK-16589:
--------------------------------------------

[~holdenk] Makes sense. I was thinking more about design than other possible issues, but it is probably better safe than sorry. It should still be fixed as soon as possible: it is a really ugly bug and easy to miss. I doubt there are many legitimate cases where one would do something like this, though (I guess that is why it hasn't been reported before).
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390299#comment-15390299 ]

holdenk commented on SPARK-16589:
---------------------------------

Yeah, I think we should explore what's going on in a bit more detail here - reserializing seems like it might just be hiding something that is more generally broken. I could be wrong, but I think this needs a bit more investigation before going to the PR stage.
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384235#comment-15384235 ]

Maciej Szymkiewicz commented on SPARK-16589:
--------------------------------------------

Thanks [~dongjoon]. [~joshrosen], could you take a look at this?
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382662#comment-15382662 ]

Dongjoon Hyun commented on SPARK-16589:
---------------------------------------

It looks good to me. In the PR, you can get advice faster if you ask `JoshRosen` for review.
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382620#comment-15382620 ]

Apache Spark commented on SPARK-16589:
--------------------------------------

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/14248
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382614#comment-15382614 ]

Maciej Szymkiewicz commented on SPARK-16589:
--------------------------------------------

[~dongjoon] I'll work on that, but I am not exactly sure what the best approach is here. I feel like my current fix is slightly suboptimal.
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381557#comment-15381557 ]

Dongjoon Hyun commented on SPARK-16589:
---------------------------------------

Oh, indeed, there is a bug in PySpark. Could you make a PR for this?