[jira] [Commented] (SPARK-4019) Repartitioning with more than 2000 partitions may drop all data when partitions are mostly empty or cause deserialization errors if at least one partition is empty

2014-10-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180737#comment-14180737
 ] 

Josh Rosen commented on SPARK-4019:
---

This also explains another occurrence of the Snappy PARSING_ERROR(2) error.  If 
the average block size is non-zero but at least one block is zero-sized, then 
HighlyCompressedMapStatus causes us to fetch empty blocks, and Snappy raises 
PARSING_ERROR(2) when it tries to decompress an empty block.

Thanks to [~ilikerps] for helping to figure this out.
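The failure mode described above can be sketched with plain arithmetic. This is a hypothetical illustration, not Spark's actual MapStatus code; the block sizes are made up. It shows how a status that reports every block at the average size makes an empty block look fetchable:

```scala
// Hypothetical sketch (not Spark's actual HighlyCompressedMapStatus):
// when every block is reported at the average size, an empty block is
// fetched as if it held data, and decompressing zero bytes fails.
object EmptyBlockSketch {
  def main(args: Array[String]): Unit = {
    val blockSizes = Array(200L, 200L, 0L)           // one genuinely empty block
    val avgSize = blockSizes.sum / blockSizes.length // 400 / 3 == 133, non-zero

    // Every block, including the empty one, is reported at avgSize:
    val reported = blockSizes.map(_ => avgSize)

    // The reducer fetches block 2 (reported as 133 bytes) and hands its
    // zero actual bytes to the decompressor, which then fails to parse.
    assert(reported(2) > 0 && blockSizes(2) == 0)
    println("empty block reported as " + reported(2) + " bytes")
  }
}
```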

 Repartitioning with more than 2000 partitions may drop all data when 
 partitions are mostly empty or cause deserialization errors if at least one 
 partition is empty
 ---

 Key: SPARK-4019
 URL: https://issues.apache.org/jira/browse/SPARK-4019
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Josh Rosen
Priority: Blocker

 {code}
 sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
 {code}
 returns `Array()`.
 1.1.0 doesn't have this issue. Tried both the HASH and SORT shuffle managers.
 This problem can also manifest itself as Snappy deserialization errors if the 
 average map output status size is non-zero but there is at least one empty 
 partition, e.g.
 {code}
 sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
 {code}
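The empty-Array() result above comes down to integer division. A minimal sketch (again hypothetical, not Spark's implementation; the 100-byte block size is assumed for illustration) of how keeping only the average block size drops all data when most of the 2001 blocks are empty:

```scala
// Hypothetical sketch: 2001 shuffle blocks, only 10 of which hold data.
// Keeping just the integer average of the sizes rounds to zero, so every
// block looks empty and nothing is fetched.
object AvgToZeroSketch {
  def main(args: Array[String]): Unit = {
    val blockSizes = Array.tabulate(2001)(i => if (i < 10) 100L else 0L)

    val avgSize = blockSizes.sum / blockSizes.length // 1000 / 2001 == 0

    // Reducers see size 0 for every block, skip all fetches, and
    // collect() comes back empty.
    assert(avgSize == 0L)
    println("average block size reported as " + avgSize)
  }
}
```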



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4019) Repartitioning with more than 2000 partitions may drop all data when partitions are mostly empty.

2014-10-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178056#comment-14178056
 ] 

Patrick Wendell commented on SPARK-4019:


Great work getting to the root cause of this [~joshrosen] - this was a tricky 
issue.




[jira] [Commented] (SPARK-4019) Repartitioning with more than 2000 partitions may drop all data when partitions are mostly empty.

2014-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177809#comment-14177809
 ] 

Apache Spark commented on SPARK-4019:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2866
