[jira] [Commented] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-25 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973523#comment-14973523
 ] 

Kevin Jung commented on SPARK-5737:
---

Based on your comment, this should be marked as resolved. Thanks.

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
> Fix For: 1.5.1
>
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code contain null values in the earlier copies of 
> the duplicated columns.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have the duplicated values, like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}
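The null pattern above ([null,-5.7,null,121.05]) is what one would get if the scan bound projected output slots by column name, so that both requests for 'd1 map to one name and only the last slot is filled. The following toy Python sketch is an illustration of that failure mode, not Spark's actual code; all names are hypothetical:

```python
# Toy model of a scanner that binds projected output slots by column NAME:
# duplicate requests collapse onto one name, so only the LAST slot for that
# name receives the value and earlier slots stay None (i.e. null).
def scan_by_name(row, projection):
    out = [None] * len(projection)
    last_slot = {}                       # column name -> last output index
    for i, name in enumerate(projection):
        last_slot[name] = i
    for name, idx in last_slot.items():  # only the last slot is filled
        out[idx] = row[name]
    return out

# Binding by POSITION fills every slot, matching the PhysicalRDD behaviour.
def scan_by_position(row, projection):
    return [row[name] for name in projection]

row = {"d1": -5.7, "d2": 121.05}
print(scan_by_name(row, ["d1", "d1", "d2", "d2"]))      # [None, -5.7, None, 121.05]
print(scan_by_position(row, ["d1", "d1", "d2", "d2"]))  # [-5.7, -5.7, 121.05, 121.05]
```

The first output mirrors the buggy ParquetTableScan rows reported above; the second mirrors the correct PhysicalRDD rows.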



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-25 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung resolved SPARK-5737.
---
   Resolution: Fixed
Fix Version/s: 1.5.1

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
> Fix For: 1.5.1
>
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code contain null values in the earlier copies of 
> the duplicated columns.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have the duplicated values, like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}






[jira] [Commented] (SPARK-10039) Resetting REPL state does not work

2015-08-31 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723180#comment-14723180
 ] 

Kevin Jung commented on SPARK-10039:


I took a quick look at the Scala sources on GitHub to investigate this. Scala has 
not only PlainFile but also PlainDirectory, so virtualDirectory should be an 
instance of PlainDirectory. However, PlainDirectory extends PlainFile, so the 
create() method that Spark uses always creates a file. This looks like a bug in 
Scala, so I am closing this issue.

> Resetting REPL state does not work
> --
>
> Key: SPARK-10039
> URL: https://issues.apache.org/jira/browse/SPARK-10039
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.4.1
>Reporter: Kevin Jung
>Priority: Minor
>
> Spark shell can't find the base directory of the class server after running 
> the ":reset" command. 
> {quote}
> scala> :reset 
> scala> 1 
> uncaught exception during compilation: java.lang.AssertionError 
> java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
> '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
> ~~~impossible to run any command anymore, including 'exit'~~~ 
> {quote}
> I found that the reset() method in SparkIMain tries to delete virtualDirectory 
> and then create it again, but virtualDirectory.create() makes a file, not a 
> directory. Details here.
> {quote}
> drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> (After :reset)
> -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> {quote}
> "vd.delete; vd.givenPath.createDirectory(true);" will temporarily solve the 
> problem.
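The workaround quoted above ("vd.delete; vd.givenPath.createDirectory(true);") boils down to recreating the deleted path as a directory instead of letting create() produce a plain file. A small self-contained Python sketch of that file-vs-directory distinction; the path name is hypothetical, not Spark's actual layout:

```python
import shutil
import tempfile
from pathlib import Path

def recreate(as_directory: bool) -> bool:
    """Delete and recreate a temp path; report whether it ends up a directory."""
    base = Path(tempfile.mkdtemp())
    vd = base / "spark-class-server"   # hypothetical class-server directory
    vd.mkdir()
    shutil.rmtree(vd)                  # reset() deletes the directory...
    if as_directory:
        vd.mkdir(parents=True)         # ...workaround: recreate a directory
    else:
        vd.touch()                     # ...bug: create() makes a plain file
    result = vd.is_dir()
    shutil.rmtree(base)                # clean up the temp tree
    return result

print(recreate(False))  # False: later "is a directory" assertions would fail
print(recreate(True))   # True: the workaround restores a usable directory
```

Once the path is a plain file, any later check that expects a directory (like the assertion in the reported stack trace) fails.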






[jira] [Closed] (SPARK-10039) Resetting REPL state does not work

2015-08-31 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung closed SPARK-10039.
--
Resolution: Invalid

> Resetting REPL state does not work
> --
>
> Key: SPARK-10039
> URL: https://issues.apache.org/jira/browse/SPARK-10039
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.4.1
>Reporter: Kevin Jung
>Priority: Minor
>
> Spark shell can't find the base directory of the class server after running 
> the ":reset" command. 
> {quote}
> scala> :reset 
> scala> 1 
> uncaught exception during compilation: java.lang.AssertionError 
> java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
> '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
> ~~~impossible to run any command anymore, including 'exit'~~~ 
> {quote}
> I found that the reset() method in SparkIMain tries to delete virtualDirectory 
> and then create it again, but virtualDirectory.create() makes a file, not a 
> directory. Details here.
> {quote}
> drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> (After :reset)
> -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> {quote}
> "vd.delete; vd.givenPath.createDirectory(true);" will temporarily solve the 
> problem.






[jira] [Commented] (SPARK-10039) Resetting REPL state does not work

2015-08-27 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716265#comment-14716265
 ] 

Kevin Jung commented on SPARK-10039:


The suggested code is from lines 986 and 987 of SparkIMain.reset().
You can find it at 
https://github.com/apache/spark/blob/branch-1.4/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala

 Resetting REPL state does not work
 --

 Key: SPARK-10039
 URL: https://issues.apache.org/jira/browse/SPARK-10039
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
Reporter: Kevin Jung
Priority: Minor

 Spark shell can't find the base directory of the class server after running 
 the ":reset" command. 
 {quote}
 scala> :reset 
 scala> 1 
 uncaught exception during compilation: java.lang.AssertionError 
 java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
 '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
 ~~~impossible to run any command anymore, including 'exit'~~~ 
 {quote}
 I found that the reset() method in SparkIMain tries to delete virtualDirectory 
 and then create it again, but virtualDirectory.create() makes a file, not a 
 directory. Details here.
 {quote}
 drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 (After :reset)
 -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 {quote}
 vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
 problem.






[jira] [Updated] (SPARK-10039) Resetting REPL state does not work

2015-08-26 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-10039:
---
Summary: Resetting REPL state does not work  (was: Resetting REPL state not 
work)

 Resetting REPL state does not work
 --

 Key: SPARK-10039
 URL: https://issues.apache.org/jira/browse/SPARK-10039
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
Reporter: Kevin Jung
Priority: Minor

 Spark shell can't find the base directory of the class server after running 
 the ":reset" command. 
 {quote}
 scala> :reset 
 scala> 1 
 uncaught exception during compilation: java.lang.AssertionError 
 java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
 '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
 ~~~impossible to run any command anymore, including 'exit'~~~ 
 {quote}
 I found that the reset() method in SparkIMain tries to delete virtualDirectory 
 and then create it again, but virtualDirectory.create() makes a file, not a 
 directory. Details here.
 {quote}
 drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 (After :reset)
 -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 {quote}
 vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
 problem.






[jira] [Created] (SPARK-10039) Resetting REPL state not work

2015-08-16 Thread Kevin Jung (JIRA)
Kevin Jung created SPARK-10039:
--

 Summary: Resetting REPL state not work
 Key: SPARK-10039
 URL: https://issues.apache.org/jira/browse/SPARK-10039
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
Reporter: Kevin Jung


Spark shell can't find the base directory of the class server after running the 
:reset command. 

scala> :reset 
scala> 1 
uncaught exception during compilation: java.lang.AssertionError 
java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
~~~impossible to run any command anymore~~~ 

I found that the reset() method in SparkIMain tries to delete virtualDirectory and 
then create it again, but virtualDirectory.create() makes a file, not a directory. 
Details here.

-
drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
(After :reset)
-rw-rw-r--.   1 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
-

vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
problem.






[jira] [Updated] (SPARK-10039) Resetting REPL state not work

2015-08-16 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-10039:
---
Description: 
Spark shell can't find the base directory of the class server after running the 
:reset command. 
{quote}
scala> :reset 
scala> 1 
uncaught exception during compilation: java.lang.AssertionError 
java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
~~~impossible to run any command anymore, including 'exit'~~~ 
{quote}
I found that the reset() method in SparkIMain tries to delete virtualDirectory and 
then create it again, but virtualDirectory.create() makes a file, not a directory. 
Details here.
{quote}
drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
(After :reset)
-rw-rw-r--.   1 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
{quote}
vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
problem.

  was:
Spark shell can't find the base directory of the class server after running the 
:reset command. 

scala> :reset 
scala> 1 
uncaught exception during compilation: java.lang.AssertionError 
java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
~~~impossible to run any command anymore~~~ 

I found that the reset() method in SparkIMain tries to delete virtualDirectory and 
then create it again, but virtualDirectory.create() makes a file, not a directory. 
Details here.

-
drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
(After :reset)
-rw-rw-r--.   1 root root 0 2015-08-17 09:09 
spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
-

vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
problem.


 Resetting REPL state not work
 -

 Key: SPARK-10039
 URL: https://issues.apache.org/jira/browse/SPARK-10039
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
Reporter: Kevin Jung

 Spark shell can't find the base directory of the class server after running 
 the ":reset" command. 
 {quote}
 scala> :reset 
 scala> 1 
 uncaught exception during compilation: java.lang.AssertionError 
 java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
 '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
 ~~~impossible to run any command anymore, including 'exit'~~~ 
 {quote}
 I found that the reset() method in SparkIMain tries to delete virtualDirectory 
 and then create it again, but virtualDirectory.create() makes a file, not a 
 directory. Details here.
 {quote}
 drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 (After :reset)
 -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 {quote}
 vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
 problem.






[jira] [Updated] (SPARK-10039) Resetting REPL state not work

2015-08-16 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-10039:
---
Priority: Minor  (was: Major)

 Resetting REPL state not work
 -

 Key: SPARK-10039
 URL: https://issues.apache.org/jira/browse/SPARK-10039
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
Reporter: Kevin Jung
Priority: Minor

 Spark shell can't find the base directory of the class server after running 
 the ":reset" command. 
 {quote}
 scala> :reset 
 scala> 1 
 uncaught exception during compilation: java.lang.AssertionError 
 java.lang.AssertionError: assertion failed: Tried to find '$line33' in 
 '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory 
 ~~~impossible to run any command anymore, including 'exit'~~~ 
 {quote}
 I found that the reset() method in SparkIMain tries to delete virtualDirectory 
 and then create it again, but virtualDirectory.create() makes a file, not a 
 directory. Details here.
 {quote}
 drwxrwxr-x.   3 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 (After :reset)
 -rw-rw-r--.   1 root root 0 2015-08-17 09:09 
 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
 {quote}
 vd.delete; vd.givenPath.createDirectory(true); will temporarily solve the 
 problem.






[jira] [Created] (SPARK-5737) Scanning duplicate columns from parquet table

2015-02-11 Thread Kevin Jung (JIRA)
Kevin Jung created SPARK-5737:
-

 Summary: Scanning duplicate columns from parquet table
 Key: SPARK-5737
 URL: https://issues.apache.org/jira/browse/SPARK-5737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Kevin Jung


{quote}
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
import sqlContext._
val rdd = sqlContext.parquetFile("temp.parquet")
rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
{quote}

The results of the above code contain null values in the earlier copies of 
the duplicated columns.
For example,

{quote}
[null,-5.7,null,121.05]
[null,-61.17,null,108.91]
[null,50.60,null,72.15]
{quote}

This happens only in ParquetTableScan. PhysicalRDD works fine and the rows have 
the duplicated values, like...

{quote}
[-5.7,-5.7,121.05,121.05]
[-61.17,-61.17,108.91,108.91]
[50.60,50.60,72.15,72.15]
{quote}






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-10 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315341#comment-14315341
 ] 

Kevin Jung commented on SPARK-5081:
---

Xuefeng Wu mentioned one difference: the snappy version.

<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.0.5.3</version>
</dependency>

It was changed to 1.1.1.6 in Spark 1.2. We need to consider these two versions.
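Whether a codec (or codec-version) difference can account for larger shuffle files can be checked by compressing identical payloads with each candidate and comparing output sizes. snappy and lz4 bindings are not in the Python standard library, so the sketch below uses zlib at two settings purely as a stand-in to show the measurement; the payload is made up:

```python
import zlib

# A repetitive, shuffle-record-like payload (hypothetical data).
payload = b"".join(b"key=%d,value=%f;" % (i % 100, i * 0.5) for i in range(10000))

# Stand-ins for two codecs (or two versions of one codec): a fast/light
# setting vs a stronger one. Real shuffle files would use snappy or lz4.
fast = len(zlib.compress(payload, 1))
strong = len(zlib.compress(payload, 9))

print(f"raw={len(payload)}  fast-codec={fast}  strong-codec={strong}")
```

The point is only that different codecs or settings produce materially different on-disk sizes for the same records, which is why a codec version bump alone can move the "Shuffle Write" column in the UI.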



 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
 146.9MB in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value was 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger, 
 causing jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
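The regression in the tables above is easy to quantify: the sortBy shuffle write grows by about 1.5x on the ~155MB input and about 2.3x at ~100GB, i.e. it worsens with input size. A quick check of the arithmetic using the numbers reported in this issue:

```python
# sortBy-stage shuffle-write sizes reported in this issue.
small_11, small_12 = 98.1, 146.9   # MB, ~155MB input (1.1 vs 1.2)
large_11, large_12 = 39.7, 91.0    # GB, ~100GB input (1.1 vs 1.2)

print(f"~155MB input: {small_12 / small_11:.2f}x")  # -> 1.50x
print(f"~100GB input: {large_12 / large_11:.2f}x")  # -> 2.29x
```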






[jira] [Comment Edited] (SPARK-5081) Shuffle write increases

2015-02-10 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308620#comment-14308620
 ] 

Kevin Jung edited comment on SPARK-5081 at 2/11/15 12:56 AM:
-

To test under the same conditions, I set this to snappy for all Spark versions, 
but the problem still occurs. AFAIK, lz4 needs more CPU time than snappy but it 
has a better compression ratio.


was (Author: kallsu):
To test under the same condition, I set this to snappy for all spark version 
but this problem occurs. AFA I know, lz4 needs more CPU time than snappy but it 
has better compression ratio.

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
 146.9MB in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value was 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger, 
 causing jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308608#comment-14308608
 ] 

Kevin Jung commented on SPARK-5081:
---

Sorry, I will try to provide new code to reproduce this problem, because I 
don't have the old code anymore.

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
 146.9MB in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value was 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger, 
 causing jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308620#comment-14308620
 ] 

Kevin Jung commented on SPARK-5081:
---

To test under the same conditions, I set this to snappy for all Spark versions, 
but the problem still occurs. As far as I know, lz4 needs more CPU time than 
snappy but it has a better compression ratio.

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
 146.9MB in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value was 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger, 
 causing jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |






[jira] [Created] (SPARK-5081) Shuffle write increases

2015-01-04 Thread Kevin Jung (JIRA)
Kevin Jung created SPARK-5081:
-

 Summary: Shuffle write increases
 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung


The size of shuffle write shown in the Spark web UI is very different when I 
execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
146.9MB in Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value was 
changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
This can sharply increase disk I/O overhead as the input file gets bigger.
In the case of about 100GB of input, for example, the size of shuffle write is 
39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.






[jira] [Updated] (SPARK-5081) Shuffle write increases

2015-01-04 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-5081:
--
Description: 
The size of shuffle write shown in the Spark web UI is very different when I 
execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
146.9MB in Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value was 
changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
This can sharply increase disk I/O overhead as the input file gets bigger, 
causing jobs to take more time to complete. 
In the case of about 100GB of input, for example, the size of shuffle write is 
39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.

spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |

spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |

  was:
The size of shuffle write shown in the Spark web UI is very different when I 
execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
146.9MB in Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value was 
changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
This can sharply increase disk I/O overhead as the input file gets bigger.
In the case of about 100GB of input, for example, the size of shuffle write is 
39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.


 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data in both Spark 1.1 and 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 
 146.9MB in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value was 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger, 
 causing jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |






[jira] [Updated] (SPARK-5081) Shuffle write increases

2015-01-04 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-5081:
--
Description: 
The size of the shuffle write shown in the Spark web UI differs greatly when I 
execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. 
At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in 
Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value 
changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
This can increase disk I/O overhead substantially as the input file gets bigger.
For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 
but 91.0GB in Spark 1.2.

  was:
The size of shuffle write showing in spark web UI is much different when I 
execute same spark job with same input data in both spark 1.1 and spark 
1.2. 
At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 
but 146.9MB in spark 1.2. 
I set spark.shuffle.manager option to hash because it's default value is 
changed but spark 1.2 still writes shuffle output more than spark 1.1.
It can increase disk I/O overhead exponentially as the input file gets bigger.
In the case of about 100GB input, for example, the size of shuffle write is 
39.7GB in spark 1.1 but 91.0GB in spark 1.2.


 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of the shuffle write shown in the Spark web UI differs greatly when I 
 execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. 
 At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in 
 Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can increase disk I/O overhead substantially as the input file gets bigger.
 For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 1.1 
 but 91.0GB in Spark 1.2.






[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition

2014-09-03 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-3364:
--
Fix Version/s: 1.1.0

 Zip equal-length but unequally-partition
 

 Key: SPARK-3364
 URL: https://issues.apache.org/jira/browse/SPARK-3364
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Kevin Jung
 Fix For: 1.1.0


 ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
 partitions but unequal numbers of elements in each partition.
 This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
 on a physically unbalanced HDFS file.
 {noformat}
 var x = sc.parallelize(1 to 9,3)
 var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
 var z = y.partitionBy(new RangePartitioner(3,y))
 expected
 x.zip(y).count()
 9
 x.zip(y).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
 (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
 unexpected
 x.zip(z).count()
 7
 x.zip(z).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
 (5,(2,2)), (7,(3,3)), (8,(3,3)))
 {noformat}
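The truncation can be modeled without a cluster: ZippedRDD pairs partitions positionally and zips each pair element-wise, so a shorter partition silently drops its counterpart's surplus elements. A plain-Scala sketch of that semantics (the partition contents below mirror the example above but are a simplified model, not Spark code):

```scala
// Model each RDD as a Seq of partitions. x is balanced (3+3+3 = 9 elements);
// z holds the same 9 elements but range-partitioned unevenly (5+2+2).
val xParts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
val zParts = Seq(Seq(1, 1, 1, 1, 1), Seq(2, 2), Seq(3, 3))

// Zip partition-by-partition, then element-by-element within each pair.
// Seq.zip truncates to the shorter side -- this is where elements vanish.
val zipped = xParts.zip(zParts).flatMap { case (a, b) => a.zip(b) }

println(zipped.length) // 7, not 9: elements 6 and 9 of x are dropped
```

Note how the surviving pairs match the unexpected `collect()` output in the report: the second partition of x contributes only (4,2) and (5,2), and the third only (7,3) and (8,3).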






[jira] [Updated] (SPARK-3364) zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-3364:
--
Description: 
ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
partitions but unequal numbers of elements in each partition.
This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
on a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9,3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3,y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
(5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
(5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}

  was:
ZippedRDD losts some elements after zipping RDDs with equal numbers of 
partitions but unequal numbers of elements in their each partitions.
This can happen when a user create RDD by sc.textFile(path,partitionNumbers) 
with physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9,3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3,y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
(5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
(5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}


 zip equal-length but unequally-partition
 

 Key: SPARK-3364
 URL: https://issues.apache.org/jira/browse/SPARK-3364
 Project: Spark
  Issue Type: Bug
Reporter: Kevin Jung
Priority: Critical

 ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
 partitions but unequal numbers of elements in each partition.
 This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
 on a physically unbalanced HDFS file.
 {noformat}
 var x = sc.parallelize(1 to 9,3)
 var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
 var z = y.partitionBy(new RangePartitioner(3,y))
 expected
 x.zip(y).count()
 9
 x.zip(y).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
 (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
 unexpected
 x.zip(z).count()
 7
 x.zip(z).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
 (5,(2,2)), (7,(3,3)), (8,(3,3)))
 {noformat}






[jira] [Updated] (SPARK-3364) zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-3364:
--
Priority: Major  (was: Critical)

 zip equal-length but unequally-partition
 

 Key: SPARK-3364
 URL: https://issues.apache.org/jira/browse/SPARK-3364
 Project: Spark
  Issue Type: Bug
Reporter: Kevin Jung

 ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
 partitions but unequal numbers of elements in each partition.
 This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
 on a physically unbalanced HDFS file.
 {noformat}
 var x = sc.parallelize(1 to 9,3)
 var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
 var z = y.partitionBy(new RangePartitioner(3,y))
 expected
 x.zip(y).count()
 9
 x.zip(y).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
 (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
 unexpected
 x.zip(z).count()
 7
 x.zip(z).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
 (5,(2,2)), (7,(3,3)), (8,(3,3)))
 {noformat}






[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-3364:
--
Summary: Zip equal-length but unequally-partition  (was: zip equal-length 
but unequally-partition)

 Zip equal-length but unequally-partition
 

 Key: SPARK-3364
 URL: https://issues.apache.org/jira/browse/SPARK-3364
 Project: Spark
  Issue Type: Bug
Reporter: Kevin Jung

 ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
 partitions but unequal numbers of elements in each partition.
 This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
 on a physically unbalanced HDFS file.
 {noformat}
 var x = sc.parallelize(1 to 9,3)
 var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
 var z = y.partitionBy(new RangePartitioner(3,y))
 expected
 x.zip(y).count()
 9
 x.zip(y).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
 (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
 unexpected
 x.zip(z).count()
 7
 x.zip(z).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
 (5,(2,2)), (7,(3,3)), (8,(3,3)))
 {noformat}






[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung updated SPARK-3364:
--
Affects Version/s: (was: 1.0.2)

 Zip equal-length but unequally-partition
 

 Key: SPARK-3364
 URL: https://issues.apache.org/jira/browse/SPARK-3364
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Kevin Jung

 ZippedRDD loses some elements after zipping RDDs that have equal numbers of 
 partitions but unequal numbers of elements in each partition.
 This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) 
 on a physically unbalanced HDFS file.
 {noformat}
 var x = sc.parallelize(1 to 9,3)
 var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i => i)
 var z = y.partitionBy(new RangePartitioner(3,y))
 expected
 x.zip(y).count()
 9
 x.zip(y).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), 
 (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
 unexpected
 x.zip(z).count()
 7
 x.zip(z).collect()
 Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), 
 (5,(2,2)), (7,(3,3)), (8,(3,3)))
 {noformat}
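A common way to sidestep this alignment hazard (a sketch of one possible workaround, not something proposed in the report) is to key both RDDs by a global index and join, so alignment no longer depends on matching per-partition element counts:

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

// Assumes x and z are equal-length RDDs as in the example above.
// zipWithIndex assigns each element its global position across partitions.
val xi = x.zipWithIndex().map(_.swap) // (index, element)
val zi = z.zipWithIndex().map(_.swap)

// Joining on the index realigns the elements regardless of partitioning;
// sortByKey restores the original order before dropping the index.
val aligned = xi.join(zi).sortByKey().values
```

This is slower than zip, since it shuffles both RDDs, but it is insensitive to how the elements are distributed across partitions.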


