[jira] [Commented] (SPARK-5737) Scanning duplicate columns from parquet table
[ https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973523#comment-14973523 ]

Kevin Jung commented on SPARK-5737:
-----------------------------------

Based on your comment, this can be marked as resolved. Thanks.

> Scanning duplicate columns from parquet table
> ---------------------------------------------
>
>                 Key: SPARK-5737
>                 URL: https://issues.apache.org/jira/browse/SPARK-5737
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kevin Jung
>             Fix For: 1.5.1
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code have null values in the first column of each
> duplicated pair. For example:
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine, and the rows
> have duplicated values, like:
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5737) Scanning duplicate columns from parquet table
[ https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung resolved SPARK-5737.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.1
[jira] [Commented] (SPARK-10039) Resetting REPL state does not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723180#comment-14723180 ]

Kevin Jung commented on SPARK-10039:
------------------------------------

I quickly looked over the Scala sources on GitHub to investigate this. Scala has not only PlainFile but also PlainDirectory, so virtualDirectory should be an instance of PlainDirectory. However, PlainDirectory extends PlainFile, so the create() method that Spark uses always creates a file. This looks like a bug in Scala, so I am closing this issue.

> Resetting REPL state does not work
> ----------------------------------
>
>                 Key: SPARK-10039
>                 URL: https://issues.apache.org/jira/browse/SPARK-10039
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 1.4.1
>            Reporter: Kevin Jung
>            Priority: Minor
>
> Spark shell can't find the base directory of its class server after running
> the ":reset" command.
> {quote}
> scala> :reset
> scala> 1
> uncaught exception during compilation: java.lang.AssertionError
> java.lang.AssertionError: assertion failed: Tried to find '$line33' in
> '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory
> ~~~impossible to run any command anymore, including 'exit'~~~
> {quote}
> The reset() method in SparkIMain tries to delete virtualDirectory and then
> create it again, but virtualDirectory.create() makes a file, not a
> directory. Details here:
> {quote}
> drwxrwxr-x. 3 root root 0 2015-08-17 09:09
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> (After :reset)
> -rw-rw-r--. 1 root root 0 2015-08-17 09:09
> spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
> {quote}
> "vd.delete; vd.givenPath.createDirectory(true);" will temporarily solve the
> problem.
[jira] [Closed] (SPARK-10039) Resetting REPL state does not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung closed SPARK-10039.
------------------------------
    Resolution: Invalid
[jira] [Commented] (SPARK-10039) Resetting REPL state does not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716265#comment-14716265 ]

Kevin Jung commented on SPARK-10039:
------------------------------------

The suggested code corresponds to lines 986 and 987 of SparkIMain.reset(). You can find it at https://github.com/apache/spark/blob/branch-1.4/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala
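For illustration only, here is a minimal sketch of the workaround quoted in this ticket ("vd.delete; vd.givenPath.createDirectory(true)"). The helper name is hypothetical and this is not the merged patch; it only shows the idea of recreating the path explicitly as a directory instead of relying on create(), which (via PlainFile) produces a regular file:

```scala
// Hedged sketch of the suggested workaround (hypothetical helper, not the
// actual SparkIMain code): delete the stale class-server entry, then
// recreate the path explicitly as a directory rather than calling create().
import scala.reflect.io.{AbstractFile, Path}

def resetClassServerDir(vd: AbstractFile): Unit = {
  vd.delete()                                  // remove the old entry (file or dir)
  Path(vd.path).createDirectory(force = true)  // recreate it as a directory
}
```

The key point is that createDirectory makes a directory node on disk, so subsequent compilations can place `$lineNN` subdirectories under it again.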
[jira] [Updated] (SPARK-10039) Resetting REPL state does not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-10039:
-------------------------------
    Summary: Resetting REPL state does not work  (was: Resetting REPL state not work)
[jira] [Created] (SPARK-10039) Resetting REPL state not work
Kevin Jung created SPARK-10039:
-------------------------------

             Summary: Resetting REPL state not work
                 Key: SPARK-10039
                 URL: https://issues.apache.org/jira/browse/SPARK-10039
             Project: Spark
          Issue Type: Bug
          Components: Spark Shell
    Affects Versions: 1.4.1
            Reporter: Kevin Jung

Spark shell can't find the base directory of its class server after running the ":reset" command.

scala> :reset
scala> 1
uncaught exception during compilation: java.lang.AssertionError
java.lang.AssertionError: assertion failed: Tried to find '$line33' in '/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory
~~~impossible to run any command anymore~~~

The reset() method in SparkIMain tries to delete virtualDirectory and then create it again, but virtualDirectory.create() makes a file, not a directory. Details here:

-----
drwxrwxr-x. 3 root root 0 2015-08-17 09:09 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
(After :reset)
-rw-rw-r--. 1 root root 0 2015-08-17 09:09 spark-9cfc6b06-c902-4caf-8712-9ea63f17d017
-----

"vd.delete; vd.givenPath.createDirectory(true);" will temporarily solve the problem.
[jira] [Updated] (SPARK-10039) Resetting REPL state not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-10039:
-------------------------------
    Description: updated to wrap the console output in {quote} blocks and to note that 'exit' is also affected; the text is otherwise unchanged.
[jira] [Updated] (SPARK-10039) Resetting REPL state not work
[ https://issues.apache.org/jira/browse/SPARK-10039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-10039:
-------------------------------
    Priority: Minor  (was: Major)
[jira] [Created] (SPARK-5737) Scanning duplicate columns from parquet table
Kevin Jung created SPARK-5737:
------------------------------

             Summary: Scanning duplicate columns from parquet table
                 Key: SPARK-5737
                 URL: https://issues.apache.org/jira/browse/SPARK-5737
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
            Reporter: Kevin Jung

{quote}
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
import sqlContext._
val rdd = sqlContext.parquetFile("temp.parquet")
rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
{quote}
The results of the above code have null values in the first column of each duplicated pair. For example:
{quote}
[null,-5.7,null,121.05]
[null,-61.17,null,108.91]
[null,50.60,null,72.15]
{quote}
This happens only in ParquetTableScan. PhysicalRDD works fine, and the rows have duplicated values, like:
{quote}
[-5.7,-5.7,121.05,121.05]
[-61.17,-61.17,108.91,108.91]
[50.60,50.60,72.15,72.15]
{quote}
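As an illustrative sketch only (using the later SparkSession/DataFrame API rather than the 1.2-era SQLContext from the report, and a hypothetical in-memory DataFrame in place of "temp.parquet"), aliasing each projection is a common way to keep duplicated column selections distinct, so no two projections resolve to the same output attribute:

```scala
// Hedged workaround sketch: give each duplicated projection its own alias so
// the scan cannot confuse two references to the same underlying column.
// The object name and column aliases are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DuplicateColumnSelect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dup-cols").getOrCreate()
    import spark.implicits._
    // Stand-in for sqlContext.parquetFile("temp.parquet") in the report.
    val df = Seq((-5.7, 121.05), (-61.17, 108.91), (50.60, 72.15)).toDF("d1", "d2")
    df.select(col("d1").as("d1_a"), col("d1").as("d1_b"),
              col("d2").as("d2_a"), col("d2").as("d2_b"))
      .take(3).foreach(println)
    spark.stop()
  }
}
```

With distinct aliases, each of the four output columns carries its value regardless of which physical scan operator is chosen.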
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315341#comment-14315341 ]

Kevin Jung commented on SPARK-5081:
-----------------------------------

Xuefeng Wu mentioned one difference: the snappy version.

{noformat}
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.0.5.3</version>
</dependency>
{noformat}

It was changed to 1.1.1.6 in Spark 1.2. We need to consider these two.

Shuffle write increases
-----------------------

                Key: SPARK-5081
                URL: https://issues.apache.org/jira/browse/SPARK-5081
            Project: Spark
         Issue Type: Bug
         Components: Shuffle
   Affects Versions: 1.2.0
           Reporter: Kevin Jung

The shuffle write size shown in the Spark web UI differs greatly when I run the same Spark job on the same input data under Spark 1.1 and Spark 1.2. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This can sharply increase disk I/O overhead as the input file gets bigger, causing jobs to take longer to complete. With roughly 100GB of input, for example, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |

spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
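To rule out codec and shuffle-manager differences when comparing the two versions, both settings discussed in this ticket can be pinned explicitly. A configuration sketch (property names from Spark's configuration documentation; values as used by the reporter):

```
# spark-defaults.conf: pin the compression codec and shuffle manager so that
# Spark 1.1 and Spark 1.2 runs are measured under identical settings.
spark.io.compression.codec   snappy
spark.shuffle.manager        hash
```

With both runs forced to snappy and the hash shuffle manager, any remaining difference in shuffle write size points at the snappy-java version bump or the shuffle write path itself.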
[jira] [Comment Edited] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308620#comment-14308620 ]

Kevin Jung edited comment on SPARK-5081 at 2/11/15 12:56 AM:
-------------------------------------------------------------

To test under the same conditions, I set the codec to snappy for every Spark version, but the problem still occurs. AFAIK, lz4 needs more CPU time than snappy but has a better compression ratio.

was (Author: kallsu):
To test under the same condition, I set this to snappy for all spark version but this problem occurs. AFA I know, lz4 needs more CPU time than snappy but it has better compression ratio.
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308608#comment-14308608 ]

Kevin Jung commented on SPARK-5081:
-----------------------------------

Sorry, I don't have the old code anymore, so I will put together another reproduction of this problem.
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308620#comment-14308620 ]

Kevin Jung commented on SPARK-5081:
-----------------------------------

To test under the same conditions, I set the codec to snappy for every Spark version, but the problem still occurs. AFAIK, lz4 needs more CPU time than snappy but has a better compression ratio.
[jira] [Created] (SPARK-5081) Shuffle write increases
Kevin Jung created SPARK-5081:
------------------------------

             Summary: Shuffle write increases
                 Key: SPARK-5081
                 URL: https://issues.apache.org/jira/browse/SPARK-5081
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.2.0
            Reporter: Kevin Jung

The shuffle write size shown in the Spark web UI differs greatly when I run the same Spark job on the same input data under Spark 1.1 and Spark 1.2. At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. This can sharply increase disk I/O overhead as the input file gets bigger. With roughly 100GB of input, for example, the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
[jira] [Updated] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-5081:
------------------------------
    Description: updated to append the per-stage tables (Stage Id / Description / Input / Shuffle Read / Shuffle Write) for Spark 1.1 and Spark 1.2, and to note the longer job completion times.
[jira] [Updated] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-5081:
------------------------------

Description:
The size of shuffle write shown in the Spark web UI is very different when I execute the same Spark job with the same input data on both Spark 1.1 and Spark 1.2. At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed in 1.2, but Spark 1.2 still writes more shuffle output than Spark 1.1. This can sharply increase disk I/O overhead as the input file gets bigger. In the case of about 100GB of input, for example, the size of shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.

was:
The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.

Shuffle write increases
-----------------------

Key: SPARK-5081
URL: https://issues.apache.org/jira/browse/SPARK-5081
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

The size of shuffle write shown in the Spark web UI is very different when I execute the same Spark job with the same input data on both Spark 1.1 and Spark 1.2. At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed in 1.2, but Spark 1.2 still writes more shuffle output than Spark 1.1. This can sharply increase disk I/O overhead as the input file gets bigger. In the case of about 100GB of input, for example, the size of shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
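For readers reproducing this: the shuffle manager the reporter mentions is chosen by configuration (sort became the default in Spark 1.2.0, replacing hash). A minimal sketch of the relevant spark-defaults.conf entry; the equivalent submit-time form is --conf spark.shuffle.manager=hash:

```
# spark-defaults.conf: select the pre-1.2 hash-based shuffle manager
spark.shuffle.manager    hash
```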
[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-3364:
------------------------------

Fix Version/s: 1.1.0

Zip equal-length but unequally-partition
----------------------------------------

Key: SPARK-3364
URL: https://issues.apache.org/jira/browse/SPARK-3364
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.2
Reporter: Kevin Jung
Fix For: 1.1.0

ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
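The partition skew that triggers the element loss can be sketched without Spark. Assuming (hypothetically, not from the actual RangePartitioner code) that a 3-way range partitioner over the keys above sends key k to partition k - 1, the repartitioned side ends up with 5/2/2 elements per partition while x keeps 3/3/3:

```scala
// Hypothetical model of how range partitioning skews partition sizes.
// Assumption: with 3 partitions over keys {1, 2, 3}, key k lands in partition k - 1.
val keys = Seq(1, 1, 1, 1, 1, 2, 2, 3, 3)
val partitionSizes = keys
  .groupBy(k => k - 1)  // partition id -> keys assigned to it
  .toSeq.sortBy(_._1)   // order by partition id
  .map(_._2.size)       // element count per partition
assert(partitionSizes == Seq(5, 2, 2)) // skewed 5/2/2 vs x's balanced 3/3/3
```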
[jira] [Updated] (SPARK-3364) zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-3364:
------------------------------

Description:
ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}

was:
ZippedRDD losts some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in their each partitions. This can happen when a user create RDD by sc.textFile(path,partitionNumbers) with physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}

zip equal-length but unequally-partition
----------------------------------------

Key: SPARK-3364
URL: https://issues.apache.org/jira/browse/SPARK-3364
Project: Spark
Issue Type: Bug
Reporter: Kevin Jung
Priority: Critical

ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
[jira] [Updated] (SPARK-3364) zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-3364:
------------------------------

Priority: Major (was: Critical)

zip equal-length but unequally-partition
----------------------------------------

Key: SPARK-3364
URL: https://issues.apache.org/jira/browse/SPARK-3364
Project: Spark
Issue Type: Bug
Reporter: Kevin Jung

ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-3364:
------------------------------

Summary: Zip equal-length but unequally-partition (was: zip equal-length but unequally-partition)

Zip equal-length but unequally-partition
----------------------------------------

Key: SPARK-3364
URL: https://issues.apache.org/jira/browse/SPARK-3364
Project: Spark
Issue Type: Bug
Reporter: Kevin Jung

ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-3364:
------------------------------

Affects Version/s: (was: 1.0.2)

Zip equal-length but unequally-partition
----------------------------------------

Key: SPARK-3364
URL: https://issues.apache.org/jira/browse/SPARK-3364
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Kevin Jung

ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) over a physically unbalanced HDFS file.

{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
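To make the failure mode concrete, here is a minimal Spark-free model of zip's per-partition semantics: partition i of one side is paired with partition i of the other, and a plain Seq#zip truncates to the shorter side, matching the silent element loss reported above. This is an illustrative sketch, not Spark's ZippedRDD implementation:

```scala
// Model RDD partitions as plain sequences (hypothetical simplification, no Spark needed).
// Partition i of `left` is zipped with partition i of `right`;
// Seq#zip silently drops whatever the longer side has left over.
def zipByPartition[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[(A, B)] =
  left.zip(right).flatMap { case (lp, rp) => lp.zip(rp) }

val x = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)) // balanced: 3 + 3 + 3 elements
val y = Seq(Seq(1, 1, 1), Seq(1, 1, 2), Seq(2, 3, 3)) // balanced: same counts as x
val z = Seq(Seq(1, 1, 1, 1, 1), Seq(2, 2), Seq(3, 3)) // range-partitioned: 5 + 2 + 2

assert(zipByPartition(x, y).size == 9) // balanced partitions: nothing lost
assert(zipByPartition(x, z).size == 7) // skewed partitions: 2 elements silently dropped
```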