[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228136#comment-14228136 ] Takeshi Yamamuro commented on SPARK-1987:

What is the status of this patch? It is related to an issue I created (https://issues.apache.org/jira/browse/SPARK-4646). I refactored this patch on top of mine; the result is here: https://github.com/maropu/spark/commit/77e34424a5e6cf2bfd6300ab35f329bdaba6e775 Thanks :)

More memory-efficient graph construction
Key: SPARK-1987
URL: https://issues.apache.org/jira/browse/SPARK-1987
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should take only 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge takes about 40 additional bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays.
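As a rough sketch of the proposed fix: a sort can operate on the three parallel arrays directly by making both the comparison and the swap index-based, so no per-edge Edge object is ever allocated. Everything below is illustrative, not the actual EdgePartitionBuilder code:

{code}
// Sketch: in-place quicksort over three parallel edge arrays, ordered by
// (srcId, dstId). Swapping moves the entries of all three arrays together,
// so sorting needs no intermediate Array[Edge].
object ParallelEdgeSort {
  def sort(src: Array[Long], dst: Array[Long], attr: Array[Int]): Unit =
    quicksort(src, dst, attr, 0, src.length - 1)

  private def less(src: Array[Long], dst: Array[Long], i: Int, j: Int): Boolean =
    src(i) < src(j) || (src(i) == src(j) && dst(i) < dst(j))

  private def swap(src: Array[Long], dst: Array[Long], attr: Array[Int],
                   i: Int, j: Int): Unit = {
    val s = src(i); src(i) = src(j); src(j) = s
    val d = dst(i); dst(i) = dst(j); dst(j) = d
    val a = attr(i); attr(i) = attr(j); attr(j) = a
  }

  private def quicksort(src: Array[Long], dst: Array[Long], attr: Array[Int],
                        lo: Int, hi: Int): Unit = {
    if (lo < hi) {
      var p = lo // Lomuto partition around the last element as pivot
      var i = lo
      while (i < hi) {
        if (less(src, dst, i, hi)) { swap(src, dst, attr, i, p); p += 1 }
        i += 1
      }
      swap(src, dst, attr, p, hi)
      quicksort(src, dst, attr, lo, p - 1)
      quicksort(src, dst, attr, p + 1, hi)
    }
  }
}
{code}

The patch linked above takes a similar approach using a timsort variant, which behaves better on partially sorted input.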
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228143#comment-14228143 ] Larry Xiao commented on SPARK-1987:

[~maropu] I think it needs a slight change in the build system. I saw your patch; cool idea, I didn't know about timsort before, and your code looks very clear. :)

More memory-efficient graph construction (SPARK-1987; full description above)
[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4645:

Description:

Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default, and so does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well for normal JDBC clients like BeeLine, but throws an exception when using Simba ODBC driver v0.1.. The Simba ODBC driver tries to execute two statements while connecting to Spark SQL HiveThriftServer2:

- {{use `default`}}
- {{set -v}}

However, HiveThriftServer2 throws an exception when executing them:

{code}
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space
	at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
	at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
	at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
	at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
	at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
	at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
	at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
	at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
	at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}

was: the same description, except that it did not state the Simba ODBC driver version.
[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4645:

Description: updated to state the Simba ODBC driver version as v1.0.0.1000 (was: v0.1..). The description and stack trace are otherwise identical to the previous update.
[jira] [Resolved] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log
[ https://issues.apache.org/jira/browse/SPARK-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4619.

Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: maji2014

Double ms in ShuffleBlockFetcherIterator log
Key: SPARK-4619
URL: https://issues.apache.org/jira/browse/SPARK-4619
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.2
Reporter: maji2014
Assignee: maji2014
Priority: Minor
Fix For: 1.2.0

The log reads as follows:

{code}
ShuffleBlockFetcherIterator: Got local blocks in 8 ms ms
{code}

The reason is:

{code}
logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms")
{code}
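The duplication arises because {{Utils.getUsedTimeMs}} already appends the unit to the elapsed time it returns, so the call site adds a second "ms". A minimal sketch of the pattern and the one-line fix (the helper body is paraphrased, not copied from Spark's Utils):

{code}
// Paraphrased helper: the returned string already ends in " ms", e.g. "8 ms".
def getUsedTimeMs(startTimeMs: Long): String =
  (System.currentTimeMillis - startTimeMs) + " ms"

// Buggy call site: appends the unit a second time -> "Got local blocks in 8 ms ms"
logInfo("Got local blocks in " + getUsedTimeMs(startTime) + " ms")

// Fixed call site: let the helper supply the unit.
logInfo("Got local blocks in " + getUsedTimeMs(startTime))
{code}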
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442:

Attachment: (was: Window Function.pdf)

Add Window function support
Key: SPARK-1442
URL: https://issues.apache.org/jira/browse/SPARK-1442
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Chengxiang Li
Attachments: Window Function.pdf

Similar to Hive, add window function support for Catalyst.
https://issues.apache.org/jira/browse/HIVE-4197
https://issues.apache.org/jira/browse/HIVE-896
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228208#comment-14228208 ] Takeshi Yamamuro commented on SPARK-1987:

Thanks for your review! :)) What's the change in the build system? Anyway, if there is no problem, I'll send a PR. Thanks again. takeshi

More memory-efficient graph construction (SPARK-1987; full description above)
[jira] [Created] (SPARK-4647) yarn-client mode reports success even though job fails
carlmartin created SPARK-4647:

Summary: yarn-client mode reports success even though job fails
Key: SPARK-4647
URL: https://issues.apache.org/jira/browse/SPARK-4647
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: carlmartin

YARN's web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode.
[jira] [Commented] (SPARK-3293) YARN's web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228253#comment-14228253 ] carlmartin commented on SPARK-3293:

[~tgraves] [~andrewor14] It seems that SPARK-3627 did not fix this problem when using yarn-client mode, so I will do this work.

YARN's web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode
Key: SPARK-3293
URL: https://issues.apache.org/jira/browse/SPARK-3293
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.0.2, 1.1.0
Reporter: wangfei
Assignee: Guoqiang Li
Fix For: 1.2.0

If an exception occurs, the YARN web UI's Applications FinalStatus shows SUCCEEDED instead of the expected FAILED. In the spark-1.0.2 release only yarn-client mode showed this, but recently yarn-cluster mode has become a problem as well. To reproduce it: create a new SparkContext, throw an exception, then check the applications page of the YARN web UI.
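A minimal sketch of that reproduction, assuming a standard submission with {{--master yarn-client}} (the app name and exception are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Repro sketch: the driver fails on purpose, yet YARN's web UI records
// the application's FinalStatus as SUCCEEDED.
object YarnClientFailureRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("YarnClientFailureRepro"))
    try {
      // Fail the driver after the YARN application has started.
      throw new RuntimeException("deliberate driver failure")
    } finally {
      sc.stop()
    }
  }
}
{code}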
[jira] [Commented] (SPARK-3293) YARN's web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228256#comment-14228256 ] Apache Spark commented on SPARK-3293:

User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3508

YARN's web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode (SPARK-3293; full description above)
[jira] [Commented] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager
[ https://issues.apache.org/jira/browse/SPARK-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228278#comment-14228278 ] Apache Spark commented on SPARK-4641:

User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3509

A FileNotFoundException happened in Hash Shuffle Manager
Key: SPARK-4641
URL: https://issues.apache.org/jira/browse/SPARK-4641
Project: Spark
Issue Type: Bug
Components: Input/Output, Shuffle
Environment: A WordCount example with some special text input (normal words text)
Reporter: carlmartin

Using hash shuffle without consolidateFiles, it throws an exception like:

{code}
java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0)
Caused by: java.io.FileNotFoundException: (No such file or directory)
{code}

And using hash shuffle with consolidateFiles, it throws another exception:

{code}
java.io.IOException: PARSING_ERROR(2)
	at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
{code}
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228279#comment-14228279 ] Lianhui Wang commented on SPARK-4644:

Hi @Shixiong Zhu, with skewed data, can we use a broadcast join to implement it? I think the performance of a broadcast join is much higher. At the end we can merge the result of the broadcast join with that of the common join.

Implement skewed join
Key: SPARK-4644
URL: https://issues.apache.org/jira/browse/SPARK-4644
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Shixiong Zhu
Attachments: Skewed Join Design Doc.pdf

Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve it, we propose a skewed join implementation.
[jira] [Commented] (SPARK-4002) KafkaStreamSuite Kafka input stream case fails on OSX
[ https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228307#comment-14228307 ] Ryan Williams commented on SPARK-4002:

This still occurs for me every time I run this test, fwiw.

KafkaStreamSuite Kafka input stream case fails on OSX
Key: SPARK-4002
URL: https://issues.apache.org/jira/browse/SPARK-4002
Project: Spark
Issue Type: Bug
Components: Streaming
Environment: Mac OSX 10.9.5
Reporter: Ryan Williams
Attachments: unit-tests.log

[~sowen] mentioned this on spark-dev [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E] and I just reproduced it on {{master}} ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]). The relevant output I get when running {{./dev/run-tests}} is:

{code}
[info] KafkaStreamSuite:
[info] - Kafka input stream *** FAILED ***
[info]   3 did not equal 0 (KafkaStreamSuite.scala:135)
[info] Test run started
[info] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
[error] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: junit.framework.AssertionFailedError: expected:<3> but was:<0>
[error]     at junit.framework.Assert.fail(Assert.java:50)
[error]     at junit.framework.Assert.failNotEquals(Assert.java:287)
[error]     at junit.framework.Assert.assertEquals(Assert.java:67)
[error]     at junit.framework.Assert.assertEquals(Assert.java:199)
[error]     at junit.framework.Assert.assertEquals(Assert.java:205)
[error]     at org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
[error]     ...
[info] Test run finished: 1 failed, 0 ignored, 1 total, 14.451s
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1g; support was removed in 8.0
[info] ScalaTest
[info] Run completed in 11 minutes, 39 seconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed: Total 2, Failed 2, Errors 0, Passed 0
[error] Failed tests:
[error]     org.apache.spark.streaming.kafka.JavaKafkaStreamSuite
[error]     org.apache.spark.streaming.kafka.KafkaStreamSuite
{code}

The simplest command I know of that reproduces this test failure is:

{code}
mvn test -Dsuites='*KafkaStreamSuite'
{code}

Often I have to {{mvn clean}} before or as part of running that command, otherwise I get other spurious compile errors or crashes, but that is another story. It seems like this test should be {{@Ignore}}'d, or some note about it should be made in the {{README.md}}.
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228308#comment-14228308 ] Shixiong Zhu commented on SPARK-4644:

I disagree with using `broadcast join` because:

1. `broadcast join` is in Spark SQL. It's not convenient for people who only want to use Spark Core. Some users (such as ALS in mllib) have already used `join` from Spark Core, and I don't think forcing users to rewrite them with Spark SQL is a good idea.
2. `broadcast join` assumes only one of the two tables has skewed keys. If both tables have skewed keys, how do we handle it?

I only know a little about Spark SQL. Please let me know if there is any mistake.

Implement skewed join (SPARK-4644; full description above)
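As a rough sketch of the skew-handling idea discussed above, at the Spark Core RDD level: rows with known skewed keys are joined by broadcasting the matching rows of the other side, everything else goes through an ordinary shuffle join, and the two results are unioned. All names are illustrative, and this assumes (as the comment notes) that only one side is skewed; it is not the design-doc implementation:

{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

// Sketch: join `left` with `right` when a known set of keys on the left is
// heavily skewed. The skewed keys avoid the shuffle by broadcasting the
// (assumed small) matching rows of `right`; the rest use a normal join.
def skewAwareJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    sc: SparkContext,
    left: RDD[(K, V)],
    right: RDD[(K, W)],
    skewedKeys: Set[K]): RDD[(K, (V, W))] = {

  // Matching right-side rows for the skewed keys, collected and broadcast.
  val skewedRight = sc.broadcast(
    right.filter { case (k, _) => skewedKeys.contains(k) }
      .groupByKey()
      .collectAsMap())

  // Map-side join for the skewed keys: no shuffle of the huge left groups.
  val skewedPart = left
    .filter { case (k, _) => skewedKeys.contains(k) }
    .flatMap { case (k, v) =>
      skewedRight.value.getOrElse(k, Nil).map(w => (k, (v, w)))
    }

  // Ordinary shuffle join for everything else.
  val normalPart = left
    .filter { case (k, _) => !skewedKeys.contains(k) }
    .join(right.filter { case (k, _) => !skewedKeys.contains(k) })

  skewedPart.union(normalPart)
}
{code}

This sketch also illustrates point 2 above: if both sides were skewed, the collected map itself could no longer fit in memory.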
[jira] [Created] (SPARK-4648) Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.
Ravindra Pesala created SPARK-4648:

Summary: Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.
Key: SPARK-4648
URL: https://issues.apache.org/jira/browse/SPARK-4648
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

Currently HiveQL uses a Hive UDF for Coalesce. Using Hive UDFs is usually memory intensive. Since a Coalesce function is already available in Spark, we can make use of it, and also support the Coalesce function in Spark SQL.
[jira] [Commented] (SPARK-4648) Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228312#comment-14228312 ] Apache Spark commented on SPARK-4648:

User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/3510

Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL. (SPARK-4648; full description above)
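For reference, Coalesce returns its first non-null argument, so the native expression only needs to evaluate its children in order. A quick illustration of the query shape involved (the table and column names are made up, and {{sqlContext}} is assumed to exist):

{code}
// COALESCE picks the first non-null argument, here used to fill in a
// display name from whichever column happens to be populated.
val result = sqlContext.sql(
  "SELECT name, COALESCE(nickname, name, 'anonymous') AS display_name FROM users")
result.collect().foreach(println)
{code}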
[jira] [Updated] (SPARK-3182) Twitter Streaming Geolocation Filter
[ https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3182:

Fix Version/s: (was: 1.2.0)

Twitter Streaming Geolocation Filter
Key: SPARK-3182
URL: https://issues.apache.org/jira/browse/SPARK-3182
Project: Spark
Issue Type: Wish
Components: Streaming
Affects Versions: 1.0.0, 1.0.2
Reporter: Daniel Kershaw
Labels: features
Original Estimate: 24h
Remaining Estimate: 24h

Add a geolocation filter to the Twitter Streaming component. This should take a sequence of doubles to indicate the bounding box for the stream.
[jira] [Updated] (SPARK-3182) Twitter Streaming Geolocation Filter
[ https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3182:

Affects Version/s: (was: 1.0.2) (was: 1.0.0)

Twitter Streaming Geolocation Filter (SPARK-3182; full description above)
[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4645:

Assignee: Cheng Lian

Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
Key: SPARK-4645
URL: https://issues.apache.org/jira/browse/SPARK-4645
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

(Full description and stack trace above.)
[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4632:

Target Version/s: 1.3.0 (was: 1.2.0)

Upgrade MQTT dependency to use latest mqtt-client
Key: SPARK-4632
URL: https://issues.apache.org/jira/browse/SPARK-4632
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

mqtt-client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking the Spark build.
[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4632:

Priority: Critical (was: Blocker)

Upgrade MQTT dependency to use latest mqtt-client (SPARK-4632; full description above)
[jira] [Resolved] (SPARK-4643) Remove unneeded staging repositories from build
[ https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4643.

Resolution: Fixed
Fix Version/s: 1.3.0

Remove unneeded staging repositories from build
Key: SPARK-4643
URL: https://issues.apache.org/jira/browse/SPARK-4643
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Adrian Wang
Fix For: 1.3.0
[jira] [Updated] (SPARK-4643) Remove unneeded staging repositories from build
[ https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4643:

Summary: Remove unneeded staging repositories from build (was: spark staging repository location outdated)

Remove unneeded staging repositories from build (SPARK-4643; details above)
[jira] [Updated] (SPARK-4643) Remove unneeded staging repositories from build
[ https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4643:

Assignee: Adrian Wang

Remove unneeded staging repositories from build (SPARK-4643; details above)
[jira] [Resolved] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4645.

Resolution: Fixed
Fix Version/s: 1.2.0

Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver (SPARK-4645; full description and stack trace above)
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228438#comment-14228438 ] Masayoshi TSUZUKI commented on SPARK-4598:

Discussion about this problem seems to be on the GitHub PR: https://github.com/apache/spark/pull/3456

Paginate stage page to avoid OOM with 100,000 tasks
Key: SPARK-4598
URL: https://issues.apache.org/jira/browse/SPARK-4598
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: meiyoula
Priority: Critical

On the HistoryServer stage page, clicking the task href in Description triggers a GC error. The detailed error message is:

{code}
2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
java.lang.OutOfMemoryError: GC overhead limit exceeded
2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
java.lang.OutOfMemoryError: GC overhead limit exceeded
{code}
[jira] [Resolved] (SPARK-4193) Disable doclint in Java 8 to prevent build errors
[ https://issues.apache.org/jira/browse/SPARK-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4193.

Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Takuya Ueshin

https://github.com/apache/spark/pull/3058

Disable doclint in Java 8 to prevent build errors
Key: SPARK-4193
URL: https://issues.apache.org/jira/browse/SPARK-4193
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
Fix For: 1.2.0
[jira] [Issue Comment Deleted] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-4598:

Comment: was deleted (was: Discussion about this problem seems to be on the GitHub PR: https://github.com/apache/spark/pull/3456)

Paginate stage page to avoid OOM with 100,000 tasks (SPARK-4598; full description above)
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228453#comment-14228453 ] Masayoshi TSUZUKI commented on SPARK-4598:

A similar problem was reported on JIRA (https://issues.apache.org/jira/browse/SPARK-2017), but that one is about a client-side problem. When I looked at the SPARK-2017 problem, I produced over 1,000,000 tasks but the server didn't die with an OOM (my web browser just became unresponsive for several minutes). And @rxin and @carlosfuertes also didn't seem to get the server-side OOM. What's the difference? Has the source changed? A closer look at the difference might be a clue to solving the OOM.

Paginate stage page to avoid OOM with 100,000 tasks (SPARK-4598; full description above)
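For context, the mitigation named in the issue title is pagination: render only a bounded slice of the task table per request so the page never materializes all 100,000+ rows at once. A trivial sketch of the slicing step (names are illustrative, not the actual UI code):

{code}
// Render one page of task rows instead of the whole table.
def pageOf[T](tasks: Seq[T], page: Int, pageSize: Int = 100): Seq[T] = {
  require(page >= 1 && pageSize > 0)
  val from = (page - 1) * pageSize
  tasks.slice(from, from + pageSize)
}
{code}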
[jira] [Created] (SPARK-4649) Add method unionAll to PySpark's SchemaRDD
Luca Foschini created SPARK-4649:

Summary: Add method unionAll to PySpark's SchemaRDD
Key: SPARK-4649
URL: https://issues.apache.org/jira/browse/SPARK-4649
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 1.1.0
Reporter: Luca Foschini
Priority: Minor

PySpark has no equivalent of Scala's SchemaRDD.unionAll. The standard SchemaRDD.union method downcasts the result to UnionRDD, which makes it not amenable to chaining.
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648:

Summary: Support Coalesce in Spark SQL. (was: Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.)

Support Coalesce in Spark SQL. (SPARK-4648; full description above)
[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.
[ https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4648:

Description: Support the Coalesce function in Spark SQL. (was: the longer description above about replacing the Hive UDF)

Support Coalesce in Spark SQL. (SPARK-4648)
[jira] [Created] (SPARK-4650) Support multiple columns in count(distinct c1,c2,...) in Spark SQL
Ravindra Pesala created SPARK-4650:

Summary: Support multiple columns in count(distinct c1,c2,...) in Spark SQL
Key: SPARK-4650
URL: https://issues.apache.org/jira/browse/SPARK-4650
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

Support multiple columns inside count(distinct c1,c2,...), which currently does not work in Spark SQL.
[jira] [Commented] (SPARK-4575) Documentation for the pipeline features
[ https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228488#comment-14228488 ] Joseph K. Bradley commented on SPARK-4575:

Perhaps this could take the form of one user guide section for the new pipeline API feature, plus subsections for existing algorithms which have been ported to the new spark.ml branch.

Documentation for the pipeline features
Key: SPARK-4575
URL: https://issues.apache.org/jira/browse/SPARK-4575
Project: Spark
Issue Type: Improvement
Components: Documentation, ML, MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

Add a user guide for the newly added ML pipeline feature.
[jira] [Updated] (SPARK-4650) Support multiple columns in the countDistinct function, like count(distinct c1,c2,...), in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Pesala updated SPARK-4650:

Summary: Support multiple columns in the countDistinct function, like count(distinct c1,c2,...), in Spark SQL (was: Support multiple columns in count(distinct c1,c2,...) in Spark SQL)

Support multiple columns in the countDistinct function, like count(distinct c1,c2,...), in Spark SQL (SPARK-4650; description above)
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228489#comment-14228489 ] Patrick Wendell commented on SPARK-3694:

Yes, we should print that too; I said that in the description. [~ilganeli] some other people are interested in working on this. Are you actively working on it?

Allow printing object graph of tasks/RDD's with a debug flag
Key: SPARK-3694
URL: https://issues.apache.org/jira/browse/SPARK-3694
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Labels: starter

This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html

We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager.
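As a rough illustration of what such a debug trace could look like, here is a minimal reflection-based object-graph walker, loosely in the spirit of the ObjectGraphWalker linked above (all names are illustrative; this is not a proposed Spark API):

{code}
import java.lang.reflect.Modifier
import java.util.{IdentityHashMap => JIdentityMap}

// Sketch: walk an object graph via reflection and print every reachable
// reference with the field path that leads to it, to spot unexpected
// objects captured in a task or RDD closure. Arrays are not descended into.
object ObjectGraphDebugger {
  def printGraph(root: AnyRef): Unit = {
    val seen = new JIdentityMap[AnyRef, AnyRef]()

    def walk(obj: AnyRef, path: String): Unit = {
      if (obj == null || seen.containsKey(obj)) return
      seen.put(obj, obj)
      println(s"$path: ${obj.getClass.getName}")
      var cls: Class[_] = obj.getClass
      while (cls != null) { // include fields declared in superclasses
        for (f <- cls.getDeclaredFields
             if !Modifier.isStatic(f.getModifiers) && !f.getType.isPrimitive) {
          f.setAccessible(true)
          walk(f.get(obj), s"$path.${f.getName}")
        }
        cls = cls.getSuperclass
      }
    }

    walk(root, "root")
  }
}
{code}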
[jira] [Issue Comment Deleted] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694:

Comment: was deleted (was: User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3091)

Allow printing object graph of tasks/RDD's with a debug flag (SPARK-3694; full description above)
[jira] [Commented] (SPARK-4650) Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228491#comment-14228491 ] Apache Spark commented on SPARK-4650:

User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/3511

Support multiple columns in the countDistinct function, like count(distinct c1,c2,...), in Spark SQL (SPARK-4650; description above)
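For illustration, this is the query shape the change needs to support: counting distinct combinations of several columns rather than a single one. The table and column names are made up, and {{sqlContext}} is assumed to exist:

{code}
// Count distinct (user_id, page) pairs rather than distinct values of one column.
val distinctPairs = sqlContext.sql(
  "SELECT COUNT(DISTINCT user_id, page) FROM visits")
distinctPairs.collect().foreach(println)
{code}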
[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228493#comment-14228493 ] Patrick Wendell commented on SPARK-4349: Hey Matt, It turns out that parallel collections are not the only RDD where our sampled pre-emptive serialization trick can break. Other types of RDD's can have discrepancies in the partitions such that some could serialize properly and others don't. And I think those other cases are actually more serious than the parallel collections RDD case because parallelize() is mostly used for prototyping. I've seen the more general issue affect production workloads so it would be good to fix. On top of this, we generally could stand to have better error reporting for failed serialization cases - (related work SPARK-3694). In terms of solutions to this problem, it would be nice to find a solution that works in the general case. Matt - did you look at all about how complicated it would be to catch these errors are the time of serialization and propagate them up correctly such that the task set is aborted? This would be the most general and robust solution, although it could be complicated. Spark driver hangs on sc.parallelize() if exception is thrown during serialization -- Key: SPARK-4349 URL: https://issues.apache.org/jira/browse/SPARK-4349 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Matt Cheah Fix For: 1.3.0 Executing the following in the Spark Shell will lead to the Spark Shell hanging after a stack trace is printed. The serializer is set to the Kryo serializer. {code} scala import com.esotericsoftware.kryo.io.Input import com.esotericsoftware.kryo.io.Input scala import com.esotericsoftware.kryo.io.Output import com.esotericsoftware.kryo.io.Output scala class MyKryoSerializable extends com.esotericsoftware.kryo.KryoSerializable { def write (kryo: com.esotericsoftware.kryo.Kryo, output: Output) { throw new com.esotericsoftware.kryo.KryoException; } ; def read (kryo: com.esotericsoftware.kryo.Kryo, input: Input) { throw new com.esotericsoftware.kryo.KryoException; } } defined class MyKryoSerializable scala sc.parallelize(Seq(new MyKryoSerializable, new MyKryoSerializable)).collect {code} A stack trace is printed during serialization as expected, but another stack trace is printed afterwards, indicating that the driver can't recover: {code} 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not unique! 
akka.actor.PostRestartException: exception post restart (class java.io.IOException)
	at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
	at akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
	at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
	at akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
	at akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
	at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] is not unique!
	at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
	at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
	at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
	at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
{code}
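The fix direction Patrick suggests above - catching the error at serialization time and aborting the task set - can be sketched as follows (a self-contained illustration using plain Java serialization; not Spark's actual scheduler code):
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.control.NonFatal

// Attempt to serialize a task up front; return the bytes on success or the
// failure cause on error, so a scheduler could abort the task set cleanly
// instead of letting the exception escape and wedge the driver.
def trySerialize(task: AnyRef): Either[Throwable, Array[Byte]] = {
  try {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(task)
    out.close()
    Right(buffer.toByteArray)
  } catch {
    case NonFatal(e) => Left(e) // caller aborts the task set with this cause
  }
}
{code}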
[jira] [Resolved] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4584. Resolution: Fixed 2x Performance regression for Spark-on-YARN --- Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Assignee: Marcelo Vanzin Priority: Blocker A significant performance regression (up to 2x) was observed for Spark-on-YARN after the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9, from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4584: --- Fix Version/s: 1.2.0 2x Performance regression for Spark-on-YARN --- Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0 A significant performance regression (up to 2x) was observed for Spark-on-YARN after the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9, from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228508#comment-14228508 ] Josh Rosen commented on SPARK-4598: --- [~meiyoula], Do you have a sample job / workload that will let me reproduce this issue? Which Spark version are you using and how big is your driver memory? Do you know if this is a regression from an earlier Spark version? Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical On the HistoryServer stage page, clicking the task link in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
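The pagination idea in the title can be sketched as follows (helper name hypothetical, not the actual UI code): render one bounded page of task rows at a time rather than materializing all of them.
{code}
// Return only the requested page of task rows; pages are 1-indexed.
def paginate[T](tasks: Seq[T], page: Int, pageSize: Int = 100): Seq[T] = {
  require(page >= 1 && pageSize >= 1, "page and pageSize must be positive")
  tasks.slice((page - 1) * pageSize, page * pageSize)
}
{code}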
[jira] [Resolved] (SPARK-2985) Buffered data in BlockGenerator gets lost when receiver crashes
[ https://issues.apache.org/jira/browse/SPARK-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2985. Resolution: Invalid I think this represents a misunderstanding of the internal APIs. Buffered data in BlockGenerator gets lost when receiver crashes --- Key: SPARK-2985 URL: https://issues.apache.org/jira/browse/SPARK-2985 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: dai zhiyuan Priority: Critical If the ReceiverTracker crashes, the buffered data in BlockGenerator will be lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1450) Specify the default zone in the EC2 script help
[ https://issues.apache.org/jira/browse/SPARK-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1450. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen (was: Tathagata Das) Specify the default zone in the EC2 script help --- Key: SPARK-1450 URL: https://issues.apache.org/jira/browse/SPARK-1450 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0 Reporter: Tathagata Das Assignee: Sean Owen Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4352: --- Priority: Critical (was: Major) Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
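The mechanism described above can be sketched as follows (types simplified, not Spark's actual API): aggregate the pending tasks' preferred locations into per-host counts that an executor request could carry to the cluster resource manager.
{code}
case class PendingTask(preferredHosts: Seq[String])

// Count how many pending tasks prefer each host; the resource manager can use
// this to place new executors where the data already lives.
def hostPreferences(pending: Seq[PendingTask]): Map[String, Int] =
  pending.flatMap(_.preferredHosts).groupBy(identity).mapValues(_.size).toMap
{code}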
[jira] [Created] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop
Tsuyoshi OZAWA created SPARK-4651: - Summary: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228571#comment-14228571 ] Tsuyoshi OZAWA commented on SPARK-4651: --- I'll send a PR via GitHub soon. Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop - Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4651: -- Summary: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop (was: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop --- Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228573#comment-14228573 ] Apache Spark commented on SPARK-4651: - User 'oza' has created a pull request for this issue: https://github.com/apache/spark/pull/3512 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop --- Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228574#comment-14228574 ] Sean Owen commented on SPARK-4651: -- I don't agree with adding these profiles, as they would be identical to the 2.4 profile. It's really a 2.4+ profile now. It's just more build complexity. There is no Hadoop 2.6 right now anyway. Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop --- Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
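In practice, this means newer Hadoop versions can already be targeted by reusing the existing hadoop-2.4 profile and overriding the version property; a sketch of the build invocation (version number illustrative):
{code}
mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package
{code}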
[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228576#comment-14228576 ] Tsuyoshi OZAWA commented on SPARK-4651: --- [~srowen], oops, I thought it had already been released... anyway, I'll add a 2.4+ profile. Thanks for your review! Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop --- Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4651) Adding -Phadoop-2.4+ to compile Spark with newer versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4651: -- Summary: Adding -Phadoop-2.4+ to compile Spark with newer versions of Hadoop (was: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop) Adding -Phadoop-2.4+ to compile Spark with newer versions of Hadoop --- Key: SPARK-4651 URL: https://issues.apache.org/jira/browse/SPARK-4651 Project: Spark Issue Type: Improvement Components: Build Reporter: Tsuyoshi OZAWA Currently, we don't have newer profiles to compile Spark with newer versions of Hadoop. We should have them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228580#comment-14228580 ] Ilya Ganelin commented on SPARK-3694: - Hi Patrick - I am working on it - I am just trying to finalize a test for this. The reason I asked about task serialization is that in the description you talk about task serialization within the TaskSetManager, not the task serialization within the DAGScheduler - for the DAGScheduler you only mention RDD serialization. I wanted to confirm whether to print the task serialization for the DAGScheduler as well as the task serialization for the TaskSetManager. Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Ilya Ganelin Labels: starter This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
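For a flavor of what such a walker involves, here is a minimal, self-contained Scala sketch (not the ehcache implementation linked above, and not Spark code): recursively follow non-primitive fields from a root object, tracking visited objects by identity so reference cycles terminate.
{code}
import java.lang.reflect.Modifier
import java.util.{Collections, IdentityHashMap}

object GraphDebug {
  // Print every distinct object reachable from `root`, indented by depth.
  def printObjectGraph(root: AnyRef, maxDepth: Int = 5): Unit = {
    val visited =
      Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())
    def walk(obj: AnyRef, depth: Int): Unit = {
      if (obj == null || depth > maxDepth || !visited.add(obj)) return
      println(("  " * depth) + obj.getClass.getName)
      obj match {
        case arr: Array[AnyRef] => arr.foreach(walk(_, depth + 1))
        case _ if obj.getClass.isArray => // primitive array: no references inside
        case _ =>
          var cls: Class[_] = obj.getClass
          while (cls != null) {
            for (f <- cls.getDeclaredFields
                 if !Modifier.isStatic(f.getModifiers) && !f.getType.isPrimitive) {
              f.setAccessible(true)
              walk(f.get(obj), depth + 1)
            }
            cls = cls.getSuperclass
          }
      }
    }
    walk(root, 0)
  }
}
{code}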
[jira] [Commented] (SPARK-4342) connection ack timeout improvement, replace Timer with ScheduledExecutor...
[ https://issues.apache.org/jira/browse/SPARK-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228590#comment-14228590 ] Josh Rosen commented on SPARK-4342: --- This looks like a duplicate of SPARK-4393, which has been fixed, so I've resolved this as a duplicate. connection ack timeout improvement, replace Timer with ScheduledExecutor... --- Key: SPARK-4342 URL: https://issues.apache.org/jira/browse/SPARK-4342 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Haitao Yao Replace java.util.Timer with a ScheduledExecutorService, and use the message id directly in the task. For details, see the mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
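The proposed replacement can be sketched as follows (names hypothetical, not the connection manager's actual code): schedule each ack timeout on a ScheduledExecutorService and capture the message id directly in the scheduled task.
{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object AckTimeouts {
  // A single scheduler thread replaces java.util.Timer.
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  // Schedule a timeout action for a message; the message id is captured in the
  // closure, so no separate id-to-task bookkeeping is needed.
  def scheduleAckTimeout(messageId: Long, timeoutMs: Long)(onTimeout: Long => Unit): Unit = {
    scheduler.schedule(new Runnable {
      override def run(): Unit = onTimeout(messageId)
    }, timeoutMs, TimeUnit.MILLISECONDS)
  }
}
{code}
A real implementation would also retain the returned ScheduledFuture so the timeout can be cancelled when the ack arrives.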
[jira] [Closed] (SPARK-4635) Delete the val that is never used in execute() of HashOuterJoin.
[ https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 closed SPARK-4635. - Resolution: Not a Problem Delete the val that is never used in execute() of HashOuterJoin. -- Key: SPARK-4635 URL: https://issues.apache.org/jira/browse/SPARK-4635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor The val boundCondition is created in execute(), but it is never used there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4342) connection ack timeout improvement, replace Timer with ScheduledExecutor...
[ https://issues.apache.org/jira/browse/SPARK-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4342. --- Resolution: Duplicate connection ack timeout improvement, replace Timer with ScheduledExecutor... --- Key: SPARK-4342 URL: https://issues.apache.org/jira/browse/SPARK-4342 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Haitao Yao Replace java.util.Timer with a ScheduledExecutorService, and use the message id directly in the task. For details, see the mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method
[ https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4597: -- Summary: Use proper exception and reset variable in Utils.createTempDir() method (was: Use proper exception and reset variable) Use proper exception and reset variable in Utils.createTempDir() method --- Key: SPARK-4597 URL: https://issues.apache.org/jira/browse/SPARK-4597 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, not IOException. Also, when an exception is thrown, the variable dir should be reset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method
[ https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4597: -- Affects Version/s: 1.2.0 1.1.1 1.0.2 Use proper exception and reset variable in Utils.createTempDir() method --- Key: SPARK-4597 URL: https://issues.apache.org/jira/browse/SPARK-4597 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Liang-Chi Hsieh Priority: Minor In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, not IOException. Also, when an exception is thrown, the variable dir should be reset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method
[ https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228593#comment-14228593 ] Josh Rosen commented on SPARK-4597: --- Resolved by https://github.com/apache/spark/pull/3449 Use proper exception and reset variable in Utils.createTempDir() method --- Key: SPARK-4597 URL: https://issues.apache.org/jira/browse/SPARK-4597 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.0.3, 1.1.2, 1.2.1 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, not IOException. Also, when an exception is thrown, the variable dir should be reset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method
[ https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4597. --- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 1.0.3 Assignee: Liang-Chi Hsieh Use proper exception and reset variable in Utils.createTempDir() method --- Key: SPARK-4597 URL: https://issues.apache.org/jira/browse/SPARK-4597 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.0.3, 1.1.2, 1.2.1 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, not IOException. Also, when an exception is thrown, the variable dir should be reset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
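For context, the corrected shape of such a method can be sketched as follows (a simplified rendering, not the exact patched code): catch SecurityException as well, reset dir on any failure, and fail with a meaningful IOException after too many attempts.
{code}
import java.io.{File, IOException}
import java.util.UUID

def createTempDir(root: String = System.getProperty("java.io.tmpdir")): File = {
  val maxAttempts = 10
  var attempts = 0
  var dir: File = null
  while (dir == null) {
    attempts += 1
    if (attempts > maxAttempts) {
      throw new IOException(
        s"Failed to create a temp directory under $root after $maxAttempts attempts")
    }
    try {
      dir = new File(root, "spark-" + UUID.randomUUID.toString)
      if (dir.exists() || !dir.mkdirs()) {
        dir = null // name collision or mkdirs failure: reset and retry
      }
    } catch {
      case _: SecurityException => dir = null // reset here too, as the issue notes
    }
  }
  dir
}
{code}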
[jira] [Resolved] (SPARK-2031) DAGScheduler supports pluggable clock
[ https://issues.apache.org/jira/browse/SPARK-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2031. --- Resolution: Fixed Fix Version/s: 1.1.0 DAGScheduler supports pluggable clock - Key: SPARK-2031 URL: https://issues.apache.org/jira/browse/SPARK-2031 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Chen Chao Assignee: Chen Chao Fix For: 1.1.0 The DAGScheduler should support a pluggable clock, as the TaskSetManager does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
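The pattern referenced here can be sketched as follows (simplified; Spark's actual Clock differs in detail): define a Clock trait, default to system time, and let tests inject a manually advanced clock.
{code}
trait Clock {
  def getTime(): Long
}

object SystemClock extends Clock {
  override def getTime(): Long = System.currentTimeMillis()
}

// Test-only clock whose time is advanced explicitly.
class ManualClock(private var now: Long = 0L) extends Clock {
  override def getTime(): Long = synchronized { now }
  def advance(ms: Long): Unit = synchronized { now += ms }
}

// A component that takes its clock as a constructor argument can be tested
// deterministically by passing a ManualClock instead of SystemClock.
class EventTimer(clock: Clock = SystemClock) {
  def stamp(): Long = clock.getTime()
}
{code}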
[jira] [Commented] (SPARK-4082) Show Waiting/Queued Stages in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228596#comment-14228596 ] Josh Rosen commented on SPARK-4082: --- My Jobs Page pull request added this on the per-job pages as lists of pending stages. Does that address this issue, or do you think we should have a global list of pending / queued stages? Show Waiting/Queued Stages in Spark UI -- Key: SPARK-4082 URL: https://issues.apache.org/jira/browse/SPARK-4082 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Pat McDonough On the Stages UI page, it would be helpful to show the user any stages that the DAGScheduler has planned but that are not yet active. Currently, this info is not shown to the user in any way. /CC [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3003) FailedStage could not be cancelled by DAGScheduler when cancelJob or cancelStage
[ https://issues.apache.org/jira/browse/SPARK-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3003. --- Resolution: Incomplete Resolving as Incomplete since this couldn't be reproduced in a newer Spark version. FailedStage could not be cancelled by DAGScheduler when cancelJob or cancelStage Key: SPARK-3003 URL: https://issues.apache.org/jira/browse/SPARK-3003 Project: Spark Issue Type: Bug Components: Spark Core Reporter: YanTang Zhai Priority: Minor When a stage changes from running to failed, the DAGScheduler cannot cancel it via cancelJob or cancelStage, since failJobAndIndependentStages only cancels running stages and posts SparkListenerStageCompleted for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228631#comment-14228631 ] Ilya Ganelin commented on SPARK-3694: - Tests are complete, and I will be submitting a pull request shortly. Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Ilya Ganelin Labels: starter This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4649) Add method unionAll to PySpark's SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228640#comment-14228640 ] Anant Daksh Asthana commented on SPARK-4649: I would like to take on this task. Add method unionAll to PySpark's SchemaRDD --- Key: SPARK-4649 URL: https://issues.apache.org/jira/browse/SPARK-4649 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.1.0 Reporter: Luca Foschini Priority: Minor PySpark has no equivalent of Scala's SchemaRDD.unionAll. The standard SchemaRDD.union method downcasts the result to UnionRDD, which makes it unsuitable for chaining. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228646#comment-14228646 ] meiyoula commented on SPARK-4598: - [~joshrosen], I used the GitHub master code from the past two days to test this, and just ran the example SparkPi with the default driver memory. This is the command: ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ../lib/spark-examples*.jar 10 While the application was running and had executed 50,000 tasks, I opened the stage page in the Spark UI and the web server shut down; when the application had finished, I opened the stage page in the HistoryServer and the web server shut down. Note that the HistoryServer memory also uses the default value. Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical On the HistoryServer stage page, clicking the task link in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org