[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-11-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228136#comment-14228136
 ] 

Takeshi Yamamuro commented on SPARK-1987:
-

What is the status of this patch?
This is related to an issue I created 
(https://issues.apache.org/jira/browse/SPARK-4646).
I refactored this patch based on my own patch; the result is as follows:
https://github.com/maropu/spark/commit/77e34424a5e6cf2bfd6300ab35f329bdaba6e775

Thanks :)

 More memory-efficient graph construction
 

 Key: SPARK-1987
 URL: https://issues.apache.org/jira/browse/SPARK-1987
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 A graph's edges are usually the largest component of the graph. GraphX 
 currently stores edges in parallel primitive arrays, so each edge should only 
 take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
 current implementation in EdgePartitionBuilder uses an array of Edge objects 
 as an intermediate representation for sorting, so each edge additionally 
 takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
 (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
 unnecessarily increases GraphX's memory requirements by a factor of 3.
 To save memory, EdgePartitionBuilder should instead use a custom sort routine 
 that operates directly on the three parallel arrays.
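For illustration only, a minimal sketch (not the actual GraphX code) of a sort keyed on (srcId, dstId) that swaps entries of the three parallel arrays in place, so no intermediate Edge objects are allocated; ParallelArraySort is a made-up name:
{code}
object ParallelArraySort {
  // Simple in-place insertion sort over three parallel arrays of equal length.
  // A production version would use a primitive-array quicksort or timsort,
  // but the memory behaviour (no per-edge objects) is the same.
  def sort(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int]): Unit = {
    var i = 1
    while (i < srcIds.length) {
      val src = srcIds(i); val dst = dstIds(i); val attr = attrs(i)
      var j = i - 1
      while (j >= 0 && (srcIds(j) > src || (srcIds(j) == src && dstIds(j) > dst))) {
        srcIds(j + 1) = srcIds(j); dstIds(j + 1) = dstIds(j); attrs(j + 1) = attrs(j)
        j -= 1
      }
      srcIds(j + 1) = src; dstIds(j + 1) = dst; attrs(j + 1) = attr
      i += 1
    }
  }
}
{code}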



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-11-28 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228143#comment-14228143
 ] 

Larry Xiao commented on SPARK-1987:
---

[~maropu] I think it needs a slight change in the build system.
I looked at your patch: cool idea, I didn't know about timsort before, and your code 
looks very clear. :)

 More memory-efficient graph construction
 

 Key: SPARK-1987
 URL: https://issues.apache.org/jira/browse/SPARK-1987
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 A graph's edges are usually the largest component of the graph. GraphX 
 currently stores edges in parallel primitive arrays, so each edge should only 
 take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
 current implementation in EdgePartitionBuilder uses an array of Edge objects 
 as an intermediate representation for sorting, so each edge additionally 
 takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
 (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
 unnecessarily increases GraphX's memory requirements by a factor of 3.
 To save memory, EdgePartitionBuilder should instead use a custom sort routine 
 that operates directly on the three parallel arrays.






[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver

2014-11-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4645:
--
Description: 
Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So 
does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well 
for normal JDBC clients like BeeLine, but throws an exception when using Simba 
ODBC driver v0.1..

Simba ODBC driver tries to execute two statements while connecting to Spark SQL 
HiveThriftServer2:

- {{use `default`}}
- {{set -v}}

However, HiveThriftServer2 throws an exception when executing them:
{code}
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
space
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
at 
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
space
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So 
does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well 
for normal JDBC clients like BeeLine, but throws exception when using Simba 
ODBC driver.

Simba ODBC driver tries to execute two statement while connecting to Spark SQL 
HiveThriftServer2:

- {{use `default`}}
- {{set -v}}

However, HiveThriftServer2 throws exception when executing them:
{code}
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:

[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver

2014-11-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4645:
--
Description: 
Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So 
does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well 
for normal JDBC clients like BeeLine, but throws an exception when using Simba 
ODBC driver v1.0.0.1000.

Simba ODBC driver tries to execute two statements while connecting to Spark SQL 
HiveThriftServer2:

- {{use `default`}}
- {{set -v}}

However, HiveThriftServer2 throws an exception when executing them:
{code}
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
space
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
at 
org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
space
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So 
does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well 
for normal JDBC clients like BeeLine, but throws exception when using Simba 
ODBC driver v0.1..

Simba ODBC driver tries to execute two statement while connecting to Spark SQL 
HiveThriftServer2:

- {{use `default`}}
- {{set -v}}

However, HiveThriftServer2 throws exception when executing them:
{code}
14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:

[jira] [Resolved] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log

2014-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4619.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: maji2014

 Double ms in ShuffleBlockFetcherIterator log
 --

 Key: SPARK-4619
 URL: https://issues.apache.org/jira/browse/SPARK-4619
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.2
Reporter: maji2014
Assignee: maji2014
Priority: Minor
 Fix For: 1.2.0


 The log reads as follows: 
 ShuffleBlockFetcherIterator: Got local blocks in  8 ms ms
 Reason:
 logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms")
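For illustration, a sketch of the likely fix, assuming {{Utils.getUsedTimeMs}} already appends the unit to the string it returns (which is what the doubled "ms" suggests):
{code}
// Before: Utils.getUsedTimeMs(startTime) already yields e.g. "8 ms",
// so appending " ms" again produces "8 ms ms" in the log.
logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms")

// After: drop the redundant suffix.
logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime))
{code}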






[jira] [Updated] (SPARK-1442) Add Window function support

2014-11-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: (was: Window Function.pdf)

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896






[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-11-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228208#comment-14228208
 ] 

Takeshi Yamamuro commented on SPARK-1987:
-

Thanks for your review! :))
What change in the build system do you mean?

Anyway, if there's no problem, I'll send a PR.
Thanks again.
takeshi

 More memory-efficient graph construction
 

 Key: SPARK-1987
 URL: https://issues.apache.org/jira/browse/SPARK-1987
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 A graph's edges are usually the largest component of the graph. GraphX 
 currently stores edges in parallel primitive arrays, so each edge should only 
 take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
 current implementation in EdgePartitionBuilder uses an array of Edge objects 
 as an intermediate representation for sorting, so each edge additionally 
 takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
 (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
 unnecessarily increases GraphX's memory requirements by a factor of 3.
 To save memory, EdgePartitionBuilder should instead use a custom sort routine 
 that operates directly on the three parallel arrays.






[jira] [Created] (SPARK-4647) yarn-client mode reports success even though job fails

2014-11-28 Thread carlmartin (JIRA)
carlmartin created SPARK-4647:
-

 Summary: yarn-client mode reports success even though job fails
 Key: SPARK-4647
 URL: https://issues.apache.org/jira/browse/SPARK-4647
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: carlmartin


The YARN web UI shows SUCCEEDED when the driver throws an exception in yarn-client mode.






[jira] [Commented] (SPARK-3293) yarn's web show SUCCEEDED when the driver throw a exception in yarn-client

2014-11-28 Thread carlmartin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228253#comment-14228253
 ] 

carlmartin commented on SPARK-3293:
---

[~tgraves][~andrewor14] It seems that SPARK-3627 did not fix this problem when 
using yarn-client mode. So I will work on this.

 yarn's web show SUCCEEDED when the driver throw a exception in yarn-client
 

 Key: SPARK-3293
 URL: https://issues.apache.org/jira/browse/SPARK-3293
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2, 1.1.0
Reporter: wangfei
Assignee: Guoqiang Li
 Fix For: 1.2.0


 If an exception occurs, the FinalStatus on the YARN web UI's Applications page 
 is still SUCCEEDED instead of the expected FAILED.
 In the spark-1.0.2 release, only yarn-client mode showed this.
 But recently yarn-cluster mode has also had the problem.
 To reproduce it:
 just create a SparkContext and then throw an exception,
 then watch the applications page on the YARN web site.
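For reference, a minimal sketch of that reproduction (the object name is made up), submitted in yarn-client mode:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro: the driver fails, yet the YARN Applications page
// still reports FinalStatus = SUCCEEDED.
object YarnStatusRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("YarnStatusRepro"))
    throw new RuntimeException("deliberate failure after SparkContext creation")
  }
}
{code}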






[jira] [Commented] (SPARK-3293) yarn's web show SUCCEEDED when the driver throw a exception in yarn-client

2014-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228256#comment-14228256
 ] 

Apache Spark commented on SPARK-3293:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3508

 yarn's web show SUCCEEDED when the driver throw a exception in yarn-client
 

 Key: SPARK-3293
 URL: https://issues.apache.org/jira/browse/SPARK-3293
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2, 1.1.0
Reporter: wangfei
Assignee: Guoqiang Li
 Fix For: 1.2.0


 If an exception occurs, the FinalStatus on the YARN web UI's Applications page 
 is still SUCCEEDED instead of the expected FAILED.
 In the spark-1.0.2 release, only yarn-client mode showed this.
 But recently yarn-cluster mode has also had the problem.
 To reproduce it:
 just create a SparkContext and then throw an exception,
 then watch the applications page on the YARN web site.






[jira] [Commented] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager

2014-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228278#comment-14228278
 ] 

Apache Spark commented on SPARK-4641:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3509

 A FileNotFoundException happened in Hash Shuffle Manager
 

 Key: SPARK-4641
 URL: https://issues.apache.org/jira/browse/SPARK-4641
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Shuffle
  Environment: A WordCount example with some special text input (normal 
  words text)
Reporter: carlmartin

 Using Hash Shuffle without consolidateFiles, it throws an exception like:
   java.io.IOException: Error in reading 
 org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0)
   Caused by: java.io.FileNotFoundException:  (No such file or directory)
 And using Hash Shuffle with consolidateFiles, it throws another 
 exception: 
 java.io.IOException: PARSING_ERROR(2)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)






[jira] [Commented] (SPARK-4644) Implement skewed join

2014-11-28 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228279#comment-14228279
 ] 

Lianhui Wang commented on SPARK-4644:
-

Hi @Shixiong Zhu, with skewed data, can we use a broadcast join to implement it? I 
think the performance of a broadcast join is much higher. At the end we can merge the 
results of the broadcast join and the common join.
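For illustration, a minimal RDD-level sketch of that idea, assuming the skewed (heavy) keys are known up front and their right-side rows fit in driver memory; skewAwareJoin and heavyKeys are made-up names, not Spark API:
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x style)
import org.apache.spark.rdd.RDD

// Heavy keys are joined via a broadcast of their (small) right-side rows,
// the remaining keys use a normal shuffle join, and the two results are unioned.
// For brevity this assumes at most one right-side row per heavy key.
def skewAwareJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    sc: SparkContext,
    left: RDD[(K, V)],
    right: RDD[(K, W)],
    heavyKeys: Set[K]): RDD[(K, (V, W))] = {
  val heavyRight = sc.broadcast(right.filter(kv => heavyKeys.contains(kv._1)).collect().toMap)
  val broadcastJoined = left.filter(kv => heavyKeys.contains(kv._1)).flatMap { case (k, v) =>
    heavyRight.value.get(k).map(w => (k, (v, w)))
  }
  val shuffleJoined = left.filter(kv => !heavyKeys.contains(kv._1))
    .join(right.filter(kv => !heavyKeys.contains(kv._1)))
  broadcastJoined.union(shuffleJoined)
}
{code}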

 Implement skewed join
 -

 Key: SPARK-4644
 URL: https://issues.apache.org/jira/browse/SPARK-4644
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
 Attachments: Skewed Join Design Doc.pdf


 Skewed data is not rare. For example, a book recommendation site may have 
 several books which are liked by most of the users. Running ALS on such 
 skewed data will raise an OutOfMemory error if some book has too many users 
 to fit into memory. To solve this, we propose a skewed join 
 implementation.






[jira] [Commented] (SPARK-4002) KafkaStreamSuite Kafka input stream case fails on OSX

2014-11-28 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228307#comment-14228307
 ] 

Ryan Williams commented on SPARK-4002:
--

This still occurs for me every time I run this test, fwiw.

 KafkaStreamSuite Kafka input stream case fails on OSX
 ---

 Key: SPARK-4002
 URL: https://issues.apache.org/jira/browse/SPARK-4002
 Project: Spark
  Issue Type: Bug
  Components: Streaming
 Environment: Mac OSX 10.9.5.
Reporter: Ryan Williams
 Attachments: unit-tests.log


 [~sowen] mentioned this on spark-dev 
 [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E]
  and I just reproduced it on {{master}} 
 ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]).
 The relevant output I get when running {{./dev/run-tests}} is:
 {code}
 [info] KafkaStreamSuite:
 [info] - Kafka input stream *** FAILED ***
 [info]   3 did not equal 0 (KafkaStreamSuite.scala:135)
 [info] Test run started
 [info] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
 [error] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: 
 junit.framework.AssertionFailedError: expected:<3> but was:<0>
 [error] at junit.framework.Assert.fail(Assert.java:50)
 [error] at junit.framework.Assert.failNotEquals(Assert.java:287)
 [error] at junit.framework.Assert.assertEquals(Assert.java:67)
 [error] at junit.framework.Assert.assertEquals(Assert.java:199)
 [error] at junit.framework.Assert.assertEquals(Assert.java:205)
 [error] at 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
 [error] ...
 [info] Test run finished: 1 failed, 0 ignored, 1 total, 14.451s
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; 
 support was removed in 8.0
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1g; 
 support was removed in 8.0
 [info] ScalaTest
 [info] Run completed in 11 minutes, 39 seconds.
 [info] Total number of tests run: 1
 [info] Suites: completed 1, aborted 0
 [info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
 [info] *** 1 TEST FAILED ***
 [error] Failed: Total 2, Failed 2, Errors 0, Passed 0
 [error] Failed tests:
 [error]   org.apache.spark.streaming.kafka.JavaKafkaStreamSuite
 [error]   org.apache.spark.streaming.kafka.KafkaStreamSuite
 {code}
 The simplest command I know that reproduces this test failure is:
 {code}
 mvn test -Dsuites='*KafkaStreamSuite'
 {code}
 Often I have to {{mvn clean}} before or as part of running that command, 
 otherwise I get other spurious compile errors or crashes, but that is another 
 story.
 Seems like this test should be {{@Ignore}}'d, or some note about this made in 
 the {{README.md}}.






[jira] [Commented] (SPARK-4644) Implement skewed join

2014-11-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228308#comment-14228308
 ] 

Shixiong Zhu commented on SPARK-4644:
-

I disagree with using `broadcast join` because:

1. `broadcast join` is in Spark SQL. It's not convenient for people who only 
want to use Spark Core. Some users (such as ALS in mllib) already use Spark 
Core's `join`, and I don't think forcing users to rewrite them with 
Spark SQL is a good idea.

2. `broadcast join` assumes only one of the two tables has skewed keys. If both 
tables have skewed keys, how do we handle it?

I only know a little about Spark SQL. Please let me know if there is any 
mistake.

 Implement skewed join
 -

 Key: SPARK-4644
 URL: https://issues.apache.org/jira/browse/SPARK-4644
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
 Attachments: Skewed Join Design Doc.pdf


 Skewed data is not rare. For example, a book recommendation site may have 
 several books which are liked by most of the users. Running ALS on such 
 skewed data will raise an OutOfMemory error if some book has too many users 
 to fit into memory. To solve this, we propose a skewed join 
 implementation.






[jira] [Created] (SPARK-4648) Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.

2014-11-28 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created SPARK-4648:
--

 Summary: Use available Coalesce function in HiveQL instead of 
using HiveUDF. And support Coalesce in Spark SQL.
 Key: SPARK-4648
 URL: https://issues.apache.org/jira/browse/SPARK-4648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala


Currently HiveQL uses the Hive UDF for Coalesce. Using Hive UDFs is usually 
memory intensive. Since a Coalesce function is already available in Spark, 
we can make use of it. 
Also support the Coalesce function in Spark SQL.
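For illustration, a minimal sketch of the usage this enables once COALESCE is handled by the Spark SQL parser as this issue proposes (table, columns, and object name are hypothetical; Spark 1.1-era SchemaRDD API):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, nickname: String)

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-example").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val people = sc.parallelize(Seq(Person("Alice", null), Person("Bob", "Bobby")))
    people.registerTempTable("people")

    // COALESCE returns its first non-null argument; here it falls back to name.
    sqlContext.sql("SELECT COALESCE(nickname, name) FROM people").collect().foreach(println)
  }
}
{code}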






[jira] [Commented] (SPARK-4648) Use available Coalesce function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.

2014-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228312#comment-14228312
 ] 

Apache Spark commented on SPARK-4648:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/3510

 Use available Coalesce function in HiveQL instead of using HiveUDF. And 
 support Coalesce in Spark SQL.
 --

 Key: SPARK-4648
 URL: https://issues.apache.org/jira/browse/SPARK-4648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

 Currently HiveQL uses the Hive UDF for Coalesce. Using Hive UDFs is usually 
 memory intensive. Since a Coalesce function is already available in Spark, 
 we can make use of it. 
 Also support the Coalesce function in Spark SQL.






[jira] [Updated] (SPARK-3182) Twitter Streaming Geolocation Filter

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3182:
---
Fix Version/s: (was: 1.2.0)

 Twitter Streaming Geolocation Filter
 

 Key: SPARK-3182
 URL: https://issues.apache.org/jira/browse/SPARK-3182
 Project: Spark
  Issue Type: Wish
  Components: Streaming
Affects Versions: 1.0.0, 1.0.2
Reporter: Daniel Kershaw
  Labels: features
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a geolocation filter to the Twitter Streaming Component. 
 This should take a sequence of doubles to indicate the bounding box for the 
 stream. 






[jira] [Updated] (SPARK-3182) Twitter Streaming Geolocation Filter

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3182:
---
Affects Version/s: (was: 1.0.2)
   (was: 1.0.0)

 Twitter Streaming Geolocation Filter
 

 Key: SPARK-3182
 URL: https://issues.apache.org/jira/browse/SPARK-3182
 Project: Spark
  Issue Type: Wish
  Components: Streaming
Reporter: Daniel Kershaw
  Labels: features
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add a geolocation filter to the Twitter Streaming Component. 
 This should take a sequence of doubles to indicate the bounding box for the 
 stream. 






[jira] [Updated] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4645:
---
Assignee: Cheng Lian

 Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play 
 well with Simba ODBC driver
 -

 Key: SPARK-4645
 URL: https://issues.apache.org/jira/browse/SPARK-4645
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. 
 So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works 
 well for normal JDBC clients like BeeLine, but throws an exception when using 
 Simba ODBC driver v1.0.0.1000.
 Simba ODBC driver tries to execute two statements while connecting to Spark 
 SQL HiveThriftServer2:
 - {{use `default`}}
 - {{set -v}}
 However, HiveThriftServer2 throws an exception when executing them:
 {code}
 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
 space
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive 
 query: 
 org.apache.hive.service.cli.HiveSQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
 space
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}





[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4632:
---
Target Version/s: 1.3.0  (was: 1.2.0)

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking the Spark build.






[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4632:
---
Priority: Critical  (was: Blocker)

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking the Spark build.






[jira] [Resolved] (SPARK-4643) Remove unneeded staging repositories from build

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4643.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Remove unneeded staging repositories from build
 ---

 Key: SPARK-4643
 URL: https://issues.apache.org/jira/browse/SPARK-4643
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Adrian Wang
 Fix For: 1.3.0









[jira] [Updated] (SPARK-4643) Remove unneeded staging repositories from build

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4643:
---
Summary: Remove unneeded staging repositories from build  (was: spark 
staging repository location outdated)

 Remove unneeded staging repositories from build
 ---

 Key: SPARK-4643
 URL: https://issues.apache.org/jira/browse/SPARK-4643
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Adrian Wang
 Fix For: 1.3.0









[jira] [Updated] (SPARK-4643) Remove unneeded staging repositories from build

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4643:
---
Assignee: Adrian Wang

 Remove unneeded staging repositories from build
 ---

 Key: SPARK-4643
 URL: https://issues.apache.org/jira/browse/SPARK-4643
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.3.0









[jira] [Resolved] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4645.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play 
 well with Simba ODBC driver
 -

 Key: SPARK-4645
 URL: https://issues.apache.org/jira/browse/SPARK-4645
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.2.0


 Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. 
 So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works 
 well for normal JDBC clients like BeeLine, but throws an exception when using 
 Simba ODBC driver v1.0.0.1000.
 Simba ODBC driver tries to execute two statements while connecting to Spark 
 SQL HiveThriftServer2:
 - {{use `default`}}
 - {{set -v}}
 However, HiveThriftServer2 throws an exception when executing them:
 {code}
 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query:
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
 space
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at 
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive 
 query: 
 org.apache.hive.service.cli.HiveSQLException: 
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap 
 space
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at 
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}




[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2014-11-28 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228438#comment-14228438
 ] 

Masayoshi TSUZUKI commented on SPARK-4598:
--

Discussion about this problem seems to be on the github PR ticket.
https://github.com/apache/spark/pull/3456


 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical

 On the HistoryServer stage page, clicking the task href in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded






[jira] [Resolved] (SPARK-4193) Disable doclint in Java 8 to prevent from build error.

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4193.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Takuya Ueshin

https://github.com/apache/spark/pull/3058

 Disable doclint in Java 8 to prevent from build error.
 --

 Key: SPARK-4193
 URL: https://issues.apache.org/jira/browse/SPARK-4193
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.2.0









[jira] [Issue Comment Deleted] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2014-11-28 Thread Masayoshi TSUZUKI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masayoshi TSUZUKI updated SPARK-4598:
-
Comment: was deleted

(was: Discussion about this problem seems to be on the github PR ticket.
https://github.com/apache/spark/pull/3456
)

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical

 On the HistoryServer stage page, clicking the task href in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded






[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2014-11-28 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228453#comment-14228453
 ] 

Masayoshi TSUZUKI commented on SPARK-4598:
--

A similar problem was reported on JIRA 
(https://issues.apache.org/jira/browse/SPARK-2017), but that one is about a 
client-side problem.
When I looked into SPARK-2017, I produced over 1,000,000 tasks but the server 
didn't die with an OOM (just my web browser became unresponsive for several 
minutes). And @rxin and @carlosfuertes also didn't seem to get the server-side 
OOM.
What's the difference? Has the source changed? Taking a closer look at the 
difference might be a clue to solving the OOM.
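For reference, a spark-shell style sketch of producing a stage with a very large number of tiny tasks (the numbers are illustrative), whose stage page can then be opened in the UI or HistoryServer:
{code}
// One element per partition => 1,000,000 tasks in a single stage.
sc.parallelize(1 to 1000000, numSlices = 1000000).count()
{code}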

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical

 On the HistoryServer stage page, clicking the task href in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded






[jira] [Created] (SPARK-4649) Add method unionAll to PySpark's SchemaRDD

2014-11-28 Thread Luca Foschini (JIRA)
Luca Foschini created SPARK-4649:


 Summary: Add method unionAll to PySpark's SchemaRDD 
 Key: SPARK-4649
 URL: https://issues.apache.org/jira/browse/SPARK-4649
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Luca Foschini
Priority: Minor


PySpark has no equivalent of Scala's SchemaRDD.unionAll.
The standard SchemaRDD.union method returns a UnionRDD rather than a SchemaRDD, which 
makes it not amenable to chaining.
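For context, a spark-shell style sketch of the Scala-side behaviour being referred to (Spark 1.1 API), which is what the issue asks to mirror in PySpark:
{code}
import org.apache.spark.sql.SchemaRDD

// unionAll preserves the SchemaRDD type, so calls chain naturally.
def combineAll(a: SchemaRDD, b: SchemaRDD, c: SchemaRDD): SchemaRDD =
  a.unionAll(b).unionAll(c)
{code}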






[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.

2014-11-28 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala updated SPARK-4648:
---
Summary: Support Coalesce in Spark SQL.  (was: Use available Coalesce 
function in HiveQL instead of using HiveUDF. And support Coalesce in Spark SQL.)

 Support Coalesce in Spark SQL.
 --

 Key: SPARK-4648
 URL: https://issues.apache.org/jira/browse/SPARK-4648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

 Currently HiveQL uses the Hive UDF for Coalesce. Using Hive UDFs is usually 
 memory intensive. Since a Coalesce function is already available in Spark, 
 we can make use of it. 
 Also support the Coalesce function in Spark SQL.






[jira] [Updated] (SPARK-4648) Support Coalesce in Spark SQL.

2014-11-28 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala updated SPARK-4648:
---
Description: Support Coalesce function in Spark SQL  (was: Currently HiveQL 
uses Hive UDF function for Coalesce. Usually using hive udfs are memory 
intensive. Since Coalesce function is already available in Spark , we can make 
use of it. 
And also support Coalesce function in Spar SQL)

 Support Coalesce in Spark SQL.
 --

 Key: SPARK-4648
 URL: https://issues.apache.org/jira/browse/SPARK-4648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

 Support Coalesce function in Spark SQL






[jira] [Created] (SPARK-4650) Supporting multi column support in count(distinct c1,c2..) in Spark SQL

2014-11-28 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created SPARK-4650:
--

 Summary: Supporting multi column support in count(distinct 
c1,c2..) in Spark SQL
 Key: SPARK-4650
 URL: https://issues.apache.org/jira/browse/SPARK-4650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala


Support multiple columns inside count(distinct c1, c2, ...), which is not 
currently working in Spark SQL. 
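For illustration, a spark-shell style sketch of the query shape this issue targets (table t and columns c1, c2 are hypothetical):
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("SELECT COUNT(DISTINCT c1) FROM t")      // single-column form: already supported
sqlContext.sql("SELECT COUNT(DISTINCT c1, c2) FROM t")  // multi-column form this issue adds
{code}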






[jira] [Commented] (SPARK-4575) Documentation for the pipeline features

2014-11-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228488#comment-14228488
 ] 

Joseph K. Bradley commented on SPARK-4575:
--

Perhaps this could take the form of one user guide section for the new API and 
pipeline feature, plus subsections for existing algorithms which have been 
ported to the new spark.ml branch.

 Documentation for the pipeline features
 ---

 Key: SPARK-4575
 URL: https://issues.apache.org/jira/browse/SPARK-4575
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML, MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Add user guide for the newly added ML pipeline feature.






[jira] [Updated] (SPARK-4650) Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

2014-11-28 Thread Ravindra Pesala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala updated SPARK-4650:
---
Summary: Supporting multi column support in countDistinct function like 
count(distinct c1,c2..) in Spark SQL  (was: Supporting multi column support in 
count(distinct c1,c2..) in Spark SQL)

 Supporting multi column support in countDistinct function like count(distinct 
 c1,c2..) in Spark SQL
 ---

 Key: SPARK-4650
 URL: https://issues.apache.org/jira/browse/SPARK-4650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

 Support multiple columns inside count(distinct c1, c2, ...), which is not 
 currently working in Spark SQL. 






[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-11-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228489#comment-14228489
 ] 

Patrick Wendell commented on SPARK-3694:


Yes, we should print that too - I said that in the description. [~ilganeli], some 
other people are interested in working on this. Are you actively working on it?

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
  Labels: starter

 This would be useful for debugging extra references inside of RDD's
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.
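
A rough sketch of the idea, assuming nothing about Spark's internals (a 
reflection-based walker with illustrative names; the ObjectGraphWalker linked 
above is the real reference):

{code}
import scala.collection.mutable

// Walk the reference graph of an object via reflection and print each reachable
// field path - the kind of trace a debug flag like this could emit before
// serializing an RDD or task.
def printObjectGraph(root: AnyRef, path: String = "root",
                     seen: mutable.Set[AnyRef] = mutable.Set.empty): Unit = {
  if (root == null || seen.contains(root)) return
  seen += root
  println(s"$path: ${root.getClass.getName}")
  for (field <- root.getClass.getDeclaredFields if !field.getType.isPrimitive) {
    field.setAccessible(true)
    field.get(root) match {
      case child: AnyRef => printObjectGraph(child, s"$path.${field.getName}", seen)
      case _ =>  // null field, nothing to follow
    }
  }
}
{code}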



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3694:
---
Comment: was deleted

(was: User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3091)

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
  Labels: starter

 This would be useful for debugging extra references inside of RDD's
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4650) Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

2014-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228491#comment-14228491
 ] 

Apache Spark commented on SPARK-4650:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/3511

 Supporting multi column support in countDistinct function like count(distinct 
 c1,c2..) in Spark SQL
 ---

 Key: SPARK-4650
 URL: https://issues.apache.org/jira/browse/SPARK-4650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala

 Support multiple columns inside count(distinct c1, c2, ...), which currently 
 does not work in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2014-11-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228493#comment-14228493
 ] 

Patrick Wendell commented on SPARK-4349:


Hey Matt,

It turns out that parallel collections are not the only RDDs where our sampled 
pre-emptive serialization trick can break. Other types of RDDs can have 
discrepancies in the partitions such that some serialize properly and others 
don't. And I think those other cases are actually more serious than the parallel 
collections RDD case, because parallelize() is mostly used for prototyping. I've 
seen the more general issue affect production workloads, so it would be good to 
fix. On top of this, we generally could stand to have better error reporting for 
failed serialization cases (related work: SPARK-3694).

In terms of solutions to this problem, it would be nice to find one that works in 
the general case. Matt - did you look at all into how complicated it would be to 
catch these errors at the time of serialization and propagate them up correctly 
so that the task set is aborted? This would be the most general and robust 
solution, although it could be complicated.
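
For concreteness, a minimal sketch of that direction - not Spark's actual code; 
the names (serialize, abortTaskSet) are illustrative:

{code}
import java.nio.ByteBuffer
import scala.util.control.NonFatal

// Wrap task serialization so that a failure aborts the whole task set through the
// supplied callback instead of leaving the driver hanging.
def serializeOrAbort[T](serialize: T => ByteBuffer,
                        abortTaskSet: String => Unit)(task: T): Option[ByteBuffer] = {
  try {
    Some(serialize(task))
  } catch {
    case NonFatal(e) =>
      abortTaskSet(s"Task failed to serialize: ${e.getMessage}")
      None
  }
}
{code}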

 Spark driver hangs on sc.parallelize() if exception is thrown during 
 serialization
 --

 Key: SPARK-4349
 URL: https://issues.apache.org/jira/browse/SPARK-4349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Matt Cheah
 Fix For: 1.3.0


 Executing the following in the Spark Shell will lead to the Spark Shell 
 hanging after a stack trace is printed. The serializer is set to the Kryo 
 serializer.
 {code}
 scala> import com.esotericsoftware.kryo.io.Input
 import com.esotericsoftware.kryo.io.Input
 scala> import com.esotericsoftware.kryo.io.Output
 import com.esotericsoftware.kryo.io.Output
 scala> class MyKryoSerializable extends com.esotericsoftware.kryo.KryoSerializable {
          def write(kryo: com.esotericsoftware.kryo.Kryo, output: Output) {
            throw new com.esotericsoftware.kryo.KryoException
          }
          def read(kryo: com.esotericsoftware.kryo.Kryo, input: Input) {
            throw new com.esotericsoftware.kryo.KryoException
          }
        }
 defined class MyKryoSerializable
 scala> sc.parallelize(Seq(new MyKryoSerializable, new MyKryoSerializable)).collect
 {code}
 A stack trace is printed during serialization as expected, but another stack 
 trace is printed afterwards, indicating that the driver can't recover:
 {code}
 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
 unique!
 akka.actor.PostRestartException: exception post restart (class 
 java.io.IOException)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
   at 
 akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
 is not unique!
   at 
 akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
  

[jira] [Resolved] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4584.

Resolution: Fixed

 2x Performance regression for Spark-on-YARN
 ---

 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi
Assignee: Marcelo Vanzin
Priority: Blocker

 Significant performance regression observed for Spark-on-YARN (up to 2x) after 
 the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9 
 from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
 enough input dataset in YARN cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4584:
---
Fix Version/s: 1.2.0

 2x Performance regression for Spark-on-YARN
 ---

 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 Significant performance regression observed for Spark-on-YARN (up to 2x) after 
 the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9 
 from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
 enough input dataset in YARN cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with > 100,000 tasks

2014-11-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228508#comment-14228508
 ] 

Josh Rosen commented on SPARK-4598:
---

[~meiyoula],

Do you have a sample job / workload that will let me reproduce this issue?  
Which Spark version are you using and how big is your driver memory?  Do you 
know if this is a regression from an earlier Spark version?

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical

 On the HistoryServer stage page, clicking the task href in Description causes a 
 GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
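
A minimal sketch of the pagination idea in the summary (parameter names are 
illustrative, not from Spark's UI code): render only one page of task rows at a 
time instead of all of them at once.

{code}
// Slice the task rows for one page; page numbering starts at 1.
def taskPage[T](tasks: Seq[T], page: Int, pageSize: Int = 100): Seq[T] = {
  val from = (page - 1) * pageSize
  tasks.slice(from, from + pageSize)
}
{code}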



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2985) Buffered data in BlockGenerator gets lost when receiver crashes

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2985.

Resolution: Invalid

I think this represents a misunderstanding of the internal APIs.

 Buffered data in BlockGenerator gets lost when receiver crashes
 ---

 Key: SPARK-2985
 URL: https://issues.apache.org/jira/browse/SPARK-2985
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: dai zhiyuan
Priority: Critical

 If the receiverTracker crashes, the buffered data in BlockGenerator will be lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1450) Specify the default zone in the EC2 script help

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1450.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen  (was: Tathagata Das)

 Specify the default zone in the EC2 script help
 ---

 Key: SPARK-1450
 URL: https://issues.apache.org/jira/browse/SPARK-1450
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0
Reporter: Tathagata Das
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2014-11-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4352:
---
Priority: Critical  (was: Major)

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical

 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 
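
A rough sketch of what communicating those preferences could look like, under 
assumed types (PendingTask and tasksPerExecutor are illustrative, not Spark's API):

{code}
case class PendingTask(preferredHosts: Seq[String])

// Turn the location preferences of pending tasks into a host -> desired executor
// count hint that could be passed along with a dynamic allocation request.
def localityHints(pending: Seq[PendingTask], tasksPerExecutor: Int): Map[String, Int] =
  pending
    .flatMap(_.preferredHosts)
    .groupBy(identity)
    .map { case (host, prefs) => host -> math.max(1, prefs.size / tasksPerExecutor) }
{code}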



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop

2014-11-28 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created SPARK-4651:
-

 Summary: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with 
newer versions of Hadoop
 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA


Currently, we don't have newer profiles to compile Spark with newer versions of 
Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop

2014-11-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228571#comment-14228571
 ] 

Tsuyoshi OZAWA commented on SPARK-4651:
---

I'll send PR via github soon.

 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with newer versions of Hadoop
 -

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop

2014-11-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated SPARK-4651:
--
Summary: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer 
versions of Hadoop  (was: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile with 
newer versions of Hadoop)

 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of 
 Hadoop
 ---

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop

2014-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228573#comment-14228573
 ] 

Apache Spark commented on SPARK-4651:
-

User 'oza' has created a pull request for this issue:
https://github.com/apache/spark/pull/3512

 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of 
 Hadoop
 ---

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop

2014-11-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228574#comment-14228574
 ] 

Sean Owen commented on SPARK-4651:
--

I don't agree with adding these profiles, as they would be identical to the 2.4 
profile. It's really a 2.4+ profile now. It's just more build complexity. 
There is no Hadoop 2.6 right now anyway.

 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of 
 Hadoop
 ---

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4651) Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of Hadoop

2014-11-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228576#comment-14228576
 ] 

Tsuyoshi OZAWA commented on SPARK-4651:
---

[~srowen], oops, I thought it had already been released... anyway, I'll add a 2.4+ 
profile. Thanks for your review!

 Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer versions of 
 Hadoop
 ---

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4651) Adding -Phadoop-2.4+ to compile Spark with newer versions of Hadoop

2014-11-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated SPARK-4651:
--
Summary: Adding -Phadoop-2.4+ to compile Spark with newer versions of 
Hadoop  (was: Adding -Phadoop-2.5 and -Phadoop-2.6 to compile Spark with newer 
versions of Hadoop)

 Adding -Phadoop-2.4+ to compile Spark with newer versions of Hadoop
 ---

 Key: SPARK-4651
 URL: https://issues.apache.org/jira/browse/SPARK-4651
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tsuyoshi OZAWA

 Currently, we don't have newer profiles to compile Spark with newer versions 
 of Hadoop. We should have them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-11-28 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228580#comment-14228580
 ] 

Ilya Ganelin commented on SPARK-3694:
-

Hi Patrick - I am working on it - I am just trying to finalize a test for this. 

The reason I asked about task serialization is that in the description you talk 
about task serialization within the TaskSetManager, not the task serialization 
within the DAGScheduler - for the DAGScheduler you only mention RDD 
serialization. I wanted to confirm whether to print the task serialization for 
the DAGScheduler as well as the task serialization for the TaskSetManager. 

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
  Labels: starter

 This would be useful for debugging extra references inside of RDD's
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4342) connection ack timeout improvement, replace Timer with ScheudledExecutor...

2014-11-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228590#comment-14228590
 ] 

Josh Rosen commented on SPARK-4342:
---

This looks like a duplicate of SPARK-4393, which has been fixed, so I've 
resolved this as duplicate.

 connection ack timeout improvement, replace Timer with ScheudledExecutor...
 ---

 Key: SPARK-4342
 URL: https://issues.apache.org/jira/browse/SPARK-4342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Haitao Yao

  Replace java.util.Timer with a ScheduledExecutorService, and use the message id 
  directly in the task.
  For details, see the mailing list.
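
A minimal sketch of that direction, assuming nothing about Spark's connection 
manager (class and callback names are illustrative):

{code}
import java.util.concurrent.{ConcurrentHashMap, Executors, ScheduledFuture, TimeUnit}

// One timeout task per message id on a shared ScheduledExecutorService, cancelled
// when the ack arrives, instead of a java.util.Timer.
class AckTimeoutTracker(timeoutSeconds: Long, onTimeout: Long => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private val pending = new ConcurrentHashMap[Long, ScheduledFuture[_]]()

  def messageSent(messageId: Long): Unit = {
    val task = new Runnable {
      def run(): Unit = { pending.remove(messageId); onTimeout(messageId) }
    }
    pending.put(messageId, scheduler.schedule(task, timeoutSeconds, TimeUnit.SECONDS))
  }

  def ackReceived(messageId: Long): Unit =
    Option(pending.remove(messageId)).foreach(_.cancel(false))
}
{code}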



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4635) Delete the val that never used in execute() of HashOuterJoin.

2014-11-28 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 closed SPARK-4635.
-
Resolution: Not a Problem

 Delete the val that never used in  execute() of HashOuterJoin.
 --

 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor

 The val boundCondition is created in execute(), but it is never used in 
 execute().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4342) connection ack timeout improvement, replace Timer with ScheudledExecutor...

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4342.
---
Resolution: Duplicate

 connection ack timeout improvement, replace Timer with ScheudledExecutor...
 ---

 Key: SPARK-4342
 URL: https://issues.apache.org/jira/browse/SPARK-4342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Haitao Yao

 Replace java.util.Timer with a ScheduledExecutorService, and use the message id 
 directly in the task.
 For details, see the mailing list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4597:
--
Summary: Use proper exception and reset variable in Utils.createTempDir() 
method  (was: Use proper exception and reset variable)

 Use proper exception and reset variable in Utils.createTempDir() method
 ---

 Key: SPARK-4597
 URL: https://issues.apache.org/jira/browse/SPARK-4597
 Project: Spark
  Issue Type: Bug
Reporter: Liang-Chi Hsieh
Priority: Minor

 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, 
 not IOException. In addition, when an exception is thrown, the variable dir 
 should be reset.
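
A sketch of that fix, assuming this is roughly the shape of the method (not the 
exact Utils.createTempDir code):

{code}
import java.io.{File, IOException}
import java.util.UUID

// Retry a bounded number of times; treat SecurityException like any other failed
// attempt and reset `dir` so the loop tries again with a fresh name.
def createTempDir(root: String, maxAttempts: Int = 10): File = {
  var attempts = 0
  var dir: File = null
  while (dir == null) {
    attempts += 1
    if (attempts > maxAttempts) {
      throw new IOException(s"Failed to create a temp directory under $root " +
        s"after $maxAttempts attempts")
    }
    try {
      dir = new File(root, "spark-" + UUID.randomUUID.toString)
      if (dir.exists() || !dir.mkdirs()) {
        dir = null  // reset so the loop retries
      }
    } catch {
      case _: SecurityException => dir = null
    }
  }
  dir
}
{code}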



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4597:
--
Affects Version/s: 1.2.0
   1.1.1
   1.0.2

 Use proper exception and reset variable in Utils.createTempDir() method
 ---

 Key: SPARK-4597
 URL: https://issues.apache.org/jira/browse/SPARK-4597
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Liang-Chi Hsieh
Priority: Minor

 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, 
 not IOException. In addition, when an exception is thrown, the variable dir 
 should be reset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method

2014-11-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228593#comment-14228593
 ] 

Josh Rosen commented on SPARK-4597:
---

Resolved by https://github.com/apache/spark/pull/3449

 Use proper exception and reset variable in Utils.createTempDir() method
 ---

 Key: SPARK-4597
 URL: https://issues.apache.org/jira/browse/SPARK-4597
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.0.3, 1.1.2, 1.2.1


 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, 
 not IOException. In addition, when an exception is thrown, the variable dir 
 should be reset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4597) Use proper exception and reset variable in Utils.createTempDir() method

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4597.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.1.2
   1.0.3
 Assignee: Liang-Chi Hsieh

 Use proper exception and reset variable in Utils.createTempDir() method
 ---

 Key: SPARK-4597
 URL: https://issues.apache.org/jira/browse/SPARK-4597
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.0.3, 1.1.2, 1.2.1


 In Utils.scala, File.exists() and File.mkdirs() throw only SecurityException, 
 not IOException. In addition, when an exception is thrown, the variable dir 
 should be reset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2031) DAGScheduler supports pluggable clock

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2031.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

 DAGScheduler supports pluggable clock
 -

 Key: SPARK-2031
 URL: https://issues.apache.org/jira/browse/SPARK-2031
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Chen Chao
Assignee: Chen Chao
 Fix For: 1.1.0


 DAGScheduler should support a pluggable clock, like TaskSetManager already does. 
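
The usual shape of this pattern, as a hedged sketch with illustrative names 
(production code injects a system clock, tests inject a manually advanced one):

{code}
trait Clock {
  def getTimeMillis(): Long
}

class SystemClock extends Clock {
  def getTimeMillis(): Long = System.currentTimeMillis()
}

// Deterministic clock for tests: time only moves when advance() is called.
class ManualClock(start: Long = 0L) extends Clock {
  private var now = start
  def getTimeMillis(): Long = synchronized { now }
  def advance(millis: Long): Unit = synchronized { now += millis }
}
{code}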



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4082) Show Waiting/Queued Stages in Spark UI

2014-11-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228596#comment-14228596
 ] 

Josh Rosen commented on SPARK-4082:
---

My Jobs Page pull request added this to the per-job pages as lists of pending 
stages. Does that address this issue, or do you think we should have a global 
list of pending / queued stages?

 Show Waiting/Queued Stages in Spark UI
 --

 Key: SPARK-4082
 URL: https://issues.apache.org/jira/browse/SPARK-4082
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Pat McDonough

 On the Stages UI page, it would be helpful to show the user any stages that the 
 DAGScheduler has planned but that are not yet active. Currently, this info is not 
 shown to the user in any way.
 /CC [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3003) FailedStage could not be cancelled by DAGScheduler when cancelJob or cancelStage

2014-11-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3003.
---
Resolution: Incomplete

Resolving as Incomplete since this couldn't be reproduced in a newer Spark 
version.

 FailedStage could not be cancelled by DAGScheduler when cancelJob or 
 cancelStage
 

 Key: SPARK-3003
 URL: https://issues.apache.org/jira/browse/SPARK-3003
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor

 When a stage changes from running to failed, the DAGScheduler cannot cancel it 
 via cancelJob or cancelStage, because failJobAndIndependentStages only cancels 
 running stages and posts SparkListenerStageCompleted for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-11-28 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228631#comment-14228631
 ] 

Ilya Ganelin commented on SPARK-3694:
-

Tests are completed and I will be submitting a pull request shortly. 

 Allow printing object graph of tasks/RDD's with a debug flag
 

 Key: SPARK-3694
 URL: https://issues.apache.org/jira/browse/SPARK-3694
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
  Labels: starter

 This would be useful for debugging extra references inside of RDD's
 Here is an example for inspiration:
 http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
 We'd want to print this trace for both the RDD serialization inside of the 
 DAGScheduler and the task serialization in the TaskSetManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4649) Add method unionAll to PySpark's SchemaRDD

2014-11-28 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228640#comment-14228640
 ] 

Anant Daksh Asthana commented on SPARK-4649:


I would like to take on this task.

 Add method unionAll to PySpark's SchemaRDD 
 ---

 Key: SPARK-4649
 URL: https://issues.apache.org/jira/browse/SPARK-4649
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Luca Foschini
Priority: Minor

 PySpark has no equivalent of Scala's SchemaRDD.unionAll.
 The standard SchemaRDD.union method downcasts the result to UnionRDD, which 
 makes it unsuitable for chaining.
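
For reference, this is the kind of chaining the Scala API allows and that a 
PySpark unionAll would mirror (the SchemaRDD variable names are made up):

{code}
// Each unionAll call returns a SchemaRDD, so further SchemaRDD operations chain.
val combined = schemaRdd1
  .unionAll(schemaRdd2)
  .unionAll(schemaRdd3)
combined.registerTempTable("combined")
{code}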



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with > 100,000 tasks

2014-11-28 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228646#comment-14228646
 ] 

meiyoula commented on SPARK-4598:
-

[~joshrosen],

I used the GitHub master code from the last two days to test this, and just ran 
the example SparkPi with the default driver memory. This is the command: 
./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client 
../lib/spark-examples*.jar 10

While the application was running and had executed 50,000 tasks, I opened the 
stage page in the Spark UI and the web UI shut down.
When the application finished, I opened the stage page in the HistoryServer and 
the web UI shut down as well. Note that the HistoryServer memory also uses the 
default value.
  

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical

 On the HistoryServer stage page, clicking the task href in Description causes a 
 GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org