[jira] [Assigned] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7604: --- Assignee: Apache Spark Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Apache Spark Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7604: --- Assignee: (was: Apache Spark) Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553821#comment-14553821 ] Apache Spark commented on SPARK-7604: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/6315 Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
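For context, here is a minimal sketch of the existing Scala API that this ticket proposes wrapping (assuming Spark 1.4's org.apache.spark.mllib.feature.PCA; the Python wrapper would be expected to mirror this shape, so treat it as illustrative rather than the final PySpark signature):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

// Fit a PCA model keeping the top 2 principal components, then project
// every vector of the input RDD into the reduced 2-dimensional space.
def projectToTwoComponents(sc: SparkContext) = {
  val data = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 9.0)))
  val pcaModel = new PCA(2).fit(data)  // PCA(k) sets the number of components
  pcaModel.transform(data)             // RDD of projected vectors
}
{code}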
[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553865#comment-14553865 ] Reynold Xin commented on SPARK-7322: Also please make sure you add a unit test to JavaDataFrameSuite to make sure this is usable in Java. Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: DataFrame Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
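For readers following the proposal above, here is a hedged sketch of how such a window expression reads from user code, written against the builder-style Window API that eventually shipped (org.apache.spark.sql.expressions.Window in Spark 1.4, where frame boundaries are numeric row offsets rather than a Frame object):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Running average of "sales" per store, ordered by date, over a frame that
// spans the 50 preceding rows up to and including the current row.
def runningAvg(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("store").orderBy("date").rowsBetween(-50, 0)
  df.select(df("store"), df("date"), avg(df("sales")).over(w).as("avg_sales"))
}
{code}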
[jira] [Commented] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553890#comment-14553890 ] Apache Spark commented on SPARK-7717: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6317 Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode on one of my hosts and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly showed the total memory and the total cores available for the cluster. However, on one of the workers, I did a kill -9 on the worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also, the total number of workers is shown as 4, and the total number of cores shown is incorrect; it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7717: --- Assignee: (was: Apache Spark) Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode in one of my host and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly shows that the total memory and the total cores available for the cluster. However on one of the worker, I did a kill -9 worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also the total workers is shown as 4 and the total number of cores shown is incorrect, it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7788: - Target Version/s: 1.4.0 Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7788: - Priority: Blocker (was: Major) Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Priority: Blocker Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7758: --- Assignee: Apache Spark Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Assignee: Apache Spark Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start the thrift server with the metastore set to PostgreSQL, and it shows an error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with an earlier master branch (from April 7th). After printing their debug-level logs, I found the current branch tries to connect with Derby, but I don't know why; maybe the big restructuring in the SQL module caused this issue.
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms
[jira] [Created] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
Tathagata Das created SPARK-7787: Summary: SerializableAWSCredentials in KinesisReceiver cannot be deserialized Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553841#comment-14553841 ] Apache Spark commented on SPARK-7787: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6316 SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7787: --- Assignee: Tathagata Das (was: Apache Spark) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7787: --- Assignee: Apache Spark (was: Tathagata Das) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi edited comment on SPARK-7754 at 5/21/15 8:15 AM: That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log was (Author: evacchi): That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use {{ rule.named(...) { } }}; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]].
* Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} //
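To make the pattern under discussion concrete outside of Catalyst, here is a self-contained illustration that can be pasted into a Scala REPL; the Rule trait and the rule helper below are simplified stand-ins, not Catalyst's actual classes:
{code}
// Simplified stand-ins for Catalyst's Rule/RuleExecutor machinery.
trait Rule[T] {
  def ruleName: String = getClass.getSimpleName
  def apply(plan: T): T
}

object rule {
  // Wrap a PartialFunction into an anonymous Rule.
  def apply[T](pf: PartialFunction[T, T]): Rule[T] = new Rule[T] {
    def apply(plan: T): T = if (pf.isDefinedAt(plan)) pf(plan) else plan
  }
  // Same, but with an explicit name that a RuleExecutor could log.
  def named[T](name: String)(pf: PartialFunction[T, T]): Rule[T] = new Rule[T] {
    override val ruleName: String = name
    def apply(plan: T): T = if (pf.isDefinedAt(plan)) pf(plan) else plan
  }
}

// Usage: a trivial "rewrite rule" over Int "plans".
val doubleEvens = rule.named[Int]("Double even values") {
  case n if n % 2 == 0 => n * 2
}
assert(doubleEvens(4) == 8 && doubleEvens(3) == 3)
assert(doubleEvens.ruleName == "Double even values")
{code}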
[jira] [Created] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
Aniket Bhatnagar created SPARK-7788: --- Summary: Streaming | Kinesis | KinesisReceiver blocks in onStart Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.3.0 Reporter: Aniket Bhatnagar KinesisReceiver calls worker.run(), which is a blocking call (a while loop), as per the source code of the kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in an infinite loop when calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), perhaps because ReceiverTracker is never able to register the receiver (its receiverInfo field is an empty map), causing it to be stuck in an infinite loop while waiting for the running flag to be set to false. Also, we should investigate a way to have the receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-7788: Assignee: Tathagata Das Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7717: --- Assignee: Apache Spark Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Assignee: Apache Spark Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode in one of my host and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly shows that the total memory and the total cores available for the cluster. However on one of the worker, I did a kill -9 worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also the total workers is shown as 4 and the total number of cores shown is incorrect, it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7765) Input vector should divide with the norm in Word2Vec's findSynonyms
[ https://issues.apache.org/jira/browse/SPARK-7765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553889#comment-14553889 ] Liang-Chi Hsieh commented on SPARK-7765: I will try to close some obsolete PRs. Some PRs are waiting for related APIs to be ready or for review from others. I will usually update them soon once they get responses. Input vector should divide with the norm in Word2Vec's findSynonyms --- Key: SPARK-7765 URL: https://issues.apache.org/jira/browse/SPARK-7765 Project: Spark Issue Type: Bug Components: MLlib Reporter: Liang-Chi Hsieh In Word2Vec's findSynonyms, the computed cosine similarities should be divided by the norm of the given vector, since it is not normalized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
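A small standalone illustration of the point being made (this is not Word2Vec's actual code): cosine similarity divides the dot product by both vector norms, so if the query vector's norm is left out, every reported similarity is scaled by that norm.
{code}
// Plain cosine similarity between two dense vectors.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)  // dividing by BOTH norms keeps the result in [-1, 1]
}

// Scaling the query vector must not change its similarity to other vectors.
val query  = Array(2.0, 0.0)
val scaled = Array(4.0, 0.0)  // same direction, larger norm
val other  = Array(1.0, 1.0)
assert(math.abs(cosineSimilarity(query, other) - cosineSimilarity(scaled, other)) < 1e-9)
{code}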
[jira] [Commented] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553791#comment-14553791 ] Apache Spark commented on SPARK-7758: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6314 Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore
[jira] [Assigned] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7758: --- Assignee: (was: Apache Spark) Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms
[jira] [Created] (SPARK-7786) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext
yangping wu created SPARK-7786: -- Summary: Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext Key: SPARK-7786 URL: https://issues.apache.org/jira/browse/SPARK-7786 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Reporter: yangping wu As mentioned in [SPARK-5411|https://issues.apache.org/jira/browse/SPARK-5411], we can also allow users to register a StreamingListener through SparkConf settings and have it loaded when creating the StreamingContext. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7786) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-7786: --- Priority: Minor (was: Major) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext -- Key: SPARK-7786 URL: https://issues.apache.org/jira/browse/SPARK-7786 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Reporter: yangping wu Priority: Minor As mentioned in [SPARK-5411|https://issues.apache.org/jira/browse/SPARK-5411], We can also allow user to register StreamingListener through SparkConf settings, and loaded when creating StreamingContext, This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
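A hedged sketch of what the proposed usage could look like. The configuration key spark.streaming.extraListeners and the listener class below are assumptions for illustration, mirroring the spark.extraListeners mechanism added by SPARK-5411; they are not an existing API as of 1.3.1:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// A listener that a monitoring framework might ship; under this proposal it
// would be named in the configuration instead of registered programmatically.
class LoggingStreamingListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
    println(s"Batch done, processing delay: ${batch.batchInfo.processingDelay.getOrElse(-1L)} ms")
}

val conf = new SparkConf()
  .setAppName("listener-from-conf-example")
  // Hypothetical key proposed by this ticket (NOT a real setting today):
  .set("spark.streaming.extraListeners", classOf[LoggingStreamingListener].getName)

// Today one would instead call ssc.addStreamingListener(new LoggingStreamingListener)
// after creating the StreamingContext.
{code}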
[jira] [Commented] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi commented on SPARK-7754: --- That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use {{ rule.named(...) { } }}; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case.
* Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} // define a PartialFunction[LogicalPlan, LogicalPlan] // the implicit would convert it into a Rule[LogicalPlan] at the use sites val MyRewriteRule = { case ... => ... } {code} *Caveats* Because of the way objects are initialized vs. vals, it might be necessary to reorder instructions so that vals are actually initialized before they are used. E.g.: {code:java} class MyOptimizer extends
[jira] [Updated] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7322: --- Description: Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} was: Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(): WindowFunctionSpec ... } {code} 2. WindowFunctionSpec: {code} // By default frame = full partition class WindowFunctionSpec { def partitionBy(cols: Column*): WindowFunctionSpec def orderBy(cols: Column*): WindowFunctionSpec // restrict frame beginning from current row - n position def rowsPreceding(n: Int): WindowFunctionSpec // restrict frame ending from current row - n position def rowsFollowing(n: Int): WindowFunctionSpec def rangePreceding(n: Int): WindowFunctionSpec def rowsFollowing(n: Int): WindowFunctionSpec } {code} Here's an example to use it: {code} df.select( df.store, df.date, df.sales, avg(df.sales).over.partitionBy(df.store) .orderBy(df.store) .rowsFollowing(0)// this means from unbounded preceding to current row ) {code} Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: DataFrame Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi edited comment on SPARK-7754 at 5/21/15 8:24 AM: That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log. Also, this could be partially worked around by carefully modularizing rules within container objects/traits was (Author: evacchi): That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case.
* The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:*
[jira] [Commented] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553893#comment-14553893 ] Tathagata Das commented on SPARK-7788: -- This may be a good catch. I will try to look into it very soon and try to solve it for 1.4 Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Priority: Blocker Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
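For readers of this thread, a minimal standalone sketch of the kind of fix being discussed (an assumption about the shape of the change, not the actual KinesisReceiver code): a receiver's onStart must return promptly, so a blocking loop such as the KCL Worker.run() can be pushed onto a daemon thread and stopped from onStop.
{code}
// Generic pattern: run a blocking loop on a background thread so that onStart()
// returns immediately and the ReceiverTracker can register the receiver.
class BackgroundLoop(blockingLoop: Runnable, shutdown: () => Unit) {
  @volatile private var workerThread: Thread = _

  def onStart(): Unit = {
    workerThread = new Thread(blockingLoop, "kinesis-worker")
    workerThread.setDaemon(true)
    workerThread.start()  // returns right away; the loop keeps running in the background
  }

  def onStop(): Unit = {
    shutdown()            // e.g. ask the Kinesis Worker to shut down its loop
    if (workerThread != null) workerThread.join()
  }
}
{code}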
[jira] [Resolved] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag
[ https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7606. Resolution: Fixed Fix Version/s: 1.4.0 Document all PySpark SQL/DataFrame public methods with @since tag - Key: SPARK-7606 URL: https://issues.apache.org/jira/browse/SPARK-7606 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Nicholas Chammas Assignee: Davies Liu Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
Xiangrui Meng created SPARK-7784: Summary: Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553733#comment-14553733 ] Kaveen Raajan commented on SPARK-5389: -- I had the same problem and found the root cause for this issue: JAVA_HOME is not set correctly. Please set this environment variable correctly and make sure your Java path does not contain any spaces. OS - Windows 8 and Windows Server 2008 spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in a DOS prompt on Windows 7 but works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine. Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4\bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6013) Add more Python ML examples for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-6013: Assignee: Ram Sriharsha Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7784: - Attachment: compat_report.html There are lots of false positives since package private classes/methods are public in Java. But at least there are no false negatives. Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Attachments: compat_report.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7753) Improve kernel density API
[ https://issues.apache.org/jira/browse/SPARK-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7753. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6279 [https://github.com/apache/spark/pull/6279] Improve kernel density API -- Key: SPARK-7753 URL: https://issues.apache.org/jira/browse/SPARK-7753 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Kernel density estimation is provided in many statistics libraries: http://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation. We should make sure that we implement a similar API. The two most important parameters of kernel density estimation are kernel type and bandwidth. Besides density estimation, it is also used for smoothing. The current API is designed only for Gaussian kernel and density estimation: {code} def kernelDensity(samples: RDD[Double], standardDeviation: Double, evaluationPoints: Iterable[Double]): Array[Double] {code} It would be nice if we can come up with an extensible API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
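For context on the "extensible API" point above, here is a hedged sketch of what a builder-style kernel density estimator could look like; the class shape and method names are illustrative assumptions, not necessarily what the resolving pull request implements:

{code}
import org.apache.spark.rdd.RDD

// Illustrative sketch only: a setter-based estimator that could later grow
// additional kernels or bandwidth-selection strategies without breaking callers.
class KernelDensitySketch {
  private var sample: RDD[Double] = _
  private var bandwidth: Double = 1.0

  def setSample(sample: RDD[Double]): this.type = { this.sample = sample; this }
  def setBandwidth(bandwidth: Double): this.type = { this.bandwidth = bandwidth; this }

  // Evaluate a Gaussian-kernel density estimate at the given points.
  def estimate(points: Array[Double]): Array[Double] = {
    val n = sample.count().toDouble
    val h = bandwidth
    val norm = 1.0 / (math.sqrt(2.0 * math.Pi) * h * n)
    sample
      .map(x => points.map(p => math.exp(-0.5 * math.pow((p - x) / h, 2))))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      .map(_ * norm)
  }
}
{code}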
[jira] [Resolved] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7784. Resolution: Fixed Fix Version/s: 1.4.0 I went through them quickly. Everything looks fine. Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Attachments: compat_report.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
Manoj Kumar created SPARK-7785: -- Summary: Add missing items to pyspark.mllib.linalg.Matrices Key: SPARK-7785 URL: https://issues.apache.org/jira/browse/SPARK-7785 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar For DenseMatrices: class methods __str__, transpose; object methods zeros, ones, eye, rand, randn, diag. For SparseMatrices: class methods __str__, transpose; object methods fromCoo, speye, sprand, sprandn, spdiag. Matrices methods: horzcat, vertcat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553769#comment-14553769 ] Manoj Kumar commented on SPARK-7785: ping [~josephkb] Add missing items to pyspark.mllib.linalg.Matrices -- Key: SPARK-7785 URL: https://issues.apache.org/jira/browse/SPARK-7785 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar For DenseMatrices: class methods __str__, transpose; object methods zeros, ones, eye, rand, randn, diag. For SparseMatrices: class methods __str__, transpose; object methods fromCoo, speye, sprand, sprandn, spdiag. Matrices methods: horzcat, vertcat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
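For reference, most of the requested items already exist on the Scala side in org.apache.spark.mllib.linalg.Matrices; a brief usage sketch of the Scala counterparts that the Python API would mirror (the matrix shapes and seeds below are made up for the example):

{code}
import java.util.Random
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

val zeros = Matrices.zeros(2, 3)                    // 2x3 dense matrix of zeros
val ones  = Matrices.ones(2, 3)                     // 2x3 dense matrix of ones
val eye   = Matrices.eye(3)                         // 3x3 identity matrix
val rand  = Matrices.rand(2, 2, new Random(42))     // uniform random entries
val randn = Matrices.randn(2, 2, new Random(42))    // standard normal entries
val diag  = Matrices.diag(Vectors.dense(1.0, 2.0))  // diagonal matrix from a vector

val speye = Matrices.speye(3)                       // sparse identity matrix
val horz  = Matrices.horzcat(Array(zeros, ones))    // concatenate side by side
val vert  = Matrices.vertcat(Array(zeros, ones))    // stack vertically
{code}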
[jira] [Commented] (SPARK-7320) Add rollup and cube support to DataFrame Java/Scala DSL
[ https://issues.apache.org/jira/browse/SPARK-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553696#comment-14553696 ] Apache Spark commented on SPARK-7320: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6312 Add rollup and cube support to DataFrame Java/Scala DSL --- Key: SPARK-7320 URL: https://issues.apache.org/jira/browse/SPARK-7320 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: starter Fix For: 1.4.0 We should add two functions to GroupedData in order to support rollup and cube for the DataFrame DSL. {code} def rollup(): GroupedData def cube(): GroupedData {code} These two should return new GroupedData with the appropriate state set so when we run an Aggregate, we translate the underlying logical operator into Rollup or Cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
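As a rough illustration of how such a DSL might be used from user code (whether the entry point ends up on DataFrame or GroupedData is an open detail here; df, the column names, and the aggregates are made up for the example):

{code}
import org.apache.spark.sql.functions._

// Subtotals per (department, group), per department, and a grand total.
df.rollup("department", "group").agg(sum("salary"))

// Aggregates over every grouping-set combination of the two columns.
df.cube("department", "group").agg(avg("salary"))
{code}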
[jira] [Commented] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553703#comment-14553703 ] Xia Hu commented on SPARK-6197: --- OK, got it. Thank you very much! handle json parse exception for eventlog file not finished writing --- Key: SPARK-6197 URL: https://issues.apache.org/jira/browse/SPARK-6197 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Labels: backport-needed Fix For: 1.4.0 This is a follow-up JIRA for [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files with the suffix *.inprogress*. However, the event log file may not have finished writing in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated in the last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. For this case, we can just ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user: it can show the past status of the app before it terminated abnormally (we cannot guarantee that the history shows exactly the last moment when the app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7394: --- Assignee: Apache Spark Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7394: --- Assignee: (was: Apache Spark) Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553772#comment-14553772 ] Apache Spark commented on SPARK-7394: - User 'kaka1992' has created a pull request for this issue: https://github.com/apache/spark/pull/6313 Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-7404: Assignee: Xiangrui Meng Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
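A hedged sketch of how such an evaluator might be used for tuning once it exists; the class name, metric name, and trainingDF below are assumptions based on the issue description rather than a confirmed API:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression()
val evaluator = new RegressionEvaluator().setMetricName("rmse")  // assumed metric name

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()

// Pick the regParam that gives the best RMSE via cross-validation.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)  // trainingDF: an assumed DataFrame with "features" and "label" columns
{code}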
[jira] [Resolved] (SPARK-7745) Replace assertions with requires (IllegalArgumentException) and modify other state checks
[ https://issues.apache.org/jira/browse/SPARK-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7745. -- Resolution: Fixed Fix Version/s: 1.4.0 Replace assertions with requires (IllegalArgumentException) and modify other state checks - Key: SPARK-7745 URL: https://issues.apache.org/jira/browse/SPARK-7745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Minor Fix For: 1.4.0 Assertions can be turned off. require throws an IllegalArgumentException, which makes more sense when it's a variable set by the user. Also, some SparkExceptions can be changed to IllegalStateExceptions for the streaming Spark context state checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7745) Replace assertions with requires (IllegalArgumentException) and modify other state checks
[ https://issues.apache.org/jira/browse/SPARK-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7745: - Assignee: Burak Yavuz Replace assertions with requires (IllegalArgumentException) and modify other state checks - Key: SPARK-7745 URL: https://issues.apache.org/jira/browse/SPARK-7745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Minor Fix For: 1.4.0 Assertions can be turned off. require throws an IllegalArgumentException, which makes more sense when it's a variable set by the user. Also, some SparkExceptions can be changed to IllegalStateExceptions for the streaming Spark context state checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
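To make the distinction concrete, a small generic Scala illustration of why require fits user-supplied values better than assert:

{code}
// assert checks internal invariants, throws AssertionError, and can be compiled
// away with -Xdisable-assertions, so it should not guard user input.
def setBatchInterval(ms: Long): Unit = {
  assert(ms > 0, "batch interval must be positive")
}

// require documents a contract on caller-provided arguments, always runs, and
// throws IllegalArgumentException, the conventional signal for a bad argument.
def setBatchIntervalChecked(ms: Long): Unit = {
  require(ms > 0, s"batch interval must be positive, but got $ms")
}
{code}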
[jira] [Updated] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-7789: Description: After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: bq.java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:326)
[jira] [Created] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
meiyoula created SPARK-7789: --- Summary: sql on security hbase:Token generation only allowed for Kerberos authenticated clients Key: SPARK-7789 URL: https://issues.apache.org/jira/browse/SPARK-7789 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: bq. java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at
[jira] [Commented] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553958#comment-14553958 ] meiyoula commented on SPARK-7789: - I use the hive source of https://github.com/pwendell/hive/, it has this bug. But when I merge the HIVE-8874 into the hive source, everything goes to good. [~deanchen]Which hive do you use, do you meet this problem? [~pwendell] I think this is a bug, Can you merge HIVE-8874 into you hive repository? sql on security hbase:Token generation only allowed for Kerberos authenticated clients --- Key: SPARK-7789 URL: https://issues.apache.org/jira/browse/SPARK-7789 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: {quote} java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at 
org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at
[jira] [Updated] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-7789: Description: After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: {quote} java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:326)
[jira] [Created] (SPARK-7796) Use the new RPC implementation by default
Shixiong Zhu created SPARK-7796: --- Summary: Use the new RPC implementation by default Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7722: --- Assignee: Tathagata Das (was: Apache Spark) Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554797#comment-14554797 ] Akshat Aranya edited comment on SPARK-7708 at 5/21/15 6:22 PM: --- I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. This fixes are for 1.2, so I'll see if I can port them to master and write a test for them. was (Author: aaranya): I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
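For illustration only (this is not the actual patch), a minimal sketch of what Kryo-aware serialization for a ByteBuffer wrapper in the spirit of SerializableBuffer could look like, using Kryo's KryoSerializable interface; the class name is hypothetical:

{code}
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Sketch: write the buffer length followed by its bytes so Kryo round-trips the
// contents instead of leaving the wrapped field null after deserialization.
class KryoFriendlyBuffer(@transient var buffer: ByteBuffer)
  extends KryoSerializable with Serializable {

  override def write(kryo: Kryo, out: Output): Unit = {
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes)  // copy without moving the original position
    out.writeInt(bytes.length)
    out.writeBytes(bytes)
  }

  override def read(kryo: Kryo, in: Input): Unit = {
    buffer = ByteBuffer.wrap(in.readBytes(in.readInt()))
  }
}
{code}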
[jira] [Assigned] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7722: --- Assignee: Apache Spark (was: Tathagata Das) Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Apache Spark Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554825#comment-14554825 ] Davies Liu commented on SPARK-6764: --- My first question is: can we use wheel packages just like egg files (without pip on the worker)? Package or download them in the driver, then send them to the workers. Add wheel package support for PySpark - Key: SPARK-6764 URL: https://issues.apache.org/jira/browse/SPARK-6764 Project: Spark Issue Type: Improvement Components: Deploy, PySpark Reporter: Takao Magoori Priority: Minor Labels: newbie We can do _spark-submit_ with one or more Python packages (.egg, .zip and .jar) via the *--py-files* option. h4. zip packaging Spark puts a zip file in its working directory and adds the absolute path to Python's sys.path. When the user program imports it, [zipimport|https://docs.python.org/2.7/library/zipimport.html] is automatically invoked under the hood. That is, data files and dynamic modules (.pyd, .so) cannot be used since zipimport supports only .py, .pyc and .pyo. h4. egg packaging Spark puts an egg file in its working directory and adds the absolute path to Python's sys.path. Unlike zipimport, egg can handle data files and dynamic modules as long as the author of the package uses the [pkg_resources API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] properly. But many Python modules do not use the pkg_resources API, which causes ImportError or "No such file" errors. Moreover, creating eggs for dependencies and their transitive dependencies is a troublesome job. h4. wheel packaging Supporting the new standard Python package format, [wheel|https://wheel.readthedocs.org/en/latest/], would be nice. With wheel, we can do spark-submit with complex dependencies simply as follows. 1. Write a requirements.txt file. {noformat} SQLAlchemy MySQL-python requests simplejson>=3.6.0,<=3.6.5 pydoop {noformat} 2. Do wheel packaging with a single command. All dependencies are wheel-ed. {noformat} $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt {noformat} 3. Do spark-submit {noformat} your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name *.whl -print0 | sed -e 's/\x0/,/g') your_driver.py {noformat} If your pyspark driver is a package that consists of many modules, 1. Write a setup.py for your PySpark driver package. {noformat} from setuptools import ( find_packages, setup, ) setup( name='yourpkg', version='0.0.1', packages=find_packages(), install_requires=[ 'SQLAlchemy', 'MySQL-python', 'requests', 'simplejson>=3.6.0,<=3.6.5', 'pydoop', ], ) {noformat} 2. Do wheel packaging with a single command. Your driver package and all dependencies are wheel-ed. {noformat} your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/. {noformat} 3. Do spark-submit {noformat} your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name *.whl -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7724) Add support for Intersect and Except in Catalyst DSL
[ https://issues.apache.org/jira/browse/SPARK-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554854#comment-14554854 ] Apache Spark commented on SPARK-7724: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/6327 Add support for Intersect and Except in Catalyst DSL Key: SPARK-7724 URL: https://issues.apache.org/jira/browse/SPARK-7724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Priority: Trivial Labels: easyfix, starter The Catalyst DSL for creating logical plans supports most of the current plan operators, but it is missing Except and Intersect. See LogicalPlanFunctions: https://github.com/apache/spark/blob/6008ec14ed6491d0a854bb50548c46f2f9709269/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L248 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
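A hedged sketch of what the two missing helpers could look like, written in the style of the existing DSL implicits; the method placement and the enrichment class name are assumptions, not the submitted change:

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Except, Intersect, LogicalPlan}

// Sketch: enrich LogicalPlan with the two missing set operations,
// mirroring how LogicalPlanFunctions exposes the other operators.
object SetOperationDsl {
  implicit class RichLogicalPlan(logicalPlan: LogicalPlan) {
    def intersect(otherPlan: LogicalPlan): LogicalPlan = Intersect(logicalPlan, otherPlan)
    def except(otherPlan: LogicalPlan): LogicalPlan = Except(logicalPlan, otherPlan)
  }
}
{code}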
[jira] [Commented] (SPARK-7795) Speed up task serialization in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554721#comment-14554721 ] Apache Spark commented on SPARK-7795: - User 'coolfrood' has created a pull request for this issue: https://github.com/apache/spark/pull/6323 Speed up task serialization in standalone mode -- Key: SPARK-7795 URL: https://issues.apache.org/jira/browse/SPARK-7795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Akshat Aranya My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7795) Speed up task serialization in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7795: --- Assignee: (was: Apache Spark) Speed up task serialization in standalone mode -- Key: SPARK-7795 URL: https://issues.apache.org/jira/browse/SPARK-7795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Akshat Aranya My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
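A rough, generic sketch of the kind of change described (the method names here are illustrative, not the actual scheduler code): hoist a single serializer instance out of the per-task loop instead of creating one per task:

{code}
import java.nio.ByteBuffer
import org.apache.spark.SparkEnv

// Before (illustrative): a fresh SerializerInstance is created for every task.
def serializePerTask(tasks: Seq[AnyRef]): Seq[ByteBuffer] =
  tasks.map { task =>
    val ser = SparkEnv.get.closureSerializer.newInstance()
    ser.serialize(task)
  }

// After (illustrative): reuse one instance for the whole batch. A SerializerInstance
// is not thread-safe, so this only works when the loop runs in a single thread.
def serializeWithSharedInstance(tasks: Seq[AnyRef]): Seq[ByteBuffer] = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  tasks.map(task => ser.serialize(task))
}
{code}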
[jira] [Updated] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7780: - Target Version/s: 1.5.0 The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
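In symbols, the point is that the penalty should apply to the weight vector only, never the intercept; a sketch of the intended objective (the notation below is standard and not taken from the issue):

{code}
\min_{w,\, b} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i,\; w^\top x_i + b\big) \;+\; \lambda\, R(w)
\qquad \text{with } R(w) = \tfrac{1}{2}\lVert w \rVert_2^2 \text{ (or } \lVert w \rVert_1\text{), applied to } w \text{ only, never to } b.
{code}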
[jira] [Created] (SPARK-7797) Remove actorSystem from SparkEnv
Shixiong Zhu created SPARK-7797: --- Summary: Remove actorSystem from SparkEnv Key: SPARK-7797 URL: https://issues.apache.org/jira/browse/SPARK-7797 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7800) isDefined should not be marked too early in putNewKey
Liang-Chi Hsieh created SPARK-7800: -- Summary: isDefined should not be marked too early in putNewKey Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7737: Assignee: Yin Huai (was: Cheng Lian) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554769#comment-14554769 ] Yin Huai commented on SPARK-7737: - Seems https://github.com/apache/spark/pull/6287 still not fix partition discovery completely. For the case in the description, the following case works {code} load(/partitions5k/i=2/, parquet) {code} However, for this case, we fail... {code} load(/partitions5k/, parquet) {code} parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Affects Version/s: 1.4.0 isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Assignee: Liang-Chi Hsieh isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Priority: Minor (was: Major) isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh Priority: Minor isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7749) Parquet metastore conversion does not use metastore cache
[ https://issues.apache.org/jira/browse/SPARK-7749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7749. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6287 [https://github.com/apache/spark/pull/6287] Parquet metastore conversion does not use metastore cache - Key: SPARK-7749 URL: https://issues.apache.org/jira/browse/SPARK-7749 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0 Seems https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L462-467 is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7712: Shepherd: Yin Huai Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: Native Spark SQL and Performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own Aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'Native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style Window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required); we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach, relying on the fact that the aggregate operations are commutative; there is one twist, though: it will process the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. 
The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
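To make the running-frame argument concrete, a generic (non-Spark) Scala sketch of the difference described above for UNBOUNDED PRECEDING AND CURRENT ROW frames; the function names are made up for the illustration:

{code}
// Sliding-style evaluation: recompute the frame for every row, roughly O(N^2) work per partition.
def runningSumNaive(partition: IndexedSeq[Double]): IndexedSeq[Double] =
  partition.indices.map(i => partition.slice(0, i + 1).sum)

// Running-style evaluation: one accumulator, one update per row, O(N) work per partition.
def runningSumIncremental(partition: IndexedSeq[Double]): IndexedSeq[Double] = {
  var acc = 0.0
  partition.map { value =>
    acc += value
    acc
  }
}
{code}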
[jira] [Commented] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554753#comment-14554753 ] Herman van Hovell tot Westerflier commented on SPARK-7712: -- I have updated the description to explain the design. Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. 
The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still a work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Target Version/s: 1.4.0 Affects Version/s: 1.4.0 Failed to start thrift server when metastore is PostgreSQL --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start the thrift server with the metastore set to PostgreSQL, and it shows an error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) {code} But it works well with the earlier master branch (from April 7th). After printing their debug-level logs, I found the current branch tries to connect to Derby but didn't know why; maybe the big restructuring in the sql module caused this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore:
[jira] [Closed] (SPARK-7773) Document `spark.dynamicAllocation.initialExecutors`
[ https://issues.apache.org/jira/browse/SPARK-7773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7773. Resolution: Won't Fix Turns out it was already documented? Document `spark.dynamicAllocation.initialExecutors` --- Key: SPARK-7773 URL: https://issues.apache.org/jira/browse/SPARK-7773 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or A small omission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554797#comment-14554797 ] Akshat Aranya commented on SPARK-7708: -- I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
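As a rough illustration of fix (1) above, a custom Kryo serializer for a ByteBuffer-backed wrapper could look something like the sketch below. BufferWrapper is a hypothetical stand-in for SerializableBuffer, not Spark's actual class, and the field name is an assumption; Kryo's default field serialization does not invoke Java's writeObject/readObject hooks, which is presumably why the wrapped buffer is otherwise lost.

{code}
import java.nio.ByteBuffer

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical wrapper standing in for SerializableBuffer (field name assumed).
class BufferWrapper(var buffer: ByteBuffer)

// Sketch of fix (1): write the buffer length followed by its bytes, so the
// wrapped ByteBuffer actually round-trips through Kryo.
class BufferWrapperSerializer extends Serializer[BufferWrapper] {
  override def write(kryo: Kryo, output: Output, wrapper: BufferWrapper): Unit = {
    val buf = wrapper.buffer.duplicate() // leave the caller's position untouched
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[BufferWrapper]): BufferWrapper = {
    val length = input.readInt()
    new BufferWrapper(ByteBuffer.wrap(input.readBytes(length)))
  }
}
{code}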
[jira] [Updated] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7787: - Reporter: Chris Fregly (was: Tathagata Das) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Chris Fregly Assignee: Tathagata Das Priority: Blocker Fix For: 1.4.0 Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7787. -- Resolution: Fixed Fix Version/s: 1.4.0 SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Chris Fregly Assignee: Tathagata Das Priority: Blocker Fix For: 1.4.0 Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell tot Westerflier updated SPARK-7712: - Description: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still a work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. was: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. 
*Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are
[jira] [Commented] (SPARK-7796) Use the new RPC implementation by default
[ https://issues.apache.org/jira/browse/SPARK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554748#comment-14554748 ] Shixiong Zhu commented on SPARK-7796: - Sorry. Set a wrong field. Use the new RPC implementation by default - Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Priority: Blocker (was: Critical) Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Tao Wang Priority: Blocker Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager :
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554779#comment-14554779 ] Josh Rosen commented on SPARK-7708: --- It doesn't surprise me that Kryo doesn't work for closure serialization, since I don't know that we have any end-to-end tests that run real jobs with the Kryo closure serializer (we do have tests for using it for data serialization, though). I'd really like to fix this, though, so a pull request with a failing regression test and a fix would be very welcome. Perhaps we need to register SerializableBuffer with Kryo or do something similar so that Kryo handles this case properly. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
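The registration-based alternative discussed in the two comments above would look roughly like the following sketch. The wrapper class here is a placeholder, not Spark's SerializableBuffer; only the Kryo registration call itself is the point.

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

// Placeholder class standing in for SerializableBuffer; the real class makes
// itself Java-serializable through custom writeObject/readObject methods.
class JavaSerializableWrapper(val bytes: Array[Byte]) extends Serializable

object KryoRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val kryo = new Kryo()
    // Route just this one class through plain Java serialization inside Kryo;
    // everything else keeps using Kryo's own serializers.
    kryo.register(classOf[JavaSerializableWrapper], new JavaSerializer())
  }
}
{code}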
[jira] [Assigned] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6740: --- Assignee: (was: Apache Spark) SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554834#comment-14554834 ] Apache Spark commented on SPARK-6740: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/6326 SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6740: --- Assignee: Apache Spark SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Apache Spark The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6956) Improve DataFrame API compatibility with Pandas
[ https://issues.apache.org/jira/browse/SPARK-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6956: --- Target Version/s: 1.4.0 (was: 1.5.0) Improve DataFrame API compatibility with Pandas --- Key: SPARK-6956 URL: https://issues.apache.org/jira/browse/SPARK-6956 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.4.0 This is not always possible, but whenever possible we should remove or reduce the differences between Pandas and Spark DataFrames in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7394. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Chen Song Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Chen Song Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554819#comment-14554819 ] Josh Rosen commented on SPARK-7708: --- I can investigate this in more detail later, but is registering SerializableBuffer with JavaSerialization really that expensive? I imagine that the most expensive part of the closure serialization is serializing the closure itself. Once we already have the serialized closure's bytes, I don't think that including them in the TaskDescription should be very expensive because we're only dealing with one or two fields vs. an entire object graph. On the other hand, if it's not much work to implement Kryo serialization methods for that class then maybe that's fine, too. Good find with that Kryo deserialization thread-safety issue; I'd be interested in seeing your fix for this. Feel free to submit a PR for this; I'd be happy to help review. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7724) Add support for Intersect and Except in Catalyst DSL
[ https://issues.apache.org/jira/browse/SPARK-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola reopened SPARK-7724: - Thanks. Here's a PR. Add support for Intersect and Except in Catalyst DSL Key: SPARK-7724 URL: https://issues.apache.org/jira/browse/SPARK-7724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Priority: Trivial Labels: easyfix, starter Catalyst DSL to create logical plans supports most of the current plan, but it is missing Except and Intersect. See LogicalPlanFunctions: https://github.com/apache/spark/blob/6008ec14ed6491d0a854bb50548c46f2f9709269/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L248 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
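For context, the two missing DSL helpers are small. Below is a hedged sketch of what they might look like, modeled on the existing LogicalPlanFunctions style; the wrapper name and exact placement are assumptions, not the merged change.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Except, Intersect, LogicalPlan}

// Sketch only: wrap a plan and expose the two missing set operations.
class SetOperationDsl(logicalPlan: LogicalPlan) {
  def intersect(otherPlan: LogicalPlan): LogicalPlan = Intersect(logicalPlan, otherPlan)
  def except(otherPlan: LogicalPlan): LogicalPlan = Except(logicalPlan, otherPlan)
}
{code}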
[jira] [Updated] (SPARK-7781) GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark
[ https://issues.apache.org/jira/browse/SPARK-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7781: - Target Version/s: 1.5.0 GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark --- Key: SPARK-7781 URL: https://issues.apache.org/jira/browse/SPARK-7781 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Don Drake I'm running Spark v1.3.1 and when I run the following against my dataset: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) The job will fail with the following message: Traceback (most recent call last): File "/Users/drake/fd/spark/mltest.py", line 73, in <module> model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor loss, numIterations, learningRate, maxDepth) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train loss, numIterations, learningRate, maxDepth) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc return callJavaFunc(sc, api, *args) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc return _java2py(sc, func(*args)) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95 py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel. : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895) at scala.Predef$.require(Predef.scala:233) at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138) at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60) at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150) at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63) at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96) at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595) {code} So it's complaining about maxBins; if I provide maxBins=1900 and re-run it: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) Traceback (most recent call last): File "/Users/drake/fd/spark/mltest.py", line 73, in <module> model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) TypeError: trainRegressor() got an unexpected keyword argument 'maxBins' {code} It now says it knows nothing of maxBins. If I run the same command against DecisionTree or RandomForest (with maxBins=1900) it works just fine. Seems like a bug in GradientBoostedTrees. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
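For comparison, the Scala API already reaches maxBins through the boosting strategy, which is the knob the PySpark trainRegressor wrapper does not expose here. The sketch below is illustrative only; the input RDD and the categorical-feature map are placeholders supplied by the caller.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

object GbtMaxBinsSketch {
  // Hedged sketch: set maxBins via the tree strategy in the Scala API.
  def trainWithWideCategoricals(trainingData: RDD[LabeledPoint],
                                catFeatures: Map[Int, Int]): GradientBoostedTreesModel = {
    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    boostingStrategy.numIterations = 3
    boostingStrategy.treeStrategy.maxDepth = 6
    boostingStrategy.treeStrategy.maxBins = 1900 // large enough for the 1895 categories above
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = catFeatures
    GradientBoostedTrees.train(trainingData, boostingStrategy)
  }
}
{code}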
[jira] [Created] (SPARK-7798) Move AkkaRpcEnv to a separate project
Shixiong Zhu created SPARK-7798: --- Summary: Move AkkaRpcEnv to a separate project Key: SPARK-7798 URL: https://issues.apache.org/jira/browse/SPARK-7798 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7796) Use the new RPC implementation by default
[ https://issues.apache.org/jira/browse/SPARK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7796: - Fix Version/s: (was: 1.6.0) [~zsxwing] don't set Fix Version Use the new RPC implementation by default - Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell tot Westerflier updated SPARK-7712: - Description: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; this is still work in progress, and can be found here: I will try to turn this into a PR in the next couple of days. Meanwhile comments, feedback and other discussion are much appreciated. was: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. 
My main goal is/was to address the following issues: - Native Spark SQL: the current implementation relies only on Hive UDAFs. The improved implementation uses Spark SQL aggregates. Hive UDAFs are still supported though. - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. - Increased optimization opportunities. AggregateEvaluation-style optimization should be possible for in-frame processing. Tungsten might also provide interesting optimization opportunities. The current work is available at the following location: https://github.com/hvanhovell/spark-window I will try to turn this into a PR in the next couple of days. Meanwhile comments, feedback and other discussion are much appreciated. Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following
[jira] [Updated] (SPARK-7797) Remove actorSystem from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-7797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-7797: Target Version/s: 1.6.0 Remove actorSystem from SparkEnv -- Key: SPARK-7797 URL: https://issues.apache.org/jira/browse/SPARK-7797 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7800: --- Assignee: (was: Apache Spark) isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554746#comment-14554746 ] Apache Spark commented on SPARK-7800: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6324 isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
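A minimal sketch (not the real Location/BytesToBytesMap code) of why the ordering matters: if the flag is flipped before the precondition check, a failed assertion can leave the slot claiming a key is already defined.

{code}
// Illustration only; names and structure are assumptions.
final class SlotSketch {
  private var isDefined = false

  def putNewKey(writeKeyAndValue: () => Unit): Unit = {
    assert(!isDefined, "Can only set value once for a key")
    writeKeyAndValue()  // perform the actual insert
    isDefined = true    // flip the flag only after the checks and the write succeed
  }
}
{code}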
[jira] [Assigned] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7800: --- Assignee: Apache Spark isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh Assignee: Apache Spark isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Description: I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG 
Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:18:10 DEBUG Datastore: Datastore : read-write 15/05/20 20:18:10 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:18:10 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:18:10 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:18:10 DEBUG Datastore: === 15/05/20 20:18:10 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20
[jira] [Commented] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554802#comment-14554802 ] Apache Spark commented on SPARK-7722: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6325 Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7754: - Component/s: SQL Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Components: SQL Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where the plan type is either LogicalPlan or SparkPlan) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate needed to define rewrite rules and b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule { def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply(plan: LogicalPlan): LogicalPlan = plan transform pf } } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[LogicalPlan, Seq[SparkPlan]]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... 
} {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} // define a PartialFunction[LogicalPlan, LogicalPlan] // the implicit would convert it into a Rule[LogicalPlan] at the use sites val MyRewriteRule = { case ... => ... } {code} *Caveats* Because of the way objects are initialized vs. vals, it might be necessary to reorder declarations so that vals are actually initialized before they are used. E.g.: {code:java} class MyOptimizer extends Optimizer { override val batches: Seq[Batch] = ... Batch("Other rules", FixedPoint(100), MyRewriteRule) // <-- might throw NPE val MyRewriteRule = ... } {code} This is easily fixed by factoring rules out as follows: {code:java} class MyOptimizer extends Optimizer with CustomRules { val MyRewriteRule = ... override val batches:
[jira] [Commented] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554972#comment-14554972 ] Apache Spark commented on SPARK-7737: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6329 parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
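For anyone trying to reproduce this locally, a minimal hedged sketch of the offending layout from the report follows. The paths are illustrative, the part files are assumed to be copied in separately, and {{sqlContext}} is assumed to be a standard SQLContext.
{code}
// Build the directory shape from the report: a partition directory containing
// an empty _temporary directory next to the _SUCCESS marker and part files.
import java.nio.file.{Files, Paths}

val part = Paths.get("/tmp/partitions5k/i=2")        // illustrative path
Files.createDirectories(part)
Files.createDirectories(part.resolve("_temporary"))  // leftover empty dir
Files.createFile(part.resolve("_SUCCESS"))
// part-r-*.gz.parquet files would normally sit here as well.

// Partition discovery over the parent directory then fails with the
// assertion shown above:
// sqlContext.read.parquet("/tmp/partitions5k")
{code}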
[jira] [Assigned] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7737: --- Assignee: Apache Spark (was: Yin Huai) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7737: --- Assignee: Yin Huai (was: Apache Spark) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7616) Column order can be corrupted when saving DataFrame as a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7616. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6285 [https://github.com/apache/spark/pull/6285] Column order can be corrupted when saving DataFrame as a partitioned table -- Key: SPARK-7616 URL: https://issues.apache.org/jira/browse/SPARK-7616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0
When saved as a partitioned table, partition columns of a DataFrame are appended after data columns. However, column names are not adjusted accordingly.
{code}
import sqlContext._
import sqlContext.implicits._

val df = (1 to 3).map(i => i -> i * 2).toDF("a", "b")
df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("a")
  .saveAsTable("t")

table("t").orderBy('a).show()
{code}
Expected output:
{noformat}
+-+-+
|b|a|
+-+-+
|2|1|
|4|2|
|6|3|
+-+-+
{noformat}
Actual output:
{noformat}
+-+-+
|b|a|
+-+-+
|1|2|
|2|4|
|3|6|
+-+-+
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
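Since the fix landed in 1.4.0, a small hedged sanity check against the snippet above can confirm the column mapping on a given build. It reuses the same {{sqlContext}} imports, and the {{getInt}} calls assume the integer types inferred for both columns.
{code}
// Verify that values in columns "a" and "b" are still paired correctly after
// the partitioned save, independent of how the columns are ordered on disk.
val roundTrip = table("t").select("a", "b").collect()
  .map(r => (r.getInt(0), r.getInt(1))).toSet
val expected = (1 to 3).map(i => (i, i * 2)).toSet
assert(roundTrip == expected, s"column mapping corrupted: $roundTrip")
{code}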
[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555062#comment-14555062 ] Jonathan Kelly commented on SPARK-4924: --- I tried upgrading my environment to Spark 1.4.0 and noticed that this JIRA has broken spark-jobserver because compute-classpath.sh has been deleted. spark-jobserver currently uses compute-classpath.sh in its server_start.sh script (see https://github.com/spark-jobserver/spark-jobserver/blob/master/bin/server_start.sh) in order to make sure it includes the same classpath that Spark uses. What should be the alternative now that compute-classpath.sh is gone? Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.4.0 Attachments: spark-launcher.txt
One of the questions we run into rather commonly is "how do I start a Spark application from my Java/Scala program?". There currently isn't a good answer to that:
- Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode)
- Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts
- Calling the shell script directly is doable, but sort of ugly from an API point of view.
I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
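Regarding the question above, the replacement this JIRA introduces is the spark-launcher library rather than a shell script, so tools that sourced compute-classpath.sh can launch through it instead and let spark-submit assemble the classpath. A minimal hedged sketch follows; the Spark home, jar path and main class are placeholders, not values from this issue.
{code}
import org.apache.spark.launcher.SparkLauncher

// Launch an application through the launcher library instead of sourcing
// compute-classpath.sh; the launcher delegates to spark-submit, which now
// computes the classpath internally.
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                    // placeholder path
  .setAppResource("/path/to/job-assembly.jar")   // placeholder jar
  .setMainClass("com.example.MyJob")             // placeholder class
  .setMaster("local[*]")
  .launch()

process.waitFor()
{code}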
[jira] [Commented] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555064#comment-14555064 ] holdenk commented on SPARK-7446: I can do this one :) Inverse transform for StringIndexer --- Key: SPARK-7446 URL: https://issues.apache.org/jira/browse/SPARK-7446 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor It is useful to convert the encoded indices back to their string representation for result inspection. We can add a parameter to StringIndexer/StringIndexerModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
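A hedged sketch of what the requested inverse mapping could look like today, assuming the fitted StringIndexerModel exposes its label array as {{labels}}; the DataFrame {{df}} with a "category" column and the output column names are placeholders, not part of the proposal itself.
{code}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.udf

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val model = indexer.fit(df)        // df is a placeholder DataFrame
val indexed = model.transform(df)

// Map each numeric index back to its original string label.
val toLabel = udf((i: Double) => model.labels(i.toInt))
val restored = indexed.withColumn("categoryLabel", toLabel(indexed("categoryIndex")))
{code}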
[jira] [Created] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly and incorrect field value
Paul Wu created SPARK-7804: -- Summary: Incorrect results from JDBCRDD -- one record repeatedly and incorrect field value Key: SPARK-7804 URL: https://issues.apache.org/jira/browse/SPARK-7804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.3.0 Reporter: Paul Wu
Getting only one record, repeated, in the RDD, and an incorrect field value. I have a table like:
{noformat}
attuid  name  email
12      john  j...@appp.com
23      tom   t...@appp.com
34      tony  t...@appp.com
{noformat}
My code:
{code:java}
JavaSparkContext sc = new JavaSparkContext(sparkConf);
String url = "...";  // connection URL omitted
java.util.Properties prop = new Properties();

List<JDBCPartition> partitionList = new ArrayList<JDBCPartition>();
//int i;
partitionList.add(new JDBCPartition("1=1", 0));

List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("email", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);

JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(),
    JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop),
    schema,
    "USERS",
    new String[]{"attuid", "name", "email"},
    new Filter[]{},
    partitionList.toArray(new JDBCPartition[0])
);

System.out.println("count before to Java RDD=" + jdbcRDD.cache().count());
JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD();
System.out.println("count=" + jrdd.count());
List<Row> lr = jrdd.collect();
for (Row r : lr) {
    for (int ii = 0; ii < r.length(); ii++) {
        System.out.println(r.getString(ii));
    }
}
{code}
=== result is:
{noformat}
34
34
t...@appp.com
34
34
t...@appp.com
34
34
t...@appp.com
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
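The symptom (the last row printed over and over) is the kind of output that can also be produced by a shared mutable Row instance surfacing through cache()/collect(), so a hedged diagnostic is to copy each row before caching. This is a cross-check only, not the confirmed root cause of this issue; {{jdbcRDD}} below stands for the same RDD built in the report, used from Scala.
{code}
// Hedged diagnostic sketch: materialize an immutable copy of every Row before
// caching/collecting, so identical printed rows cannot be explained by a
// reused mutable Row object on the driver side.
val copied = jdbcRDD.map(_.copy()).cache()
copied.collect().foreach { r =>
  println((0 until r.length).map(r.getString).mkString(" | "))
}
{code}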