[jira] [Assigned] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7604: --- Assignee: Apache Spark Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Apache Spark Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7604: --- Assignee: (was: Apache Spark) Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553821#comment-14553821 ] Apache Spark commented on SPARK-7604: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/6315 Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
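For context, here is a minimal sketch of the existing Scala API that this ticket proposes wrapping (assuming Spark 1.4's org.apache.spark.mllib.feature.PCA; the Python wrapper would be expected to mirror this shape, so treat it as illustrative rather than the final PySpark signature):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

// Fit a PCA model keeping the top 2 principal components, then project
// every vector of the input RDD into the reduced 2-dimensional space.
def projectToTwoComponents(sc: SparkContext) = {
  val data = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 9.0)))
  val pcaModel = new PCA(2).fit(data)  // PCA(k) sets the number of components
  pcaModel.transform(data)             // RDD of projected vectors
}
{code}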
[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553865#comment-14553865 ] Reynold Xin commented on SPARK-7322: Also please make sure you add a unit test to JavaDataFrameSuite to make sure this is usable in Java. Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: DataFrame Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
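For readers following the proposal above, here is a hedged sketch of how such a window expression reads from user code, written against the builder-style Window API that eventually shipped (org.apache.spark.sql.expressions.Window in Spark 1.4, where frame boundaries are numeric row offsets rather than a Frame object):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Running average of "sales" per store, ordered by date, over a frame that
// spans the 50 preceding rows up to and including the current row.
def runningAvg(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("store").orderBy("date").rowsBetween(-50, 0)
  df.select(df("store"), df("date"), avg(df("sales")).over(w).as("avg_sales"))
}
{code}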
[jira] [Commented] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553890#comment-14553890 ] Apache Spark commented on SPARK-7717: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6317 Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode on one of my hosts and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly showed the total memory and the total cores available for the cluster. However, on one of the workers, I did a kill -9 on the worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also, the total number of workers is shown as 4, and the total number of cores shown is incorrect; it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7717: --- Assignee: (was: Apache Spark) Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode in one of my host and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly shows that the total memory and the total cores available for the cluster. However on one of the worker, I did a kill -9 worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also the total workers is shown as 4 and the total number of cores shown is incorrect, it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7788: - Target Version/s: 1.4.0 Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7788: - Priority: Blocker (was: Major) Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Priority: Blocker Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7758: --- Assignee: Apache Spark Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Assignee: Apache Spark Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start the thrift server with the metastore set to PostgreSQL, and it shows an error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with an earlier master branch (from April 7th). After printing their debug-level logs, I found the current branch tries to connect with Derby, but I don't know why; maybe the big restructuring in the SQL module caused this issue.
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms
[jira] [Created] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
Tathagata Das created SPARK-7787: Summary: SerializableAWSCredentials in KinesisReceiver cannot be deserialized Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553841#comment-14553841 ] Apache Spark commented on SPARK-7787: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6316 SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7787: --- Assignee: Tathagata Das (was: Apache Spark) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7787: --- Assignee: Apache Spark (was: Tathagata Das) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi edited comment on SPARK-7754 at 5/21/15 8:15 AM: That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log was (Author: evacchi): That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use {{ rule.named(...) { } }}; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]].
* Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} //
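To make the pattern under discussion concrete outside of Catalyst, here is a self-contained illustration that can be pasted into a Scala REPL; the Rule trait and the rule helper below are simplified stand-ins, not Catalyst's actual classes:
{code}
// Simplified stand-ins for Catalyst's Rule/RuleExecutor machinery.
trait Rule[T] {
  def ruleName: String = getClass.getSimpleName
  def apply(plan: T): T
}

object rule {
  // Wrap a PartialFunction into an anonymous Rule.
  def apply[T](pf: PartialFunction[T, T]): Rule[T] = new Rule[T] {
    def apply(plan: T): T = if (pf.isDefinedAt(plan)) pf(plan) else plan
  }
  // Same, but with an explicit name that a RuleExecutor could log.
  def named[T](name: String)(pf: PartialFunction[T, T]): Rule[T] = new Rule[T] {
    override val ruleName: String = name
    def apply(plan: T): T = if (pf.isDefinedAt(plan)) pf(plan) else plan
  }
}

// Usage: a trivial "rewrite rule" over Int "plans".
val doubleEvens = rule.named[Int]("Double even values") {
  case n if n % 2 == 0 => n * 2
}
assert(doubleEvens(4) == 8 && doubleEvens(3) == 3)
assert(doubleEvens.ruleName == "Double even values")
{code}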
[jira] [Created] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
Aniket Bhatnagar created SPARK-7788: --- Summary: Streaming | Kinesis | KinesisReceiver blocks in onStart Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.3.0 Reporter: Aniket Bhatnagar KinesisReceiver calls worker.run(), which is a blocking call (a while loop), as per the source code of the kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in an infinite loop when calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), perhaps because ReceiverTracker is never able to register the receiver (its receiverInfo field is an empty map), causing it to be stuck in an infinite loop while waiting for the running flag to be set to false. Also, we should investigate a way to have the receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-7788: Assignee: Tathagata Das Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores
[ https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7717: --- Assignee: Apache Spark Spark Standalone Web UI showing incorrect total memory, workers and cores - Key: SPARK-7717 URL: https://issues.apache.org/jira/browse/SPARK-7717 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Environment: RedHat Reporter: Swaranga Sarma Assignee: Apache Spark Priority: Minor Labels: web-ui Attachments: JIRA.PNG I launched a Spark master in standalone mode in one of my host and then launched 3 workers on three different hosts. The workers successfully connected to my master and the Web UI showed the correct details. Specifically, the Web UI correctly shows that the total memory and the total cores available for the cluster. However on one of the worker, I did a kill -9 worker process id and restarted the worker again. This time though, the master's Web UI shows incorrect total memory and number of cores. The total memory is shown to be 4*n, where n is the memory in each worker. Also the total workers is shown as 4 and the total number of cores shown is incorrect, it shows 4*c, where c is the number of cores on each worker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7765) Input vector should divide with the norm in Word2Vec's findSynonyms
[ https://issues.apache.org/jira/browse/SPARK-7765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553889#comment-14553889 ] Liang-Chi Hsieh commented on SPARK-7765: I will try to close some obsolete PRs. Some PRs are waiting for related APIs to be ready or for review from others. I will usually update them soon once they get responses. Input vector should divide with the norm in Word2Vec's findSynonyms --- Key: SPARK-7765 URL: https://issues.apache.org/jira/browse/SPARK-7765 Project: Spark Issue Type: Bug Components: MLlib Reporter: Liang-Chi Hsieh In Word2Vec's findSynonyms, the computed cosine similarities should be divided by the norm of the given vector, since it is not normalized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
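A small standalone illustration of the point being made (this is not Word2Vec's actual code): cosine similarity divides the dot product by both vector norms, so if the query vector's norm is left out, every reported similarity is scaled by that norm.
{code}
// Plain cosine similarity between two dense vectors.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)  // dividing by BOTH norms keeps the result in [-1, 1]
}

// Scaling the query vector must not change its similarity to other vectors.
val query  = Array(2.0, 0.0)
val scaled = Array(4.0, 0.0)  // same direction, larger norm
val other  = Array(1.0, 1.0)
assert(math.abs(cosineSimilarity(query, other) - cosineSimilarity(scaled, other)) < 1e-9)
{code}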
[jira] [Commented] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553791#comment-14553791 ] Apache Spark commented on SPARK-7758: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/6314 Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore
[jira] [Assigned] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7758: --- Assignee: (was: Apache Spark) Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: `15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms
[jira] [Created] (SPARK-7786) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext
yangping wu created SPARK-7786: -- Summary: Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext Key: SPARK-7786 URL: https://issues.apache.org/jira/browse/SPARK-7786 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Reporter: yangping wu As mentioned in [SPARK-5411|https://issues.apache.org/jira/browse/SPARK-5411], we can also allow users to register a StreamingListener through SparkConf settings and have it loaded when creating the StreamingContext. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7786) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-7786: --- Priority: Minor (was: Major) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext -- Key: SPARK-7786 URL: https://issues.apache.org/jira/browse/SPARK-7786 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Reporter: yangping wu Priority: Minor As mentioned in [SPARK-5411|https://issues.apache.org/jira/browse/SPARK-5411], We can also allow user to register StreamingListener through SparkConf settings, and loaded when creating StreamingContext, This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
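A hedged sketch of what the proposed usage could look like. The configuration key spark.streaming.extraListeners and the listener class below are assumptions for illustration, mirroring the spark.extraListeners mechanism added by SPARK-5411; they are not an existing API as of 1.3.1:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// A listener that a monitoring framework might ship; under this proposal it
// would be named in the configuration instead of registered programmatically.
class LoggingStreamingListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
    println(s"Batch done, processing delay: ${batch.batchInfo.processingDelay.getOrElse(-1L)} ms")
}

val conf = new SparkConf()
  .setAppName("listener-from-conf-example")
  // Hypothetical key proposed by this ticket (NOT a real setting today):
  .set("spark.streaming.extraListeners", classOf[LoggingStreamingListener].getName)

// Today one would instead call ssc.addStreamingListener(new LoggingStreamingListener)
// after creating the StreamingContext.
{code}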
[jira] [Commented] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi commented on SPARK-7754: --- That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use {{ rule.named(...) { } }}; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case.
* Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} // define a PartialFunction[LogicalPlan, LogicalPlan] // the implicit would convert it into a Rule[LogicalPlan] at the use sites val MyRewriteRule = { case ... => ... } {code} *Caveats* Because of the way objects are initialized vs. vals, it might be necessary to reorder instructions so that vals are actually initialized before they are used. E.g.: {code:java} class MyOptimizer extends
[jira] [Updated] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7322: --- Description: Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} was: Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(): WindowFunctionSpec ... } {code} 2. WindowFunctionSpec: {code} // By default frame = full partition class WindowFunctionSpec { def partitionBy(cols: Column*): WindowFunctionSpec def orderBy(cols: Column*): WindowFunctionSpec // restrict frame beginning from current row - n position def rowsPreceding(n: Int): WindowFunctionSpec // restrict frame ending from current row - n position def rowsFollowing(n: Int): WindowFunctionSpec def rangePreceding(n: Int): WindowFunctionSpec def rowsFollowing(n: Int): WindowFunctionSpec } {code} Here's an example to use it: {code} df.select( df.store, df.date, df.sales, avg(df.sales).over.partitionBy(df.store) .orderBy(df.store) .rowsFollowing(0)// this means from unbounded preceding to current row ) {code} Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: DataFrame Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column: {code} class Column { ... def over(window: Window): Column ... } {code} 2. Window: {code} object Window { def partitionBy(...): Window def orderBy(...): Window object Frame { def unbounded: Frame def preceding(n: Long): Frame def following(n: Long): Frame } class Frame } class Window { def orderBy(...): Window def rowsBetween(Frame, Frame): Window def rangeBetween(Frame, Frame): Window // maybe add this later } {code} Here's an example to use it: {code} df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.unbounded, Frame.currentRow)) ) df.select( avg(“age”).over(Window.partitionBy(“..”, “..”).orderBy(“..”, “..”) .rowsBetween(Frame.preceding(50), Frame.following(10))) ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553849#comment-14553849 ] Edoardo Vacchi edited comment on SPARK-7754 at 5/21/15 8:24 AM: That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log. Also, this could be partially worked around by carefully modularizing rules within container objects/traits was (Author: evacchi): That's a fair observation; on the other hand: 1) is the code concerning rule application expecting throws at all? 2) RuleExecutor traces rules by their ruleName; if you really need named rules, you could use `rule.named(...) { }`; this does not solve the stack trace problem, but you'll still have meaningful information in the log Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where LogicalPlan is either Logical- or Spark-) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate to define rewrite rules b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply (plan: LogicalPlan): LogicalPlan = plan transform pf } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case.
* The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... } {code} *Further possible improvements:*
[jira] [Commented] (SPARK-7788) Streaming | Kinesis | KinesisReceiver blocks in onStart
[ https://issues.apache.org/jira/browse/SPARK-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553893#comment-14553893 ] Tathagata Das commented on SPARK-7788: -- This may be a good catch. I will try to look into it very soon and try to solve it for 1.4 Streaming | Kinesis | KinesisReceiver blocks in onStart --- Key: SPARK-7788 URL: https://issues.apache.org/jira/browse/SPARK-7788 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.3.1 Reporter: Aniket Bhatnagar Assignee: Tathagata Das Priority: Blocker Labels: kinesis KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java. This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false. Also, we should investigate a way to have receiver restart in case of failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
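For readers of this thread, a minimal standalone sketch of the kind of fix being discussed (an assumption about the shape of the change, not the actual KinesisReceiver code): a receiver's onStart must return promptly, so a blocking loop such as the KCL Worker.run() can be pushed onto a daemon thread and stopped from onStop.
{code}
// Generic pattern: run a blocking loop on a background thread so that onStart()
// returns immediately and the ReceiverTracker can register the receiver.
class BackgroundLoop(blockingLoop: Runnable, shutdown: () => Unit) {
  @volatile private var workerThread: Thread = _

  def onStart(): Unit = {
    workerThread = new Thread(blockingLoop, "kinesis-worker")
    workerThread.setDaemon(true)
    workerThread.start()  // returns right away; the loop keeps running in the background
  }

  def onStop(): Unit = {
    shutdown()            // e.g. ask the Kinesis Worker to shut down its loop
    if (workerThread != null) workerThread.join()
  }
}
{code}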
[jira] [Resolved] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag
[ https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7606. Resolution: Fixed Fix Version/s: 1.4.0 Document all PySpark SQL/DataFrame public methods with @since tag - Key: SPARK-7606 URL: https://issues.apache.org/jira/browse/SPARK-7606 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Nicholas Chammas Assignee: Davies Liu Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
Xiangrui Meng created SPARK-7784: Summary: Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553733#comment-14553733 ] Kaveen Raajan commented on SPARK-5389: -- I had the same problem and found the root cause for this issue: JAVA_HOME is not set correctly. Please set this environment variable correctly and make sure your Java path does not contain any spaces. OS - Windows 8 and Windows Server 2008 spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, Windows Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG, spark_bug.png spark-shell.cmd crashes in a DOS prompt on Windows 7 but works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine. Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4\bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6013) Add more Python ML examples for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Sriharsha reassigned SPARK-6013: Assignee: Ram Sriharsha Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Ram Sriharsha Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7784: - Attachment: compat_report.html There are lots of false positives since package private classes/methods are public in Java. But at least there are no false negatives. Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Attachments: compat_report.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7753) Improve kernel density API
[ https://issues.apache.org/jira/browse/SPARK-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7753. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6279 [https://github.com/apache/spark/pull/6279] Improve kernel density API -- Key: SPARK-7753 URL: https://issues.apache.org/jira/browse/SPARK-7753 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Kernel density estimation is provided in many statistics libraries: http://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation. We should make sure that we implement a similar API. The two most important parameters of kernel density estimation are kernel type and bandwidth. Besides density estimation, it is also used for smoothing. The current API is designed only for Gaussian kernel and density estimation: {code} def kernelDensity(samples: RDD[Double], standardDeviation: Double, evaluationPoints: Iterable[Double]): Array[Double] {code} It would be nice if we can come up with an extensible API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
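For context on the "extensible API" point above, here is a hedged sketch of what a builder-style kernel density estimator could look like; the class shape and method names are illustrative assumptions, not necessarily what the resolving pull request implements:

{code}
import org.apache.spark.rdd.RDD

// Illustrative sketch only: a setter-based estimator that could later grow
// additional kernels or bandwidth-selection strategies without breaking callers.
class KernelDensitySketch {
  private var sample: RDD[Double] = _
  private var bandwidth: Double = 1.0

  def setSample(sample: RDD[Double]): this.type = { this.sample = sample; this }
  def setBandwidth(bandwidth: Double): this.type = { this.bandwidth = bandwidth; this }

  // Evaluate a Gaussian-kernel density estimate at the given points.
  def estimate(points: Array[Double]): Array[Double] = {
    val n = sample.count().toDouble
    val h = bandwidth
    val norm = 1.0 / (math.sqrt(2.0 * math.Pi) * h * n)
    sample
      .map(x => points.map(p => math.exp(-0.5 * math.pow((p - x) / h, 2))))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      .map(_ * norm)
  }
}
{code}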
[jira] [Resolved] (SPARK-7784) Check 1.3- 1.4 SQL API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7784. Resolution: Fixed Fix Version/s: 1.4.0 I went through them quickly. Everything looks fine. Check 1.3- 1.4 SQL API compliance using java-compliance-checker Key: SPARK-7784 URL: https://issues.apache.org/jira/browse/SPARK-7784 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Attachments: compat_report.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
Manoj Kumar created SPARK-7785: -- Summary: Add missing items to pyspark.mllib.linalg.Matrices Key: SPARK-7785 URL: https://issues.apache.org/jira/browse/SPARK-7785 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar For DenseMatrices: class methods __str__, transpose; object methods zeros, ones, eye, rand, randn, diag. For SparseMatrices: class methods __str__, transpose; object methods fromCoo, speye, sprand, sprandn, spdiag. Matrices methods: horzcat, vertcat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7785) Add missing items to pyspark.mllib.linalg.Matrices
[ https://issues.apache.org/jira/browse/SPARK-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553769#comment-14553769 ] Manoj Kumar commented on SPARK-7785: ping [~josephkb] Add missing items to pyspark.mllib.linalg.Matrices -- Key: SPARK-7785 URL: https://issues.apache.org/jira/browse/SPARK-7785 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar For DenseMatrices: class methods __str__, transpose; object methods zeros, ones, eye, rand, randn, diag. For SparseMatrices: class methods __str__, transpose; object methods fromCoo, speye, sprand, sprandn, spdiag. Matrices methods: horzcat, vertcat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
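For reference, most of the requested items already exist on the Scala side in org.apache.spark.mllib.linalg.Matrices; a brief usage sketch of the Scala counterparts that the Python API would mirror (the matrix shapes and seeds below are made up for the example):

{code}
import java.util.Random
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

val zeros = Matrices.zeros(2, 3)                    // 2x3 dense matrix of zeros
val ones  = Matrices.ones(2, 3)                     // 2x3 dense matrix of ones
val eye   = Matrices.eye(3)                         // 3x3 identity matrix
val rand  = Matrices.rand(2, 2, new Random(42))     // uniform random entries
val randn = Matrices.randn(2, 2, new Random(42))    // standard normal entries
val diag  = Matrices.diag(Vectors.dense(1.0, 2.0))  // diagonal matrix from a vector

val speye = Matrices.speye(3)                       // sparse identity matrix
val horz  = Matrices.horzcat(Array(zeros, ones))    // concatenate side by side
val vert  = Matrices.vertcat(Array(zeros, ones))    // stack vertically
{code}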
[jira] [Commented] (SPARK-7320) Add rollup and cube support to DataFrame Java/Scala DSL
[ https://issues.apache.org/jira/browse/SPARK-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553696#comment-14553696 ] Apache Spark commented on SPARK-7320: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6312 Add rollup and cube support to DataFrame Java/Scala DSL --- Key: SPARK-7320 URL: https://issues.apache.org/jira/browse/SPARK-7320 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Labels: starter Fix For: 1.4.0 We should add two functions to GroupedData in order to support rollup and cube for the DataFrame DSL. {code} def rollup(): GroupedData def cube(): GroupedData {code} These two should return new GroupedData with the appropriate state set so when we run an Aggregate, we translate the underlying logical operator into Rollup or Cube. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
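As a rough illustration of how such a DSL might be used from user code (whether the entry point ends up on DataFrame or GroupedData is an open detail here; df, the column names, and the aggregates are made up for the example):

{code}
import org.apache.spark.sql.functions._

// Subtotals per (department, group), per department, and a grand total.
df.rollup("department", "group").agg(sum("salary"))

// Aggregates over every grouping-set combination of the two columns.
df.cube("department", "group").agg(avg("salary"))
{code}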
[jira] [Commented] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553703#comment-14553703 ] Xia Hu commented on SPARK-6197: --- OK, got it. Thank you very much! handle json parse exception for eventlog file not finished writing --- Key: SPARK-6197 URL: https://issues.apache.org/jira/browse/SPARK-6197 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Minor Labels: backport-needed Fix For: 1.4.0 This is a follow-up JIRA for [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can display event log files with the suffix *.inprogress*. However, the event log file may not have finished writing in some abnormal cases (e.g. Ctrl+C), in which case the file may be truncated in the last line, leaving that line in invalid JSON format, which will cause a JSON parse exception when reading the file. For this case, we can just ignore the content of the last line, since the history shown on the web for abnormal cases is only a reference for the user: it can show the past status of the app before it terminated abnormally (we cannot guarantee that the history shows exactly the last moment when the app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7394: --- Assignee: Apache Spark Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7394: --- Assignee: (was: Apache Spark) Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553772#comment-14553772 ] Apache Spark commented on SPARK-7394: - User 'kaka1992' has created a pull request for this issue: https://github.com/apache/spark/pull/6313 Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7404) Add RegressionEvaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-7404: Assignee: Xiangrui Meng Add RegressionEvaluator to spark.ml --- Key: SPARK-7404 URL: https://issues.apache.org/jira/browse/SPARK-7404 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This allows users to tune regression models using the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
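A hedged sketch of how such an evaluator might be used for tuning once it exists; the class name, metric name, and trainingDF below are assumptions based on the issue description rather than a confirmed API:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression()
val evaluator = new RegressionEvaluator().setMetricName("rmse")  // assumed metric name

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()

// Pick the regParam that gives the best RMSE via cross-validation.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)  // trainingDF: an assumed DataFrame with "features" and "label" columns
{code}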
[jira] [Resolved] (SPARK-7745) Replace assertions with requires (IllegalArgumentException) and modify other state checks
[ https://issues.apache.org/jira/browse/SPARK-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7745. -- Resolution: Fixed Fix Version/s: 1.4.0 Replace assertions with requires (IllegalArgumentException) and modify other state checks - Key: SPARK-7745 URL: https://issues.apache.org/jira/browse/SPARK-7745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Minor Fix For: 1.4.0 Assertions can be turned off. require throws an IllegalArgumentException, which makes more sense when it's a variable set by the user. Also, some SparkExceptions can be changed to IllegalStateExceptions for the streaming Spark context state checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7745) Replace assertions with requires (IllegalArgumentException) and modify other state checks
[ https://issues.apache.org/jira/browse/SPARK-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7745: - Assignee: Burak Yavuz Replace assertions with requires (IllegalArgumentException) and modify other state checks - Key: SPARK-7745 URL: https://issues.apache.org/jira/browse/SPARK-7745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Minor Fix For: 1.4.0 Assertions can be turned off. require throws an IllegalArgumentException, which makes more sense when it's a variable set by the user. Also, some SparkExceptions can be changed to IllegalStateExceptions for the streaming Spark context state checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
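To make the distinction concrete, a small generic Scala illustration of why require fits user-supplied values better than assert:

{code}
// assert checks internal invariants, throws AssertionError, and can be compiled
// away with -Xdisable-assertions, so it should not guard user input.
def setBatchInterval(ms: Long): Unit = {
  assert(ms > 0, "batch interval must be positive")
}

// require documents a contract on caller-provided arguments, always runs, and
// throws IllegalArgumentException, the conventional signal for a bad argument.
def setBatchIntervalChecked(ms: Long): Unit = {
  require(ms > 0, s"batch interval must be positive, but got $ms")
}
{code}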
[jira] [Updated] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-7789: Description: After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: bq.java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:326)
[jira] [Created] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
meiyoula created SPARK-7789: --- Summary: sql on security hbase:Token generation only allowed for Kerberos authenticated clients Key: SPARK-7789 URL: https://issues.apache.org/jira/browse/SPARK-7789 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: bq. java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at
[jira] [Commented] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553958#comment-14553958 ] meiyoula commented on SPARK-7789: - I use the hive source of https://github.com/pwendell/hive/, it has this bug. But when I merge the HIVE-8874 into the hive source, everything goes to good. [~deanchen]Which hive do you use, do you meet this problem? [~pwendell] I think this is a bug, Can you merge HIVE-8874 into you hive repository? sql on security hbase:Token generation only allowed for Kerberos authenticated clients --- Key: SPARK-7789 URL: https://issues.apache.org/jira/browse/SPARK-7789 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: {quote} java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at 
org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at
[jira] [Updated] (SPARK-7789) sql on security hbase:Token generation only allowed for Kerberos authenticated clients
[ https://issues.apache.org/jira/browse/SPARK-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-7789: Description: After creating a hbase table in beeline, then execute select sql statement, Executor occurs the exception: {quote} java.lang.IllegalStateException: Error while configuring input job properties at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:343) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:279) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:804) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:774) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:300) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Token generation only allowed for Kerberos authenticated clients at org.apache.hadoop.hbase.security.token.TokenProvider.getAuthenticationToken(TokenProvider.java:124) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$1.getAuthenticationToken(AuthenticationProtos.java:4267) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService.callMethod(AuthenticationProtos.java:4387) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7696) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1877) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1859) at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2131) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:326)
[jira] [Created] (SPARK-7796) Use the new RPC implementation by default
Shixiong Zhu created SPARK-7796: --- Summary: Use the new RPC implementation by default Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7722: --- Assignee: Tathagata Das (was: Apache Spark) Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554797#comment-14554797 ] Akshat Aranya edited comment on SPARK-7708 at 5/21/15 6:22 PM: --- I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. This fixes are for 1.2, so I'll see if I can port them to master and write a test for them. was (Author: aaranya): I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
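For illustration only (this is not the actual patch), a minimal sketch of what Kryo-aware serialization for a ByteBuffer wrapper in the spirit of SerializableBuffer could look like, using Kryo's KryoSerializable interface; the class name is hypothetical:

{code}
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Sketch: write the buffer length followed by its bytes so Kryo round-trips the
// contents instead of leaving the wrapped field null after deserialization.
class KryoFriendlyBuffer(@transient var buffer: ByteBuffer)
  extends KryoSerializable with Serializable {

  override def write(kryo: Kryo, out: Output): Unit = {
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes)  // copy without moving the original position
    out.writeInt(bytes.length)
    out.writeBytes(bytes)
  }

  override def read(kryo: Kryo, in: Input): Unit = {
    buffer = ByteBuffer.wrap(in.readBytes(in.readInt()))
  }
}
{code}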
[jira] [Assigned] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7722: --- Assignee: Apache Spark (was: Tathagata Das) Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Apache Spark Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554825#comment-14554825 ] Davies Liu commented on SPARK-6764: --- My first question is: can we use wheel packages just like egg files (without pip on the worker)? Package or download them in the driver, then send them to the workers. Add wheel package support for PySpark - Key: SPARK-6764 URL: https://issues.apache.org/jira/browse/SPARK-6764 Project: Spark Issue Type: Improvement Components: Deploy, PySpark Reporter: Takao Magoori Priority: Minor Labels: newbie We can do _spark-submit_ with one or more Python packages (.egg, .zip and .jar) via the *--py-files* option. h4. zip packaging Spark puts a zip file in its working directory and adds the absolute path to Python's sys.path. When the user program imports it, [zipimport|https://docs.python.org/2.7/library/zipimport.html] is automatically invoked under the hood. That is, data files and dynamic modules (.pyd, .so) cannot be used since zipimport supports only .py, .pyc and .pyo. h4. egg packaging Spark puts an egg file in its working directory and adds the absolute path to Python's sys.path. Unlike zipimport, egg can handle data files and dynamic modules as long as the author of the package uses the [pkg_resources API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] properly. But many Python modules do not use the pkg_resources API, which causes ImportError or "No such file" errors. Moreover, creating eggs for dependencies and their transitive dependencies is a troublesome job. h4. wheel packaging Supporting the new standard Python package format, [wheel|https://wheel.readthedocs.org/en/latest/], would be nice. With wheel, we can do spark-submit with complex dependencies simply as follows. 1. Write a requirements.txt file. {noformat} SQLAlchemy MySQL-python requests simplejson>=3.6.0,<=3.6.5 pydoop {noformat} 2. Do wheel packaging with a single command. All dependencies are wheel-ed. {noformat} $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt {noformat} 3. Do spark-submit {noformat} your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name *.whl -print0 | sed -e 's/\x0/,/g') your_driver.py {noformat} If your pyspark driver is a package that consists of many modules, 1. Write a setup.py for your PySpark driver package. {noformat} from setuptools import ( find_packages, setup, ) setup( name='yourpkg', version='0.0.1', packages=find_packages(), install_requires=[ 'SQLAlchemy', 'MySQL-python', 'requests', 'simplejson>=3.6.0,<=3.6.5', 'pydoop', ], ) {noformat} 2. Do wheel packaging with a single command. Your driver package and all dependencies are wheel-ed. {noformat} your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/. {noformat} 3. Do spark-submit {noformat} your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name *.whl -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7724) Add support for Intersect and Except in Catalyst DSL
[ https://issues.apache.org/jira/browse/SPARK-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554854#comment-14554854 ] Apache Spark commented on SPARK-7724: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/6327 Add support for Intersect and Except in Catalyst DSL Key: SPARK-7724 URL: https://issues.apache.org/jira/browse/SPARK-7724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Priority: Trivial Labels: easyfix, starter The Catalyst DSL for creating logical plans supports most of the current plan operators, but it is missing Except and Intersect. See LogicalPlanFunctions: https://github.com/apache/spark/blob/6008ec14ed6491d0a854bb50548c46f2f9709269/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L248 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
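A hedged sketch of what the two missing helpers could look like, written in the style of the existing DSL implicits; the method placement and the enrichment class name are assumptions, not the submitted change:

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Except, Intersect, LogicalPlan}

// Sketch: enrich LogicalPlan with the two missing set operations,
// mirroring how LogicalPlanFunctions exposes the other operators.
object SetOperationDsl {
  implicit class RichLogicalPlan(logicalPlan: LogicalPlan) {
    def intersect(otherPlan: LogicalPlan): LogicalPlan = Intersect(logicalPlan, otherPlan)
    def except(otherPlan: LogicalPlan): LogicalPlan = Except(logicalPlan, otherPlan)
  }
}
{code}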
[jira] [Commented] (SPARK-7795) Speed up task serialization in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554721#comment-14554721 ] Apache Spark commented on SPARK-7795: - User 'coolfrood' has created a pull request for this issue: https://github.com/apache/spark/pull/6323 Speed up task serialization in standalone mode -- Key: SPARK-7795 URL: https://issues.apache.org/jira/browse/SPARK-7795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Akshat Aranya My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7795) Speed up task serialization in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7795: --- Assignee: (was: Apache Spark) Speed up task serialization in standalone mode -- Key: SPARK-7795 URL: https://issues.apache.org/jira/browse/SPARK-7795 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Akshat Aranya My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
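A rough, generic sketch of the kind of change described (the method names here are illustrative, not the actual scheduler code): hoist a single serializer instance out of the per-task loop instead of creating one per task:

{code}
import java.nio.ByteBuffer
import org.apache.spark.SparkEnv

// Before (illustrative): a fresh SerializerInstance is created for every task.
def serializePerTask(tasks: Seq[AnyRef]): Seq[ByteBuffer] =
  tasks.map { task =>
    val ser = SparkEnv.get.closureSerializer.newInstance()
    ser.serialize(task)
  }

// After (illustrative): reuse one instance for the whole batch. A SerializerInstance
// is not thread-safe, so this only works when the loop runs in a single thread.
def serializeWithSharedInstance(tasks: Seq[AnyRef]): Seq[ByteBuffer] = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  tasks.map(task => ser.serialize(task))
}
{code}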
[jira] [Updated] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7780: - Target Version/s: 1.5.0 The intercept in LogisticRegressionWithLBFGS should not be regularized -- Key: SPARK-7780 URL: https://issues.apache.org/jira/browse/SPARK-7780 Project: Spark Issue Type: Bug Components: MLlib Reporter: DB Tsai The intercept in Logistic Regression represents a prior on the categories and should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
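In symbols, the point is that the penalty should apply to the weight vector only, never the intercept; a sketch of the intended objective (the notation below is standard and not taken from the issue):

{code}
\min_{w,\, b} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i,\; w^\top x_i + b\big) \;+\; \lambda\, R(w)
\qquad \text{with } R(w) = \tfrac{1}{2}\lVert w \rVert_2^2 \text{ (or } \lVert w \rVert_1\text{), applied to } w \text{ only, never to } b.
{code}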
[jira] [Created] (SPARK-7797) Remove actorSystem from SparkEnv
Shixiong Zhu created SPARK-7797: --- Summary: Remove actorSystem from SparkEnv Key: SPARK-7797 URL: https://issues.apache.org/jira/browse/SPARK-7797 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7800) isDefined should not be marked too early in putNewKey
Liang-Chi Hsieh created SPARK-7800: -- Summary: isDefined should not be marked too early in putNewKey Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7737: Assignee: Yin Huai (was: Cheng Lian) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554769#comment-14554769 ] Yin Huai commented on SPARK-7737: - Seems https://github.com/apache/spark/pull/6287 still not fix partition discovery completely. For the case in the description, the following case works {code} load(/partitions5k/i=2/, parquet) {code} However, for this case, we fail... {code} load(/partitions5k/, parquet) {code} parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Affects Version/s: 1.4.0 isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Assignee: Liang-Chi Hsieh isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7800: -- Priority: Minor (was: Major) isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Liang-Chi Hsieh Priority: Minor isDefined is marked as true twice in Location.putNewKey. The first one is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7749) Parquet metastore conversion does not use metastore cache
[ https://issues.apache.org/jira/browse/SPARK-7749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7749. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6287 [https://github.com/apache/spark/pull/6287] Parquet metastore conversion does not use metastore cache - Key: SPARK-7749 URL: https://issues.apache.org/jira/browse/SPARK-7749 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0 Seems https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L462-467 is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7712: Shepherd: Yin Huai Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: Native Spark SQL and Performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own Aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'Native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style Window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required); we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach, relying on the fact that the aggregate operations are commutative; there is one twist, though: it will process the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. 
The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
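To make the running-frame argument concrete, a generic (non-Spark) Scala sketch of the difference described above for UNBOUNDED PRECEDING AND CURRENT ROW frames; the function names are made up for the illustration:

{code}
// Sliding-style evaluation: recompute the frame for every row, roughly O(N^2) work per partition.
def runningSumNaive(partition: IndexedSeq[Double]): IndexedSeq[Double] =
  partition.indices.map(i => partition.slice(0, i + 1).sum)

// Running-style evaluation: one accumulator, one update per row, O(N) work per partition.
def runningSumIncremental(partition: IndexedSeq[Double]): IndexedSeq[Double] = {
  var acc = 0.0
  partition.map { value =>
    acc += value
    acc
  }
}
{code}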
[jira] [Commented] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554753#comment-14554753 ] Herman van Hovell tot Westerflier commented on SPARK-7712: -- I have updated the description to explain the design. Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. 
The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still a work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Target Version/s: 1.4.0 Affects Version/s: 1.4.0 Failed to start thrift server when metastore is PostgreSQL --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Tao Wang Priority: Critical Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start the thrift server with the metastore set to PostgreSQL, and it shows an error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) {code} But it works well with the earlier master branch (from April 7th). After printing their debug-level logs, I found the current branch tries to connect to Derby but didn't know why; maybe the big restructuring in the sql module caused this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore:
[jira] [Closed] (SPARK-7773) Document `spark.dynamicAllocation.initialExecutors`
[ https://issues.apache.org/jira/browse/SPARK-7773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7773. Resolution: Won't Fix Turns out it was already documented? Document `spark.dynamicAllocation.initialExecutors` --- Key: SPARK-7773 URL: https://issues.apache.org/jira/browse/SPARK-7773 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or A small omission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554797#comment-14554797 ] Akshat Aranya commented on SPARK-7708: -- I was able to get this working with a couple of fixes: 1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose. 2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
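As a rough illustration of fix (1) above, a custom Kryo serializer for a ByteBuffer-backed wrapper could look something like the sketch below. BufferWrapper is a hypothetical stand-in for SerializableBuffer, not Spark's actual class, and the field name is an assumption; Kryo's default field serialization does not invoke Java's writeObject/readObject hooks, which is presumably why the wrapped buffer is otherwise lost.

{code}
import java.nio.ByteBuffer

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical wrapper standing in for SerializableBuffer (field name assumed).
class BufferWrapper(var buffer: ByteBuffer)

// Sketch of fix (1): write the buffer length followed by its bytes, so the
// wrapped ByteBuffer actually round-trips through Kryo.
class BufferWrapperSerializer extends Serializer[BufferWrapper] {
  override def write(kryo: Kryo, output: Output, wrapper: BufferWrapper): Unit = {
    val buf = wrapper.buffer.duplicate() // leave the caller's position untouched
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[BufferWrapper]): BufferWrapper = {
    val length = input.readInt()
    new BufferWrapper(ByteBuffer.wrap(input.readBytes(length)))
  }
}
{code}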
[jira] [Updated] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7787: - Reporter: Chris Fregly (was: Tathagata Das) SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Chris Fregly Assignee: Tathagata Das Priority: Blocker Fix For: 1.4.0 Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7787) SerializableAWSCredentials in KinesisReceiver cannot be deserialized
[ https://issues.apache.org/jira/browse/SPARK-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7787. -- Resolution: Fixed Fix Version/s: 1.4.0 SerializableAWSCredentials in KinesisReceiver cannot be deserialized - Key: SPARK-7787 URL: https://issues.apache.org/jira/browse/SPARK-7787 Project: Spark Issue Type: Bug Components: Streaming Reporter: Chris Fregly Assignee: Tathagata Das Priority: Blocker Fix For: 1.4.0 Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell tot Westerflier updated SPARK-7712: - Description: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; it is still a work in progress and can be found here: https://github.com/apache/spark/pull/6278 Comments, feedback and other discussion are much appreciated. was: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. 
*Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are
[jira] [Commented] (SPARK-7796) Use the new RPC implementation by default
[ https://issues.apache.org/jira/browse/SPARK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554748#comment-14554748 ] Shixiong Zhu commented on SPARK-7796: - Sorry. Set a wrong field. Use the new RPC implementation by default - Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Priority: Blocker (was: Critical) Failed to start thrift server when metastore is postgre sql --- Key: SPARK-7758 URL: https://issues.apache.org/jira/browse/SPARK-7758 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Tao Wang Priority: Blocker Attachments: hive-site.xml, with error.log, with no error.log I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. 
The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager :
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554779#comment-14554779 ] Josh Rosen commented on SPARK-7708: --- It doesn't surprise me that Kryo doesn't work for closure serialization, since I don't know that we have any end-to-end tests that run real jobs with the Kryo closure serializer (we do have tests for using it for data serialization, though). I'd really like to fix this, though, so a pull request with a failing regression test and a fix would be very welcome. Perhaps we need to register SerializableBuffer with Kryo or do something similar so that Kryo handles this case properly. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
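The registration-based alternative discussed in the two comments above would look roughly like the following sketch. The wrapper class here is a placeholder, not Spark's SerializableBuffer; only the Kryo registration call itself is the point.

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

// Placeholder class standing in for SerializableBuffer; the real class makes
// itself Java-serializable through custom writeObject/readObject methods.
class JavaSerializableWrapper(val bytes: Array[Byte]) extends Serializable

object KryoRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val kryo = new Kryo()
    // Route just this one class through plain Java serialization inside Kryo;
    // everything else keeps using Kryo's own serializers.
    kryo.register(classOf[JavaSerializableWrapper], new JavaSerializer())
  }
}
{code}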
[jira] [Assigned] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6740: --- Assignee: (was: Apache Spark) SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554834#comment-14554834 ] Apache Spark commented on SPARK-6740: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/6326 SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6740) SQL operator and condition precedence is not honoured
[ https://issues.apache.org/jira/browse/SPARK-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6740: --- Assignee: Apache Spark SQL operator and condition precedence is not honoured - Key: SPARK-6740 URL: https://issues.apache.org/jira/browse/SPARK-6740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Apache Spark The following query from the SQL Logic Test suite fails to parse: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL while the following (equivalent) does parse correctly: SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL) SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6956) Improve DataFrame API compatibility with Pandas
[ https://issues.apache.org/jira/browse/SPARK-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6956: --- Target Version/s: 1.4.0 (was: 1.5.0) Improve DataFrame API compatibility with Pandas --- Key: SPARK-6956 URL: https://issues.apache.org/jira/browse/SPARK-6956 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.4.0 This is not always possible, but whenever possible we should remove or reduce the differences between Pandas and Spark DataFrames in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7394. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Chen Song Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Chen Song Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554819#comment-14554819 ] Josh Rosen commented on SPARK-7708: --- I can investigate this in more detail later, but is registering SerializableBuffer with JavaSerialization really that expensive? I imagine that the most expensive part of the closure serialization is serializing the closure itself. Once we already have the serialized closure's bytes, I don't think that including them in the TaskDescription should be very expensive because we're only dealing with one or two fields vs. an entire object graph. On the other hand, if it's not much work to implement Kryo serialization methods for that class then maybe that's fine, too. Good find with that Kryo deserialization thread-safety issue; I'd be interested in seeing your fix for this. Feel free to submit a PR for this; I'd be happy to help review. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7724) Add support for Intersect and Except in Catalyst DSL
[ https://issues.apache.org/jira/browse/SPARK-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola reopened SPARK-7724: - Thanks. Here's a PR. Add support for Intersect and Except in Catalyst DSL Key: SPARK-7724 URL: https://issues.apache.org/jira/browse/SPARK-7724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Priority: Trivial Labels: easyfix, starter Catalyst DSL to create logical plans supports most of the current plan, but it is missing Except and Intersect. See LogicalPlanFunctions: https://github.com/apache/spark/blob/6008ec14ed6491d0a854bb50548c46f2f9709269/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L248 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
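For context, the two missing DSL helpers are small. Below is a hedged sketch of what they might look like, modeled on the existing LogicalPlanFunctions style; the wrapper name and exact placement are assumptions, not the merged change.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Except, Intersect, LogicalPlan}

// Sketch only: wrap a plan and expose the two missing set operations.
class SetOperationDsl(logicalPlan: LogicalPlan) {
  def intersect(otherPlan: LogicalPlan): LogicalPlan = Intersect(logicalPlan, otherPlan)
  def except(otherPlan: LogicalPlan): LogicalPlan = Except(logicalPlan, otherPlan)
}
{code}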
[jira] [Updated] (SPARK-7781) GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark
[ https://issues.apache.org/jira/browse/SPARK-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7781: - Target Version/s: 1.5.0 GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark --- Key: SPARK-7781 URL: https://issues.apache.org/jira/browse/SPARK-7781 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Don Drake I'm running Spark v1.3.1 and when I run the following against my dataset: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) The job will fail with the following message: Traceback (most recent call last): File "/Users/drake/fd/spark/mltest.py", line 73, in <module> model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor loss, numIterations, learningRate, maxDepth) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train loss, numIterations, learningRate, maxDepth) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc return callJavaFunc(sc, api, *args) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc return _java2py(sc, func(*args)) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95 py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel. : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895) at scala.Predef$.require(Predef.scala:233) at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138) at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60) at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150) at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63) at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96) at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595) {code} So it's complaining about maxBins; if I provide maxBins=1900 and re-run it: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) Traceback (most recent call last): File "/Users/drake/fd/spark/mltest.py", line 73, in <module> model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) TypeError: trainRegressor() got an unexpected keyword argument 'maxBins' {code} It now says it knows nothing of maxBins. If I run the same command against DecisionTree or RandomForest (with maxBins=1900) it works just fine. Seems like a bug in GradientBoostedTrees. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
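For comparison, the Scala API already reaches maxBins through the boosting strategy, which is the knob the PySpark trainRegressor wrapper does not expose here. The sketch below is illustrative only; the input RDD and the categorical-feature map are placeholders supplied by the caller.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

object GbtMaxBinsSketch {
  // Hedged sketch: set maxBins via the tree strategy in the Scala API.
  def trainWithWideCategoricals(trainingData: RDD[LabeledPoint],
                                catFeatures: Map[Int, Int]): GradientBoostedTreesModel = {
    val boostingStrategy = BoostingStrategy.defaultParams("Regression")
    boostingStrategy.numIterations = 3
    boostingStrategy.treeStrategy.maxDepth = 6
    boostingStrategy.treeStrategy.maxBins = 1900 // large enough for the 1895 categories above
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = catFeatures
    GradientBoostedTrees.train(trainingData, boostingStrategy)
  }
}
{code}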
[jira] [Created] (SPARK-7798) Move AkkaRpcEnv to a separate project
Shixiong Zhu created SPARK-7798: --- Summary: Move AkkaRpcEnv to a separate project Key: SPARK-7798 URL: https://issues.apache.org/jira/browse/SPARK-7798 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7796) Use the new RPC implementation by default
[ https://issues.apache.org/jira/browse/SPARK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7796: - Fix Version/s: (was: 1.6.0) [~zsxwing] don't set Fix Version Use the new RPC implementation by default - Key: SPARK-7796 URL: https://issues.apache.org/jira/browse/SPARK-7796 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7712) Native Spark Window Functions Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell tot Westerflier updated SPARK-7712: - Description: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: native Spark SQL and performance. *Native Spark SQL* The current implementation uses Hive UDAFs as its aggregation mechanism. We try to address the following issues by moving to a more 'native' Spark SQL approach: - Window functions require Hive. Some people (mostly by accident) use Spark SQL without Hive. Usage of UDAFs is still supported though. - Adding your own aggregates requires you to write them in Hive instead of native Spark SQL. - Hive UDAFs are very well written and quite quick, but they are opaque in processing and memory management; this makes them hard to optimize. By using 'native' Spark SQL constructs we can actually do a lot more optimization, for example AggregateEvaluation-style window processing (this would require us to move some of the code out of the AggregateEvaluation class into some common base class), or Tungsten-style memory management. *Performance* - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are commutative; there is one twist though: it processes the input buffer in reverse. - Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. - A single Window node is able to process all types of frames for the same partitioning/ordering. This saves a little time/memory spent buffering and managing partitions. - A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. The original work, including some benchmarking code for the running case, can be found here: https://github.com/hvanhovell/spark-window A PR has been created; this is still work in progress, and can be found here: I will try to turn this into a PR in the next couple of days. Meanwhile comments, feedback and other discussion are much appreciated. was: Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. 
My main goal is/was to address the following issues: - Native Spark SQL: the current implementation relies only on Hive UDAFs. The improved implementation uses Spark SQL aggregates. Hive UDAFs are still supported though. - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. - Increased optimization opportunities. AggregateEvaluation-style optimization should be possible for in-frame processing. Tungsten might also provide interesting optimization opportunities. The current work is available at the following location: https://github.com/hvanhovell/spark-window I will try to turn this into a PR in the next couple of days. Meanwhile comments, feedback and other discussion are much appreciated. Native Spark Window Functions Performance Improvements - Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Original Estimate: 336h Remaining Estimate: 336h Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following
[jira] [Updated] (SPARK-7797) Remove actorSystem from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-7797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-7797: Target Version/s: 1.6.0 Remove actorSystem from SparkEnv -- Key: SPARK-7797 URL: https://issues.apache.org/jira/browse/SPARK-7797 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7800: --- Assignee: (was: Apache Spark) isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554746#comment-14554746 ] Apache Spark commented on SPARK-7800: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6324 isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
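A minimal sketch (not the real Location/BytesToBytesMap code) of why the ordering matters: if the flag is flipped before the precondition check, a failed assertion can leave the slot claiming a key is already defined.

{code}
// Illustration only; names and structure are assumptions.
final class SlotSketch {
  private var isDefined = false

  def putNewKey(writeKeyAndValue: () => Unit): Unit = {
    assert(!isDefined, "Can only set value once for a key")
    writeKeyAndValue()  // perform the actual insert
    isDefined = true    // flip the flag only after the checks and the write succeed
  }
}
{code}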
[jira] [Assigned] (SPARK-7800) isDefined should not be marked too early in putNewKey
[ https://issues.apache.org/jira/browse/SPARK-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7800: --- Assignee: Apache Spark isDefined should not be marked too early in putNewKey -- Key: SPARK-7800 URL: https://issues.apache.org/jira/browse/SPARK-7800 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Liang-Chi Hsieh Assignee: Apache Spark isDefined is marked as true twice in Location.putNewKey. The first assignment is unnecessary and can cause problems because it happens too early, before some assert checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7758) Failed to start thrift server when metastore is postgre sql
[ https://issues.apache.org/jira/browse/SPARK-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7758: Description: I am using today's master branch to start thrift server with setting metastore to postgre sql, and it shows error like: {code} 15/05/20 20:43:57 DEBUG Schema: DROP TABLE DELETEME1432125837197 CASCADE 15/05/20 20:43:57 ERROR Datastore: Error thrown executing DROP TABLE DELETEME1432125837197 CASCADE : Syntax error: Encountered CASCADE at line 1, column 34. java.sql.SQLSyntaxErrorException: Syntax error: Encountered CASCADE at line 1, column 34. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264) at org.datanucleus.store.rdbms.datasource.dbcp.DelegatingStatement.execute(DelegatingStatement.java:264)` But it works well with earlier master branch (on 7th, April). After printing their debug level log, I found current branch tries to connect with derby but didn't know why, maybe the big reconstructure in sql module cause this issue. The Datastore shows in current branch: 15/05/20 20:43:57 DEBUG Datastore: === Datastore = 15/05/20 20:43:57 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:43:57 DEBUG Datastore: Datastore : read-write 15/05/20 20:43:57 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:43:57 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:43:57 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:43:57 DEBUG Datastore: === 15/05/20 20:43:57 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20 20:43:57 DEBUG Datastore: Datastore : name=Apache Derby version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Datastore Driver : name=Apache Derby Embedded JDBC Driver version=10.10.1.1 - (1458268) 15/05/20 20:43:57 DEBUG Datastore: Primary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Secondary Connection Factory : URL[jdbc:derby:;databaseName=/tmp/spark-8b38e943-01e5-4341-9c92-7c250f2dec96/metastore;create=true] 15/05/20 20:43:57 DEBUG Datastore: Datastore Identifiers : factory=datanucleus1 case=UPPERCASE catalog= schema=SPARK 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Cases : MixedCase UPPERCASE MixedCase-Sensitive 15/05/20 20:43:57 DEBUG Datastore: Supported Identifier Lengths (max) : Table=128 Column=128 Constraint=128 Index=128 Delimiter= 15/05/20 20:43:57 DEBUG Datastore: Support for Identifiers in DDL : catalog=false schema=true 15/05/20 20:43:57 DEBUG Datastore: Datastore : checkTableViewExistence, rdbmsConstraintCreateMode=DataNucleus, initialiseColumnInfo=ALL 15/05/20 20:43:57 DEBUG Datastore: Support Statement Batching : yes (max-batch-size=50) 15/05/20 20:43:57 DEBUG 
Datastore: Queries : Results direction=forward, type=forward-only, concurrency=read-only 15/05/20 20:43:57 DEBUG Datastore: Java-Types : string-default-length=255 15/05/20 20:43:57 DEBUG Datastore: JDBC-Types : [id=2009], BLOB, CLOB, TIME, DATE, BOOLEAN, VARCHAR, DECIMAL, NUMERIC, CHAR, BINARY, FLOAT, LONGVARBINARY, VARBINARY, JAVA_OBJECT 15/05/20 20:43:57 DEBUG Datastore: === The Datastore in earlier master branch: 15/05/20 20:18:10 DEBUG Datastore: === Datastore = 15/05/20 20:18:10 DEBUG Datastore: StoreManager : rdbms (org.datanucleus.store.rdbms.RDBMSStoreManager) 15/05/20 20:18:10 DEBUG Datastore: Datastore : read-write 15/05/20 20:18:10 DEBUG Datastore: Schema Control : AutoCreate(None), Validate(None) 15/05/20 20:18:10 DEBUG Datastore: Query Languages : [JDOQL, JPQL, SQL, STOREDPROC] 15/05/20 20:18:10 DEBUG Datastore: Queries : Timeout=0 15/05/20 20:18:10 DEBUG Datastore: === 15/05/20 20:18:10 DEBUG Datastore: Datastore Adapter : org.datanucleus.store.rdbms.adapter.PostgreSQLAdapter 15/05/20
[jira] [Commented] (SPARK-7722) Style checks do not run for Kinesis on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554802#comment-14554802 ] Apache Spark commented on SPARK-7722: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6325 Style checks do not run for Kinesis on Jenkins -- Key: SPARK-7722 URL: https://issues.apache.org/jira/browse/SPARK-7722 Project: Spark Issue Type: Bug Components: Project Infra, Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical This caused the release build to fail late in the game. We should make sure jenkins is proactively checking it: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commitdiff;h=23cf897112624ece19a3b5e5394cdf71b9c3c8b3;hp=9ebb44f8abb1a13f045eed60190954db904ffef7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7754) Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7754: - Component/s: SQL Use PartialFunction literals instead of objects in Catalyst --- Key: SPARK-7754 URL: https://issues.apache.org/jira/browse/SPARK-7754 Project: Spark Issue Type: Improvement Components: SQL Reporter: Edoardo Vacchi Catalyst rules extend two distinct rule types: {{Rule[LogicalPlan]}} and {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). The distinction is fairly subtle: in the end, both rule types are supposed to define a method {{apply(plan: LogicalPlan)}} (where the plan type is either LogicalPlan or SparkPlan) which returns a transformed plan (or a sequence thereof, in the case of Strategy). Ceremonies aside, the body of such a method is always of the kind: {code:java} def apply(plan: PlanType) = plan match pf {code} where `pf` would be some `PartialFunction` of the PlanType: {code:java} val pf = { case ... => ... } {code} This JIRA is a proposal to introduce utility methods to a) reduce the boilerplate needed to define rewrite rules and b) turn them back into what they essentially represent: function types. These changes would be backwards compatible, and would greatly help in understanding what the code does. Current use of objects is redundant and possibly confusing. *{{Rule[LogicalPlan]}}* a) Introduce the utility object {code:java} object rule { def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform pf } def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] = new Rule[LogicalPlan] { override val ruleName = name def apply(plan: LogicalPlan): LogicalPlan = plan transform pf } } {code} b) progressively replace the boilerplate-y object definitions; e.g. {code:java} object MyRewriteRule extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case ... => ... } } {code} with {code:java} // define a Rule[LogicalPlan] val MyRewriteRule = rule { case ... => ... } {code} and/or: {code:java} // define a named Rule[LogicalPlan] val MyRewriteRule = rule.named("My rewrite rule") { case ... => ... } {code} *Strategies* A similar solution could be applied to shorten the code for Strategies, which are total functions only because they are all supposed to manage the default case, possibly returning `Nil`. In this case we might introduce the following utility: {code:java} object strategy { /** * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan]. * The partial function must therefore return *one single* SparkPlan for each case. * The method will automatically wrap them in a [[Seq]]. * Unhandled cases will automatically return Seq.empty */ def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty } /** * Generate a Strategy from a PartialFunction[LogicalPlan, Seq[SparkPlan]]. * The partial function must therefore return a Seq[SparkPlan] for each case. * Unhandled cases will automatically return Seq.empty */ def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy = new Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan] } } {code} Usage: {code:java} val mystrategy = strategy { case ... => ... } val seqstrategy = strategy.seq { case ... => ... 
} {code} *Further possible improvements:* Making the utility methods `implicit`, thereby further reducing the rewrite rules to: {code:java} // define a PartialFunction[LogicalPlan, LogicalPlan] // the implicit would convert it into a Rule[LogicalPlan] at the use sites val MyRewriteRule = { case ... => ... } {code} *Caveats* Because of the way objects are initialized vs. vals, it might be necessary to reorder declarations so that vals are actually initialized before they are used. E.g.: {code:java} class MyOptimizer extends Optimizer { override val batches: Seq[Batch] = ... Batch("Other rules", FixedPoint(100), MyRewriteRule) // <-- might throw NPE val MyRewriteRule = ... } {code} This is easily fixed by factoring rules out as follows: {code:java} class MyOptimizer extends Optimizer with CustomRules { val MyRewriteRule = ... override val batches:
[jira] [Commented] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554972#comment-14554972 ] Apache Spark commented on SPARK-7737: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6329 parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
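For anyone trying to reproduce this locally, a minimal hedged sketch of the offending layout from the report follows. The paths are illustrative, the part files are assumed to be copied in separately, and {{sqlContext}} is assumed to be a standard SQLContext.
{code}
// Build the directory shape from the report: a partition directory containing
// an empty _temporary directory next to the _SUCCESS marker and part files.
import java.nio.file.{Files, Paths}

val part = Paths.get("/tmp/partitions5k/i=2")        // illustrative path
Files.createDirectories(part)
Files.createDirectories(part.resolve("_temporary"))  // leftover empty dir
Files.createFile(part.resolve("_SUCCESS"))
// part-r-*.gz.parquet files would normally sit here as well.

// Partition discovery over the parent directory then fails with the
// assertion shown above:
// sqlContext.read.parquet("/tmp/partitions5k")
{code}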
[jira] [Assigned] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7737: --- Assignee: Apache Spark (was: Yin Huai) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7737) parquet schema discovery should not fail because of empty _temporary dir
[ https://issues.apache.org/jira/browse/SPARK-7737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7737: --- Assignee: Yin Huai (was: Apache Spark) parquet schema discovery should not fail because of empty _temporary dir - Key: SPARK-7737 URL: https://issues.apache.org/jira/browse/SPARK-7737 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Parquet schema discovery will fail when the dir is like {code} /partitions5k/i=2/_SUCCESS /partitions5k/i=2/_temporary/ /partitions5k/i=2/part-r-1.gz.parquet /partitions5k/i=2/part-r-2.gz.parquet /partitions5k/i=2/part-r-3.gz.parquet /partitions5k/i=2/part-r-4.gz.parquet {code} {code} java.lang.AssertionError: assertion failed: Conflicting partition column names detected: at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:159) at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:71) at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:468) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:424) at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:423) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:422) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:482) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:480) at org.apache.spark.sql.sources.LogicalRelation.init(LogicalRelation.scala:30) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:134) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:118) at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1135) {code} 1.3 works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7616) Column order can be corrupted when saving DataFrame as a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7616. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6285 [https://github.com/apache/spark/pull/6285] Column order can be corrupted when saving DataFrame as a partitioned table -- Key: SPARK-7616 URL: https://issues.apache.org/jira/browse/SPARK-7616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0
When saved as a partitioned table, partition columns of a DataFrame are appended after data columns. However, column names are not adjusted accordingly.
{code}
import sqlContext._
import sqlContext.implicits._

val df = (1 to 3).map(i => i -> i * 2).toDF("a", "b")
df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("a")
  .saveAsTable("t")

table("t").orderBy('a).show()
{code}
Expected output:
{noformat}
+-+-+
|b|a|
+-+-+
|2|1|
|4|2|
|6|3|
+-+-+
{noformat}
Actual output:
{noformat}
+-+-+
|b|a|
+-+-+
|1|2|
|2|4|
|3|6|
+-+-+
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
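Since the fix landed in 1.4.0, a small hedged sanity check against the snippet above can confirm the column mapping on a given build. It reuses the same {{sqlContext}} imports, and the {{getInt}} calls assume the integer types inferred for both columns.
{code}
// Verify that values in columns "a" and "b" are still paired correctly after
// the partitioned save, independent of how the columns are ordered on disk.
val roundTrip = table("t").select("a", "b").collect()
  .map(r => (r.getInt(0), r.getInt(1))).toSet
val expected = (1 to 3).map(i => (i, i * 2)).toSet
assert(roundTrip == expected, s"column mapping corrupted: $roundTrip")
{code}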
[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555062#comment-14555062 ] Jonathan Kelly commented on SPARK-4924: --- I tried upgrading my environment to Spark 1.4.0 and noticed that this JIRA has broken spark-jobserver because compute-classpath.sh has been deleted. spark-jobserver currently uses compute-classpath.sh in its server_start.sh script (see https://github.com/spark-jobserver/spark-jobserver/blob/master/bin/server_start.sh) in order to make sure it includes the same classpath that Spark uses. What should be the alternative now that compute-classpath.sh is gone? Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.4.0 Attachments: spark-launcher.txt
One of the questions we run into rather commonly is "how do I start a Spark application from my Java/Scala program?". There currently isn't a good answer to that:
- Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode)
- Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts
- Calling the shell script directly is doable, but sort of ugly from an API point of view.
I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
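Regarding the question above, the replacement this JIRA introduces is the spark-launcher library rather than a shell script, so tools that sourced compute-classpath.sh can launch through it instead and let spark-submit assemble the classpath. A minimal hedged sketch follows; the Spark home, jar path and main class are placeholders, not values from this issue.
{code}
import org.apache.spark.launcher.SparkLauncher

// Launch an application through the launcher library instead of sourcing
// compute-classpath.sh; the launcher delegates to spark-submit, which now
// computes the classpath internally.
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                    // placeholder path
  .setAppResource("/path/to/job-assembly.jar")   // placeholder jar
  .setMainClass("com.example.MyJob")             // placeholder class
  .setMaster("local[*]")
  .launch()

process.waitFor()
{code}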
[jira] [Commented] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555064#comment-14555064 ] holdenk commented on SPARK-7446: I can do this one :) Inverse transform for StringIndexer --- Key: SPARK-7446 URL: https://issues.apache.org/jira/browse/SPARK-7446 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor It is useful to convert the encoded indices back to their string representation for result inspection. We can add a parameter to StringIndexer/StringIndexerModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
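A hedged sketch of what the requested inverse mapping could look like today, assuming the fitted StringIndexerModel exposes its label array as {{labels}}; the DataFrame {{df}} with a "category" column and the output column names are placeholders, not part of the proposal itself.
{code}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.udf

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val model = indexer.fit(df)        // df is a placeholder DataFrame
val indexed = model.transform(df)

// Map each numeric index back to its original string label.
val toLabel = udf((i: Double) => model.labels(i.toInt))
val restored = indexed.withColumn("categoryLabel", toLabel(indexed("categoryIndex")))
{code}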
[jira] [Created] (SPARK-7804) Incorrect results from JDBCRDD -- one record repeatedly and incorrect field value
Paul Wu created SPARK-7804: -- Summary: Incorrect results from JDBCRDD -- one record repeatedly and incorrect field value Key: SPARK-7804 URL: https://issues.apache.org/jira/browse/SPARK-7804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.3.0 Reporter: Paul Wu
Getting only one record, repeated, in the RDD, and an incorrect field value. I have a table like:
{noformat}
attuid  name  email
12      john  j...@appp.com
23      tom   t...@appp.com
34      tony  t...@appp.com
{noformat}
My code:
{code:java}
JavaSparkContext sc = new JavaSparkContext(sparkConf);
String url = "...";  // connection URL omitted
java.util.Properties prop = new Properties();

List<JDBCPartition> partitionList = new ArrayList<JDBCPartition>();
//int i;
partitionList.add(new JDBCPartition("1=1", 0));

List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("email", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);

JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(),
    JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop),
    schema,
    "USERS",
    new String[]{"attuid", "name", "email"},
    new Filter[]{},
    partitionList.toArray(new JDBCPartition[0])
);

System.out.println("count before to Java RDD=" + jdbcRDD.cache().count());
JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD();
System.out.println("count=" + jrdd.count());
List<Row> lr = jrdd.collect();
for (Row r : lr) {
    for (int ii = 0; ii < r.length(); ii++) {
        System.out.println(r.getString(ii));
    }
}
{code}
=== result is:
{noformat}
34
34
t...@appp.com
34
34
t...@appp.com
34
34
t...@appp.com
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
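The symptom (the last row printed over and over) is the kind of output that can also be produced by a shared mutable Row instance surfacing through cache()/collect(), so a hedged diagnostic is to copy each row before caching. This is a cross-check only, not the confirmed root cause of this issue; {{jdbcRDD}} below stands for the same RDD built in the report, used from Scala.
{code}
// Hedged diagnostic sketch: materialize an immutable copy of every Row before
// caching/collecting, so identical printed rows cannot be explained by a
// reused mutable Row object on the driver side.
val copied = jdbcRDD.map(_.copy()).cache()
copied.collect().foreach { r =>
  println((0 until r.length).map(r.getString).mkString(" | "))
}
{code}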