[jira] [Resolved] (SPARK-15529) Replace SQLContext and HiveContext with SparkSession in Test

2016-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15529.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Replace SQLContext and HiveContext with SparkSession in Test
> 
>
> Key: SPARK-15529
> URL: https://issues.apache.org/jira/browse/SPARK-15529
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Use the latest SparkSession to replace the existing SQLContext and 
> HiveContext in test cases.
> No change will be made in the following suites:
> {{listTablesSuite}} tests the APIs of {{SQLContext}}.
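
As a hedged illustration of the migration described above (not the actual test code; names are placeholders), a suite built against Spark 2.0 can obtain a SparkSession directly instead of constructing a SQLContext or HiveContext:

{code}
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a local test environment.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sparksession-in-tests")
  .getOrCreate()

// DataFrame/SQL operations now go through the session.
val df = spark.range(3).toDF("id")
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()
{code}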



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303528#comment-15303528
 ] 

Reynold Xin commented on SPARK-15598:
-

Yup I think we'd need to refactor the internals.


> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303523#comment-15303523
 ] 

Shivaram Venkataraman commented on SPARK-15585:
---

I am not sure I completely understand the question. The way the options get 
passed in from R [1] is that we create a hash map and fill it in with anything 
passed in by the user. `NULL` is a reserved keyword in R (note that it's in 
all caps), and it gets deserialized / passed as `null` to Scala.

[1] 
https://github.com/apache/spark/blob/c82883239eadc4615a3aba907cd4633cb7aed26e/R/pkg/R/SQLContext.R#L658

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then have 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than substituting the default value.
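
A hypothetical Scala sketch of the option-lookup behavior described above; {{getChar}} here is illustrative only and not the actual CSVOptions method:

{code}
// Hypothetical helper: an absent option stays null instead of being silently
// replaced by a default, leaving the default to the caller's signature.
def getChar(parameters: Map[String, String], name: String): java.lang.Character = {
  parameters.get(name) match {
    case None => null
    case Some(value) if value.length == 1 => value.charAt(0)
    case Some(value) => throw new RuntimeException(s"$name cannot be more than one character")
  }
}
{code}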



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15575) Remove breeze from dependencies?

2016-05-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303465#comment-15303465
 ] 

Yanbo Liang commented on SPARK-15575:
-

I'm interested in this and would like to take a look.

> Remove breeze from dependencies?
> 
>
> Key: SPARK-15575
> URL: https://issues.apache.org/jira/browse/SPARK-15575
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing whether we should remove Breeze from the 
> dependencies of MLlib.  The main issues with Breeze are Scala 2.12 support 
> and performance issues.
> There are a few paths:
> # Keep dependency.  This could be OK, especially if the Scala version issues 
> are fixed within Breeze.
> # Remove dependency
> ## Implement our own linear algebra operators as needed
> ## Design a way to build Spark using custom linalg libraries of the user's 
> choice.  E.g., you could build MLlib using Breeze, or any other library 
> supporting the required operations.  This might require significant work.  
> See [SPARK-6442] for related discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15564) App name is the main class name in Spark streaming jobs

2016-05-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303442#comment-15303442
 ] 

Saisai Shao commented on SPARK-15564:
-

According to your description, I guess you're running the streaming application 
in yarn cluster mode?

If so, you need to set the application name through {{--name}}, or set 
{{spark.app.name}} in the conf file / via {{--conf}}. In yarn cluster mode the 
yarn client starts before the driver, and it sets the app name in the yarn 
{{ApplicationSubmissionContext}}; at that point the app name set in code is not 
yet available, so it picks the class name instead.
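
For example (illustrative command lines only; the class and name are taken from this thread, the jar is a placeholder), the name has to be supplied at submit time rather than via {{setAppName}} in code:

{noformat}
# pass the name explicitly on submit
spark-submit --master yarn --deploy-mode cluster \
  --name "NDS Transform" \
  --class com.gracenote.ongo.spark.NDSStreamAvro your-app.jar

# or set it via configuration
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.app.name="NDS Transform" \
  --class com.gracenote.ongo.spark.NDSStreamAvro your-app.jar
{noformat}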


> App name is the main class name in Spark streaming jobs
> ---
>
> Key: SPARK-15564
> URL: https://issues.apache.org/jira/browse/SPARK-15564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
>Priority: Minor
>
> I've tried everything to set the app name to something other than the class 
> name of the job, but Spark reports the application name as the class. This 
> adversely affects the ability to monitor jobs; we can't have dots in the 
> reported app name.
> {code:title=job.scala}
>   val defaultAppName = "NDS Transform"
>   conf.setAppName(defaultAppName)
>   println(s"App Name: ${conf.get("spark.app.name")}")
>   ...
>   val ssc = new StreamingContext(conf, streamingBatchWindow)
> {code}
> {code:title=output}
> App Name: NDS Transform
> {code}
> Application ID  Name
> app-20160526161230-0017 (kill)  com.gracenote.ongo.spark.NDSStreamAvro



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15564) App name is the main class name in Spark streaming jobs

2016-05-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303442#comment-15303442
 ] 

Saisai Shao edited comment on SPARK-15564 at 5/27/16 4:04 AM:
--

According to your description, I guess you're running the streaming application 
in yarn cluster mode?

If so, you need to set the application name through {{--name}}, or set 
{{spark.app.name}} in the conf file / via {{--conf}}. In yarn cluster mode the 
yarn client starts before the driver, and it sets the app name in the yarn 
{{ApplicationSubmissionContext}}; at that point the app name set in code is not 
yet available, so it picks the class name instead.

So from my understanding, it is by design.


was (Author: jerryshao):
According to your description, I guess you're running the streaming application 
in yarn cluster mode?

If so, you need to set the application name through {{--name}}, or set 
{{spark.app.name}} in the conf file / via {{--conf}}. In yarn cluster mode the 
yarn client starts before the driver, and it sets the app name in the yarn 
{{ApplicationSubmissionContext}}; at that point the app name set in code is not 
yet available, so it picks the class name instead.


> App name is the main class name in Spark streaming jobs
> ---
>
> Key: SPARK-15564
> URL: https://issues.apache.org/jira/browse/SPARK-15564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
>Priority: Minor
>
> I've tried everything to set the app name to something other than the class 
> name of the job, but Spark reports the application name as the class. This 
> adversely affects the ability to monitor jobs; we can't have dots in the 
> reported app name.
> {code:title=job.scala}
>   val defaultAppName = "NDS Transform"
>   conf.setAppName(defaultAppName)
>   println(s"App Name: ${conf.get("spark.app.name")}")
>   ...
>   val ssc = new StreamingContext(conf, streamingBatchWindow)
> {code}
> {code:title=output}
> App Name: NDS Transform
> {code}
> Application ID  Name
> app-20160526161230-0017 (kill)  com.gracenote.ongo.spark.NDSStreamAvro



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8603.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13165
[https://github.com/apache/spark/pull/13165]

> In Windows,Not able to create a Spark context from R studio 
> 
>
> Key: SPARK-8603
> URL: https://issues.apache.org/jira/browse/SPARK-8603
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Windows, R studio
>Reporter: Prakash Ponshankaarchinnusamy
> Fix For: 2.0.0
>
>   Original Estimate: 0.5m
>  Remaining Estimate: 0.5m
>
> In Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows 
> ["\"] is not respected by the "file.path" function by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2016-05-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8603:
-
Assignee: Hyukjin Kwon

> In Windows,Not able to create a Spark context from R studio 
> 
>
> Key: SPARK-8603
> URL: https://issues.apache.org/jira/browse/SPARK-8603
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Windows, R studio
>Reporter: Prakash Ponshankaarchinnusamy
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>   Original Estimate: 0.5m
>  Remaining Estimate: 0.5m
>
> In Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows 
> ["\"] is not respected by the "file.path" function by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15584:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.
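
A hypothetical sketch of what such constants could look like (the object name and the set of keys shown are illustrative, not the merged change):

{code}
// Centralize the "spark.sql.sources." keys so a typo becomes a compile error
// instead of a silently ignored property.
object DataSourceTableProps {
  val PREFIX = "spark.sql.sources."
  val PROVIDER = PREFIX + "provider"
  val NUM_PARTS = PREFIX + "numParts"
}

// Callers then reference DataSourceTableProps.PROVIDER instead of repeating
// the string literal.
{code}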



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15584:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303427#comment-15303427
 ] 

Apache Spark commented on SPARK-15584:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13349

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303418#comment-15303418
 ] 

koert kuipers commented on SPARK-15598:
---

The reason I ask is that if you plan to do:
{noformat}
inputs.foldLeft(aggregator.init(inputs.head))(aggregator.reduce _)
{noformat}
then I think your implementation of reduce using an Aggregator will not work, 
since the first element gets added twice.

But looking at the code for TypedAggregateExpression and DeclarativeAggregate 
the alternative involves serious changes... 
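
A minimal sketch of the per-partition flow under discussion, assuming the proposed init-based Aggregator from the issue description below (illustrative only):

{code}
// Fold one partition so the first element is consumed by init exactly once
// and is never fed to reduce again.
def aggregatePartition[IN, BUF, OUT](inputs: Iterator[IN],
                                     agg: Aggregator[IN, BUF, OUT]): Option[BUF] = {
  if (!inputs.hasNext) None
  else {
    var buf = agg.init(inputs.next())       // first element: init only
    while (inputs.hasNext) {
      buf = agg.reduce(buf, inputs.next())  // remaining elements: reduce
    }
    Some(buf)
  }
}
{code}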


> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303396#comment-15303396
 ] 

koert kuipers edited comment on SPARK-15598 at 5/27/16 3:22 AM:


Just to be clear, your intention is to use it roughly as follows per partition, 
illustrated with a list:
{noformat}
val inputs: List[IN] = ...
inputs.tail.foldLeft(aggregator.init(inputs.head))(aggregator.reduce _)
{noformat}
or alternatively, since reduce is then redundant (but perhaps this is less efficient):
{noformat}
inputs.map(aggregator.init).reduce(aggregator.merge)
{noformat}

Or did I misunderstand, and your intention is instead:
{noformat}
inputs.foldLeft(aggregator.init(inputs.head))(aggregator.reduce _)
{noformat}
(different in that the first element is passed both into init and into reduce)?


was (Author: koert):
Just to be clear, your intention is to use it roughly as follows per partition, 
illustrated with a list:
{noformat}
val inputs: List[IN] = ...
inputs.tail.foldLeft(aggregator.init(inputs.head))(aggregator.reduce _)
{noformat}
or alternatively, since reduce is then redundant (but perhaps this is less efficient):
{noformat}
inputs.map(aggregator.init).reduce(aggregator.merge)
{noformat}

> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303396#comment-15303396
 ] 

koert kuipers commented on SPARK-15598:
---

Just to be clear, your intention is to use it roughly as follows per partition, 
illustrated with a list:
{noformat}
val inputs: List[IN] = ...
inputs.tail.foldLeft(aggregator.init(inputs.head))(aggregator.reduce _)
{noformat}
or alternatively, since reduce is then redundant (but perhaps this is less efficient):
{noformat}
inputs.map(aggregator.init).reduce(aggregator.merge)
{noformat}

> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15583.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13341
[https://github.com/apache/spark/pull/13341]

> Relax ALTER TABLE properties restriction for data source tables
> ---
>
> Key: SPARK-15583
> URL: https://issues.apache.org/jira/browse/SPARK-15583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Looks like right now we just don't support ALTER TABLE SET TBLPROPERTIES for 
> all properties. This is overly restrictive; as long as the user doesn't touch 
> anything in the special namespace (spark.sql.sources.*) then we're OK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303391#comment-15303391
 ] 

Apache Spark commented on SPARK-15565:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13348

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
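
A minimal sketch of what that one-line change could look like, assuming the scheme is simply prepended to the existing default (the actual constant name and wiring in SQLConf may differ):

{code}
// Give the local default an explicit file: scheme so it is not resolved
// against fs.defaultFS (e.g. HDFS) at runtime.
val defaultWarehousePath: String = "file:" + System.getProperty("user.dir") + "/warehouse"
{code}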



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15565:


Assignee: Apache Spark

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15565:


Assignee: (was: Apache Spark)

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303374#comment-15303374
 ] 

Apache Spark commented on SPARK-15598:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/13347

> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303383#comment-15303383
 ] 

Reynold Xin commented on SPARK-15598:
-

That separation is kept for performance reasons. Otherwise we would have to
invoke two function calls for each record.




> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15598:


Assignee: (was: Apache Spark)

> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15598:


Assignee: Apache Spark

> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303375#comment-15303375
 ] 

koert kuipers commented on SPARK-15598:
---

This makes a lot of sense, and is consistent with algebird's Aggregator (while 
the current implementation is more like algebird's MonoidAggregator).

Do you still need the reduce method? You could provide a reasonable default as:
def reduce(b: BUF, a: IN): BUF = merge(b, init(a))
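
Under that assumption (reduce given the default above), a reduce-style aggregator would only need init, merge and finish; a rough sketch against the proposed API:

{code}
def fromReduceFunction[T](f: (T, T) => T): Aggregator[T, T, T] = new Aggregator[T, T, T] {
  override def init(a: T): T = a                    // buffer starts as the first input
  override def merge(b1: T, b2: T): T = f(b1, b2)   // combine partial results
  override def finish(reduction: T): T = reduction  // output is the buffer itself
}
{code}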


> Change Aggregator.zero to Aggregator.init
> -
>
> Key: SPARK-15598
> URL: https://issues.apache.org/jira/browse/SPARK-15598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> org.apache.spark.sql.expressions.Aggregator currently requires defining the 
> zero value for an aggregator. This is actually a limitation making it 
> difficult to implement APIs such as reduce. In reduce (or reduceByKey), a 
> single associative and commutative reduce function is specified by the user, 
> and there is no definition of zero value.
> A small tweak to the API is to change zero to init, taking an input, similar 
> to the following:
> {code}
> abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
>   def init(a: IN): BUF
>   def reduce(b: BUF, a: IN): BUF
>   def merge(b1: BUF, b2: BUF): BUF
>   def finish(reduction: BUF): OUT
> }
> {code}
> Then reduce can be implemented using:
> {code}
> f: (T, T) => T
> new Aggregator[T, T, T] {
>   override def init(a: T): T = a
>   override def reduce(b: T, a: T): T = f(b, a)
>   override def merge(b1: T, b2: T): T = f(b1, b2)
>   override def finish(reduction: T): T = reduction
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15542) Make error message clear for script './R/install-dev.sh' when R is missing on Mac

2016-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15542:
--
Assignee: Xin Ren

> Make error message clear for script './R/install-dev.sh' when R is missing on 
> Mac
> -
>
> Key: SPARK-15542
> URL: https://issues.apache.org/jira/browse/SPARK-15542
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS EI Captain
>Reporter: Xin Ren
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.0.0
>
>
> I followed the instructions at https://github.com/apache/spark/tree/master/R to 
> build the SparkR project. When running {code}build/mvn -DskipTests -Psparkr 
> package{code} I got the error below:
> {code}
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 23.589 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 19.389 
> s]
> #!/bin/bash
> [INFO] Spark Project Sketch ... SUCCESS [  6.386 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 12.296 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.817 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 10.825 
> s]
> [INFO] Spark Project Launcher . SUCCESS [ 12.262 
> s]
> [INFO] Spark Project Core . FAILURE [01:40 
> min]
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Local Library . SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.8  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Project Java 8 Tests . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> #!/bin/bash
> [INFO] 
> 
> [INFO] Total time: 03:14 min
> [INFO] Finished at: 2016-05-25T21:51:58+00:00
> [INFO] Final Memory: 55M/782M
> [INFO] 
> 
> [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:exec 
> (sparkr-pkg) on project spark-core_2.11: Command execution failed. Process 
> exited with an error: 1 (Exit value: 1) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :spark-core_2.11
> {code}
> and this error turned out to be caused by {code}./R/install-dev.sh{code}
> then I ran this install-dev.sh script directly and got 
> {code}
> mbp185-xr:spark xin$ ./R/install-dev.sh
> usage: dirname path
> {code}
> This message is very confusing to me; I then found that R is not properly 
> configured on my Mac, since this script uses {code}$(which R){code} to get the 
> R home.
> I tried a similar situation on CentOS with R missing, and it gives a very 
> clear error message, while macOS does not.
> on CentOS: {code}
> [root@ip-xxx-31-9-xx spark]# which R
> /usr/bin/which: no R in 
> (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin){code}
> but on Mac, if R is not found then nothing is returned, and this causes the 
> confusing message for the R build failure when running R/install-dev.sh: {code}

[jira] [Resolved] (SPARK-15542) Make error message clear for script './R/install-dev.sh' when R is missing on Mac

2016-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15542.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13308
[https://github.com/apache/spark/pull/13308]

> Make error message clear for script './R/install-dev.sh' when R is missing on 
> Mac
> -
>
> Key: SPARK-15542
> URL: https://issues.apache.org/jira/browse/SPARK-15542
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: Mac OS EI Captain
>Reporter: Xin Ren
>Priority: Minor
> Fix For: 2.0.0
>
>
> I followed the instructions at https://github.com/apache/spark/tree/master/R to 
> build the SparkR project. When running {code}build/mvn -DskipTests -Psparkr 
> package{code} I got the error below:
> {code}
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 23.589 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 19.389 
> s]
> #!/bin/bash
> [INFO] Spark Project Sketch ... SUCCESS [  6.386 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 12.296 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  7.817 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 10.825 
> s]
> [INFO] Spark Project Launcher . SUCCESS [ 12.262 
> s]
> [INFO] Spark Project Core . FAILURE [01:40 
> min]
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Local Library . SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.8  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Project Java 8 Tests . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> #!/bin/bash
> [INFO] 
> 
> [INFO] Total time: 03:14 min
> [INFO] Finished at: 2016-05-25T21:51:58+00:00
> [INFO] Final Memory: 55M/782M
> [INFO] 
> 
> [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:exec 
> (sparkr-pkg) on project spark-core_2.11: Command execution failed. Process 
> exited with an error: 1 (Exit value: 1) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :spark-core_2.11
> {code}
> and this error turned out to be caused by {code}./R/install-dev.sh{code}
> then I ran this install-dev.sh script directly and got 
> {code}
> mbp185-xr:spark xin$ ./R/install-dev.sh
> usage: dirname path
> {code}
> This message is very confusing to me; I then found that R is not properly 
> configured on my Mac, since this script uses {code}$(which R){code} to get the 
> R home.
> I tried a similar situation on CentOS with R missing, and it gives a very 
> clear error message, while macOS does not.
> on CentOS: {code}
> [root@ip-xxx-31-9-xx spark]# which R
> /usr/bin/which: no R in 
> (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin){code}
> but on Mac, if not found then nothing returned and this is causing 

[jira] [Assigned] (SPARK-15599) Document createDataset functions in SparkSession

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15599:


Assignee: (was: Apache Spark)

> Document createDataset functions in SparkSession
> 
>
> Key: SPARK-15599
> URL: https://issues.apache.org/jira/browse/SPARK-15599
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15599) Document createDataset functions in SparkSession

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15599:


Assignee: Apache Spark

> Document createDataset functions in SparkSession
> 
>
> Key: SPARK-15599
> URL: https://issues.apache.org/jira/browse/SPARK-15599
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15599) Document createDataset functions in SparkSession

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303344#comment-15303344
 ] 

Apache Spark commented on SPARK-15599:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/13345

> Document createDataset functions in SparkSession
> 
>
> Key: SPARK-15599
> URL: https://issues.apache.org/jira/browse/SPARK-15599
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15599) Document createDataset functions in SparkSession

2016-05-26 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-15599:
--

 Summary: Document createDataset functions in SparkSession
 Key: SPARK-15599
 URL: https://issues.apache.org/jira/browse/SPARK-15599
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Sameer Agarwal






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15536.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Disallow TRUNCATE TABLE with external tables and views
> --
>
> Key: SPARK-15536
> URL: https://issues.apache.org/jira/browse/SPARK-15536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Otherwise we might accidentally delete existing data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303336#comment-15303336
 ] 

Apache Spark commented on SPARK-15536:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13315

> Disallow TRUNCATE TABLE with external tables and views
> --
>
> Key: SPARK-15536
> URL: https://issues.apache.org/jira/browse/SPARK-15536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Otherwise we might accidentally delete existing data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15538) Truncate table does not work on data source table

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15538.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Suresh Thalamati
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 2.0.0
>
>
> Truncate table does not seem to work on data source tables.
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show() // FileNotFoundException
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15562) Temp directory is not deleted after program exit in DataFrameExample

2016-05-26 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303326#comment-15303326
 ] 

ding commented on SPARK-15562:
--

OK, I will check if there are similar JIRAs before creating a new one next time.

> Temp directory is not deleted after program exit in DataFrameExample
> 
>
> Key: SPARK-15562
> URL: https://issues.apache.org/jira/browse/SPARK-15562
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: ding
>Priority: Minor
>
> The temp directory used to save records is not deleted after program exit in 
> DataFrameExample. Although it calls deleteOnExit, that doesn't work because the 
> directory is not empty. A similar thing happens in ContextCleanerSuite.
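
A hedged sketch of a cleanup that does handle a non-empty directory (illustrative only, not the fix applied in the example):

{code}
import java.io.File

// deleteOnExit removes only empty directories; a shutdown hook that deletes the
// tree recursively also works when record files are still present.
def deleteRecursivelyOnExit(dir: File): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = {
      def delete(f: File): Unit = {
        Option(f.listFiles()).foreach(_.foreach(delete))
        f.delete()
      }
      delete(dir)
    }
  }))
}
{code}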



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15598) Change Aggregator.zero to Aggregator.init

2016-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15598:
---

 Summary: Change Aggregator.zero to Aggregator.init
 Key: SPARK-15598
 URL: https://issues.apache.org/jira/browse/SPARK-15598
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


org.apache.spark.sql.expressions.Aggregator currently requires defining the 
zero value for an aggregator. This is actually a limitation making it difficult 
to implement APIs such as reduce. In reduce (or reduceByKey), a single 
associative and commutative reduce function is specified by the user, and there 
is no definition of zero value.

A small tweak to the API is to change zero to init, taking an input, similar to 
the following:

{code}
abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
  def init(a: IN): BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}
{code}

Then reduce can be implemented using:

{code}
f: (T, T) => T

new Aggregator[T, T, T] {
  override def init(a: T): T = a
  override def reduce(b: T, a: T): T = f(b, a)
  override def merge(b1: T, b2: T): T = f(b1, b2)
  override def finish(reduction: T): T = reduction
}
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15597) Add SparkSession.emptyDataset

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303316#comment-15303316
 ] 

Apache Spark commented on SPARK-15597:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13344

> Add SparkSession.emptyDataset
> -
>
> Key: SPARK-15597
> URL: https://issues.apache.org/jira/browse/SPARK-15597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SparkSession currently has emptyDataFrame, but not emptyDataset.
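
A hedged sketch of what such a method could do, expressed via the existing createDataset (the actual implementation in the pull request may differ):

{code}
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// An empty, typed Dataset built from the public API.
def emptyDataset[T: Encoder](spark: SparkSession): Dataset[T] =
  spark.createDataset(Seq.empty[T])
{code}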



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15597) Add SparkSession.emptyDataset

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15597:


Assignee: Reynold Xin  (was: Apache Spark)

> Add SparkSession.emptyDataset
> -
>
> Key: SPARK-15597
> URL: https://issues.apache.org/jira/browse/SPARK-15597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> SparkSession currently has emptyDataFrame, but not emptyDataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15597) Add SparkSession.emptyDataset

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15597:


Assignee: Apache Spark  (was: Reynold Xin)

> Add SparkSession.emptyDataset
> -
>
> Key: SPARK-15597
> URL: https://issues.apache.org/jira/browse/SPARK-15597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> SparkSession currently has emptyDataFrame, but not emptyDataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15597) Add SparkSession.emptyDataset

2016-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15597:
---

 Summary: Add SparkSession.emptyDataset
 Key: SPARK-15597
 URL: https://issues.apache.org/jira/browse/SPARK-15597
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


SparkSession currently has emptyDataFrame, but not emptyDataset.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15595) DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites non-partitioned table

2016-05-26 Thread Sudarshan Lamkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudarshan Lamkhede resolved SPARK-15595.

Resolution: Invalid

This seems to be specific to the custom Spark distribution I am using.

> DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites 
> non-partitioned table
> 
>
> Key: SPARK-15595
> URL: https://issues.apache.org/jira/browse/SPARK-15595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sudarshan Lamkhede
>
> See the examples below
> {noformat}
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name 
> STRING, dateint INT) STORED AS PARQUET""")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
> PARTITIONED BY (dateint INT) STORED AS PARQUET""")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sqlContext.sql("select * from noparts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> +--+---+
> scala> sqlContext.sql("select * from parts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> +--+---+
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
> "dateint")
> df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]
> scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
> "dateint")
> df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]
> scala> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SaveMode
> scala> df1.write.mode(SaveMode.Append).insertInto("noparts")
> {noformat}
> This inserts one record
> {noformat}
> scala> sqlContext.sql("select * from noparts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> |before|  1|
> +--+---+
> {noformat}
> But subsequent writes overwrite it
> {noformat}
> scala> df2.write.mode(SaveMode.Append).insertInto("noparts")
> scala> sqlContext.sql("select * from noparts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> | after|  2|
> +--+---+
> {noformat}
> That does not happen with a partitioned table
> {noformat}
> scala> df1.write.mode(SaveMode.Append).insertInto("parts")
> scala> sqlContext.sql("select * from parts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> |before|  1|
> +--+---+
> scala> df2.write.mode(SaveMode.Append).insertInto("parts")
> scala> sqlContext.sql("select * from parts").show()
> +--+---+
> |model_name|dateint|
> +--+---+
> |before|  1|
> | after|  2|
> +--+---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15595) DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites non-partitioned table

2016-05-26 Thread Sudarshan Lamkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudarshan Lamkhede updated SPARK-15595:
---
Description: 
See the examples below
{noformat}
scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name STRING, 
dateint INT) STORED AS PARQUET""")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
PARTITIONED BY (dateint INT) STORED AS PARQUET""")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
"dateint")
df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
"dateint")
df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> df1.write.mode(SaveMode.Append).insertInto("noparts")
{noformat}
This inserts one record
{noformat}
scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+
{noformat}
But subsequent writes overwrite it
{noformat}
scala> df2.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
| after|  2|
+--+---+
{noformat}

That does not happen with a partitioned table
{noformat}
scala> df1.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
| after|  2|
+--+---+

{noformat}

  was:
See the examples below
{noformat}
scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name STRING, 
dateint INT) STORED AS PARQUET""")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
PARTITIONED BY (dateint INT) STORED AS PARQUET""")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
"dateint")
df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
"dateint")
df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> df1.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
| after|  2|
+--+---+


scala> df1.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
| after|  2|
+--+---+

{noformat}


> DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites 
> non-partitioned table
> 
>
> Key: SPARK-15595
> URL: https://issues.apache.org/jira/browse/SPARK-15595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sudarshan Lamkhede
>
> See the examples below

[jira] [Updated] (SPARK-15595) DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites non-partitioned table

2016-05-26 Thread Sudarshan Lamkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudarshan Lamkhede updated SPARK-15595:
---
Description: 
See the examples below
{noformat}
scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name STRING, 
dateint INT) STORED AS PARQUET""")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
PARTITIONED BY (dateint INT) STORED AS PARQUET""")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
"dateint")
df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
"dateint")
df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> df1.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
| after|  2|
+--+---+


scala> df1.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
| after|  2|
+--+---+

{noformat}

  was:
See the examples below

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name STRING, 
dateint INT) STORED AS PARQUET""")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
PARTITIONED BY (dateint INT) STORED AS PARQUET""")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
"dateint")
df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
"dateint")
df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> df1.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
| after|  2|
+--+---+


scala> df1.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
| after|  2|
+--+---+




> DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites 
> non-partitioned table
> 
>
> Key: SPARK-15595
> URL: https://issues.apache.org/jira/browse/SPARK-15595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sudarshan Lamkhede
>
> See the examples below
> {noformat}
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name 
> STRING, dateint INT) STORED AS PARQUET""")
> res0: org.apache.spark.sql.DataFrame = [result: 

[jira] [Created] (SPARK-15596) ALTER TABLE RENAME needs to uncache query

2016-05-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15596:
-

 Summary: ALTER TABLE RENAME needs to uncache query
 Key: SPARK-15596
 URL: https://issues.apache.org/jira/browse/SPARK-15596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15595) DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) overwrites non-partitioned table

2016-05-26 Thread Sudarshan Lamkhede (JIRA)
Sudarshan Lamkhede created SPARK-15595:
--

 Summary: DataFrame.write.mode(SaveMode.Append).insertInto(TABLE) 
overwrites non-partitioned table
 Key: SPARK-15595
 URL: https://issues.apache.org/jira/browse/SPARK-15595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Sudarshan Lamkhede


See the examples below

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS noparts (model_name STRING, 
dateint INT) STORED AS PARQUET""")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS parts (model_name STRING) 
PARTITIONED BY (dateint INT) STORED AS PARQUET""")
res1: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
+--+---+


scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sc.parallelize(Array(("before", 1)), 1).toDF("model_name", 
"dateint")
df1: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> val df2 = sc.parallelize(Array(("after", 2)), 1).toDF("model_name", 
"dateint")
df2: org.apache.spark.sql.DataFrame = [model_name: string, dateint: int]

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> df1.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("noparts")

scala> sqlContext.sql("select * from noparts").show()
+--+---+
|model_name|dateint|
+--+---+
| after|  2|
+--+---+


scala> df1.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
+--+---+


scala> df2.write.mode(SaveMode.Append).insertInto("parts")

scala> sqlContext.sql("select * from parts").show()
+--+---+
|model_name|dateint|
+--+---+
|before|  1|
| after|  2|
+--+---+





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15584:
--
Assignee: Dongjoon Hyun

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.
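
A minimal sketch of the suggestion, with illustrative names (the actual object and constant names would live wherever the data source table code keeps its helpers):

{code}
// Illustrative only: gather the "spark.sql.sources." keys in one place so call
// sites reference a constant instead of retyping the string literal.
object DataSourceTableProps {
  val Prefix: String   = "spark.sql.sources."
  val Provider: String = Prefix + "provider"
  val NumParts: String = Prefix + "numParts"
}

// A typo in a constant name now fails at compile time instead of silently at runtime:
// tableProperties.put(DataSourceTableProps.Provider, provider)
{code}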



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303276#comment-15303276
 ] 

Dongjoon Hyun commented on SPARK-15584:
---

Oh, sure! Thank you!

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15594:


Assignee: Apache Spark  (was: Andrew Or)

> ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
> ---
>
> Key: SPARK-15594
> URL: https://issues.apache.org/jira/browse/SPARK-15594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> {code}
> case class AlterTableSerDePropertiesCommand(
> tableName: TableIdentifier,
> serdeClassName: Option[String],
> serdeProperties: Option[Map[String, String]],
> partition: Option[Map[String, String]])
>   extends RunnableCommand {
> {code}
> The `partition` flag is not read anywhere!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303271#comment-15303271
 ] 

Apache Spark commented on SPARK-15594:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13343

> ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
> ---
>
> Key: SPARK-15594
> URL: https://issues.apache.org/jira/browse/SPARK-15594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> case class AlterTableSerDePropertiesCommand(
> tableName: TableIdentifier,
> serdeClassName: Option[String],
> serdeProperties: Option[Map[String, String]],
> partition: Option[Map[String, String]])
>   extends RunnableCommand {
> {code}
> The `partition` flag is not read anywhere!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15594:


Assignee: Andrew Or  (was: Apache Spark)

> ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
> ---
>
> Key: SPARK-15594
> URL: https://issues.apache.org/jira/browse/SPARK-15594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> case class AlterTableSerDePropertiesCommand(
> tableName: TableIdentifier,
> serdeClassName: Option[String],
> serdeProperties: Option[Map[String, String]],
> partition: Option[Map[String, String]])
>   extends RunnableCommand {
> {code}
> The `partition` flag is not read anywhere!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec

2016-05-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15594:
-

 Summary: ALTER TABLE ... SERDEPROPERTIES does not respect 
partition spec
 Key: SPARK-15594
 URL: https://issues.apache.org/jira/browse/SPARK-15594
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


{code}
case class AlterTableSerDePropertiesCommand(
tableName: TableIdentifier,
serdeClassName: Option[String],
serdeProperties: Option[Map[String, String]],
partition: Option[Map[String, String]])
  extends RunnableCommand {
{code}
The `partition` flag is not read anywhere!
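
A hedged repro sketch of the statement that is affected (table and partition names are made up); the PARTITION clause is parsed into `partition` but then never consulted by the command:

{code}
// Hypothetical example for spark-shell; "logs" and the dt partition are illustrative.
sqlContext.sql("""
  ALTER TABLE logs PARTITION (dt = '2016-05-26')
  SET SERDEPROPERTIES ('field.delim' = ',')
""")
// Expected: only the dt='2016-05-26' partition's SerDe properties change.
// With the current command the partition spec is silently ignored.
{code}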



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15593) Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303241#comment-15303241
 ] 

Apache Spark commented on SPARK-15593:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/13342

> Add DataFrameWriter.foreach to allow the user consuming data in 
> ContinuousQuery
> ---
>
> Key: SPARK-15593
> URL: https://issues.apache.org/jira/browse/SPARK-15593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15593) Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15593:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add DataFrameWriter.foreach to allow the user consuming data in 
> ContinuousQuery
> ---
>
> Key: SPARK-15593
> URL: https://issues.apache.org/jira/browse/SPARK-15593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15593) Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15593:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add DataFrameWriter.foreach to allow the user consuming data in 
> ContinuousQuery
> ---
>
> Key: SPARK-15593
> URL: https://issues.apache.org/jira/browse/SPARK-15593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15593) Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery

2016-05-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-15593:


 Summary: Add DataFrameWriter.foreach to allow the user consuming 
data in ContinuousQuery
 Key: SPARK-15593
 URL: https://issues.apache.org/jira/browse/SPARK-15593
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15588) Paginate Stage Table in Stages tab, Job Table in Jobs tab, and Query Table in SQL tab

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15588:
-
Summary: Paginate Stage Table in Stages tab, Job Table in Jobs tab, and 
Query Table in SQL tab  (was: Paginate Stage Table in Stages tab and Job Table 
in Jobs tab)

> Paginate Stage Table in Stages tab, Job Table in Jobs tab, and Query Table in 
> SQL tab
> -
>
> Key: SPARK-15588
> URL: https://issues.apache.org/jira/browse/SPARK-15588
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>
> It seems we do not paginate the Stage Table in the Stages tab or the Job Table 
> in the Jobs tab. We can use PagedTable to make StageTableBase (the class for 
> the Stage Table) support pagination. For the Job Table in the Jobs tab, it 
> looks like we need to extract jobsTable from the AllJobsPage class and make it 
> use PagedTable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15591) Paginate Stage Table in Stages tab

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15591:
-
Assignee: Tao Lin

> Paginate Stage Table in Stages tab
> --
>
> Key: SPARK-15591
> URL: https://issues.apache.org/jira/browse/SPARK-15591
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15592) Paginate query table in SQL tab

2016-05-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15592:


 Summary: Paginate query table in SQL tab
 Key: SPARK-15592
 URL: https://issues.apache.org/jira/browse/SPARK-15592
 Project: Spark
  Issue Type: Sub-task
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15592) Paginate query table in SQL tab

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15592:
-
Assignee: Tao Lin

> Paginate query table in SQL tab
> ---
>
> Key: SPARK-15592
> URL: https://issues.apache.org/jira/browse/SPARK-15592
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15590) Paginate Job Table in Jobs tab

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15590:
-
Assignee: Tao Lin

> Paginate Job Table in Jobs tab
> --
>
> Key: SPARK-15590
> URL: https://issues.apache.org/jira/browse/SPARK-15590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15590) Paginate Job Table in Jobs tab

2016-05-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15590:


 Summary: Paginate Job Table in Jobs tab
 Key: SPARK-15590
 URL: https://issues.apache.org/jira/browse/SPARK-15590
 Project: Spark
  Issue Type: Sub-task
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15591) Paginate Stage Table in Stages tab

2016-05-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15591:


 Summary: Paginate Stage Table in Stages tab
 Key: SPARK-15591
 URL: https://issues.apache.org/jira/browse/SPARK-15591
 Project: Spark
  Issue Type: Sub-task
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15589) Analyze simple PySpark closures and generate SQL expressions

2016-05-26 Thread holdenk (JIRA)
holdenk created SPARK-15589:
---

 Summary: Analyze simple PySpark closures and generate SQL 
expressions
 Key: SPARK-15589
 URL: https://issues.apache.org/jira/browse/SPARK-15589
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: holdenk


Similar to SPARK-14083, we can try introspecting simple Python functions and see 
whether we can generate an equivalent SQL expression. This would yield an even 
greater performance increase for PySpark users than for Scala users: not only 
would they benefit from better codegen, it would also avoid substantial 
serialization cost.
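
To illustrate the intended outcome (shown in Scala for brevity, although the proposal targets Python closures): a trivial row-level function is replaced by an equivalent Column expression that the optimizer and codegen can see through.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("closure-vs-expression").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 30L), ("bob", 25L)).toDF("name", "age")

// Closure version: opaque to the optimizer, every row is deserialized per record.
val viaClosure = df.map(row => row.getAs[Long]("age") + 1)

// Expression version: what closure analysis would ideally emit instead.
val viaExpression = df.select(col("age") + 1)
{code}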



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15583:


Assignee: Apache Spark  (was: Andrew Or)

> Relax ALTER TABLE properties restriction for data source tables
> ---
>
> Key: SPARK-15583
> URL: https://issues.apache.org/jira/browse/SPARK-15583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It looks like right now we just don't support ALTER TABLE SET TBLPROPERTIES for 
> all properties. This is overly restrictive; as long as the user doesn't touch 
> anything in the special namespace (spark.sql.sources.*), we're OK.
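
A minimal sketch of the relaxed check being suggested (the helper name and wiring are illustrative):

{code}
// Only keys inside the reserved namespace are rejected; everything else is allowed.
val ReservedPrefix = "spark.sql.sources."

def validateUserTableProperties(props: Map[String, String]): Unit = {
  val reserved = props.keys.filter(_.startsWith(ReservedPrefix))
  require(reserved.isEmpty,
    s"Cannot set reserved table properties: ${reserved.mkString(", ")}")
}
{code}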



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15583:


Assignee: Andrew Or  (was: Apache Spark)

> Relax ALTER TABLE properties restriction for data source tables
> ---
>
> Key: SPARK-15583
> URL: https://issues.apache.org/jira/browse/SPARK-15583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It looks like right now we just don't support ALTER TABLE SET TBLPROPERTIES for 
> all properties. This is overly restrictive; as long as the user doesn't touch 
> anything in the special namespace (spark.sql.sources.*), we're OK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303231#comment-15303231
 ] 

Apache Spark commented on SPARK-15583:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13341

> Relax ALTER TABLE properties restriction for data source tables
> ---
>
> Key: SPARK-15583
> URL: https://issues.apache.org/jira/browse/SPARK-15583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It looks like right now we just don't support ALTER TABLE SET TBLPROPERTIES for 
> all properties. This is overly restrictive; as long as the user doesn't touch 
> anything in the special namespace (spark.sql.sources.*), we're OK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10903) Simplify SQLContext method signatures and use a singleton

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303210#comment-15303210
 ] 

Apache Spark commented on SPARK-10903:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/13340

> Simplify SQLContext method signatures and use a singleton
> -
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2016-05-26 Thread Patrick Grandjean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303205#comment-15303205
 ] 

Patrick Grandjean commented on SPARK-7768:
--

This seems promising:

https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UDTRegistration.scala


> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14608) transformSchema needs better documentation

2016-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14608:
--
Shepherd: Joseph K. Bradley
Assignee: yuhao yang

> transformSchema needs better documentation
> --
>
> Key: SPARK-14608
> URL: https://issues.apache.org/jira/browse/SPARK-14608
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
>
> {{PipelineStage.transformSchema}} currently has minimal documentation. It 
> should explain more, for example that it can:
> * check the schema
> * check parameter interactions
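
For illustration, a sketch of the kind of behaviour the documentation could spell out (a hypothetical transformer, not actual Spark code): transformSchema validates the input schema and parameter interactions and derives the output schema without touching any data.

{code}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Illustrative stand-in for a PipelineStage's transformSchema.
def transformSchema(schema: StructType, inputCol: String, outputCol: String): StructType = {
  // Check schema: the input column must exist and have the expected type.
  val field = schema(inputCol)
  require(field.dataType == DoubleType,
    s"Column $inputCol must be DoubleType but is ${field.dataType}")
  // Check parameter interactions: the output column must not collide with an existing one.
  require(!schema.fieldNames.contains(outputCol), s"Column $outputCol already exists")
  StructType(schema.fields :+ StructField(outputCol, DoubleType, nullable = false))
}
{code}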



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15532) SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15532.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13310
[https://github.com/apache/spark/pull/13310]

> SQLContext/HiveContext's public constructors should use 
> SparkSession.build.getOrCreate
> --
>
> Key: SPARK-15532
> URL: https://issues.apache.org/jira/browse/SPARK-15532
> Project: Spark
>  Issue Type: Bug
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> SQLContext/HiveContext's public constructors should use 
> SparkSession.build.getOrCreate and we can remove SQLContext's isRootContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15588) Paginate Stage Table in Stages tab and Job Table in Jobs tab

2016-05-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15588:


 Summary: Paginate Stage Table in Stages tab and Job Table in Jobs 
tab
 Key: SPARK-15588
 URL: https://issues.apache.org/jira/browse/SPARK-15588
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Yin Huai
Assignee: Tao Lin


It seems we do not paginate the Stage Table in the Stages tab or the Job Table in 
the Jobs tab. We can use PagedTable to make StageTableBase (the class for the 
Stage Table) support pagination. For the Job Table in the Jobs tab, it looks like 
we need to extract jobsTable from the AllJobsPage class and make it use PagedTable.
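
A rough sketch of the core idea, with illustrative names (the real work is wiring StageTableBase and the extracted jobs table into the existing PagedTable machinery): render a slice of rows per request instead of the whole table.

{code}
// Illustrative only; the actual PagedTable contract in the Web UI differs.
class PagedSlice[T](rows: Seq[T], pageSize: Int) {
  require(pageSize > 0, "pageSize must be positive")
  val totalPages: Int = (rows.size + pageSize - 1) / pageSize
  def page(pageNumber: Int): Seq[T] = {
    require(pageNumber >= 1, s"invalid page number: $pageNumber")
    rows.slice((pageNumber - 1) * pageSize, pageNumber * pageSize)
  }
}
{code}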



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15099) Audit: ml.regression

2016-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303173#comment-15303173
 ] 

Joseph K. Bradley commented on SPARK-15099:
---

Have you been able to do the audit yet?

> Audit: ml.regression
> 
>
> Key: SPARK-15099
> URL: https://issues.apache.org/jira/browse/SPARK-15099
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15098) Audit: ml.classification

2016-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15098.
---
   Resolution: Done
 Assignee: yuhao yang
Fix Version/s: 2.0.0

I'll mark this done given that [~yuhaoyan] did a pass.  If anyone finds an 
issue, feel free to reopen the JIRA.

> Audit: ml.classification
> 
>
> Key: SPARK-15098
> URL: https://issues.apache.org/jira/browse/SPARK-15098
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.0.0
>
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14813) ML 2.0 QA: API: Python API coverage

2016-05-26 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303169#comment-15303169
 ] 

holdenk commented on SPARK-14813:
-

I'd really like to split it up - but haven't heard back from [~yanboliang].

> ML 2.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-14813
> URL: https://issues.apache.org/jira/browse/SPARK-14813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below as "requires") for this list of 
> to-do items.
> ** *NOTE: These missing features should be added in the next release.  This 
> work is just to generate a list of to-do items for the future.*
> UPDATE: This only needs to cover spark.ml since spark.mllib is going into 
> maintenance mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14815) ML, Graph, R 2.0 QA: Update user guide for new features & APIs

2016-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303168#comment-15303168
 ] 

Joseph K. Bradley commented on SPARK-14815:
---

I pretty much agree.  I'd be OK with deprecating them for removal in a future 
release.

> ML, Graph, R 2.0 QA: Update user guide for new features & APIs
> --
>
> Key: SPARK-14815
> URL: https://issues.apache.org/jira/browse/SPARK-14815
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide; 
> that will be under [SPARK-14817].
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work (which should be broken into pieces for 
> this larger 2.0 release).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14813) ML 2.0 QA: API: Python API coverage

2016-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303166#comment-15303166
 ] 

Joseph K. Bradley commented on SPARK-14813:
---

[~holdenk] Are you auditing all of PySpark yourself?  If so, can you please 
confirm when you are done with your audit?  If not, then let's create subtasks 
to track audits of different submodules.  Thanks!

> ML 2.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-14813
> URL: https://issues.apache.org/jira/browse/SPARK-14813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below as "requires") for this list of 
> to-do items.
> ** *NOTE: These missing features should be added in the next release.  This 
> work is just to generate a list of to-do items for the future.*
> UPDATE: This only needs to cover spark.ml since spark.mllib is going into 
> maintenance mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15587) ML 2.0 QA: Scala APIs audit for feature

2016-05-26 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15587:
-

 Summary: ML 2.0 QA: Scala APIs audit for feature
 Key: SPARK-15587
 URL: https://issues.apache.org/jira/browse/SPARK-15587
 Project: Spark
  Issue Type: Task
  Components: ML
Reporter: Joseph K. Bradley


See containing JIRA for details: [SPARK-14811]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303160#comment-15303160
 ] 

Takeshi Yamamuro commented on SPARK-15585:
--

Yeah, if there's no problem, I'll take this.

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.
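
A hedged sketch (names illustrative) of the distinction this is after on the JVM side: an option the caller never set falls back to the default, while a value the caller did pass, including an explicitly empty one, is honoured rather than silently replaced.

{code}
def resolveQuoteChar(options: Map[String, String]): Option[Char] =
  options.get("quote") match {
    case None                 => Some('"')  // not supplied: fall back to the default
    case Some(v) if v.isEmpty => None       // explicitly disabled by the caller
    case Some(v)              => Some(v.head)
  }
{code}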



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15586) ML 2.0 QA: Scala APIs audit for evaluation, tuning

2016-05-26 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15586:
-

 Summary: ML 2.0 QA: Scala APIs audit for evaluation, tuning
 Key: SPARK-15586
 URL: https://issues.apache.org/jira/browse/SPARK-15586
 Project: Spark
  Issue Type: Task
  Components: ML
Reporter: Joseph K. Bradley


See containing JIRA for details: [SPARK-14811]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15550) Dataset.show() doesn't display inner nested structs properly

2016-05-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15550.

Resolution: Fixed

Issue resolved by pull request 13331
[https://github.com/apache/spark/pull/13331]

> Dataset.show() doesn't display inner nested structs properly
> ---
>
> Key: SPARK-15550
> URL: https://issues.apache.org/jira/browse/SPARK-15550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Say we have the following nested case class:
> {code}
> case class ClassData(a: String, b: Int)
> case class NestedStruct(f: ClassData)
> {code}
> For a Dataset {{ds}} of {{NestedStruct}}, {{ds.show()}} should convert all 
> case class instances, including the inner nested {{ClassData}}, into {{Row}} 
> instances before displaying them. However, {{ClassData}} instances are just 
> displayed using {{toString}}.
> {code}
> val data = Seq(
>   "{'f': {'b': 1, 'a': 'foo'}}",
>   "{'f': {'b': 2, 'a': 'bar'}}"
> )
> val df = spark.read.json(sc.parallelize(data))
> val ds = df.as[NestedStruct]
> {code}
> Actual output:
> {noformat}
> ++
> |   f|
> ++
> |ClassData(foo,1)|
> |ClassData(bar,2)|
> ++
> {noformat}
> Expected output:
> {noformat}
> +---+
> |  f|
> +---+
> |[1,foo]|
> |[2,bar]|
> +---+
> {noformat}
> This is not too big a deal for Scala users, since Scala case classes always 
> come with a well-defined default {{toString}} method, but Java beans don't.
> Another point is that a Dataset is just a view of the underlying logical plan, 
> and the domain object type may not refer to all fields defined in that plan. 
> However, users are still allowed to access these extra fields using methods 
> like {{Dataset.col}}. Due to this consideration, we decided to let 
> {{Dataset.show()}} directly delegate to {{Dataset.toDF().show()}}, which shows 
> all fields defined in the logical plan.
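
A simplified sketch of that decision (not the exact internal code): rendering goes through the DataFrame view, so every field, including nested case classes, is converted to a Row before being formatted.

{code}
import org.apache.spark.sql.Dataset

// Illustrative helper mirroring the chosen behaviour of Dataset.show().
def showAsDataFrame[T](ds: Dataset[T], numRows: Int = 20): Unit =
  ds.toDF().show(numRows)
{code}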



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303142#comment-15303142
 ] 

Reynold Xin commented on SPARK-15585:
-

cc [~maropu] interested in doing this?


> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303144#comment-15303144
 ] 

Reynold Xin commented on SPARK-15585:
-

cc [~shivaram] / [~sunrui] / [~felixcheung] would this impact R?

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15585:

Description: 
See email: 
http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html

We'd need to change DataFrameReader/DataFrameWriter in Python's 
csv/json/parquet/... functions to put the actual default option values as 
function parameters, rather than setting them to None. We can then make 
CSVOptions.getChar (and JSONOptions, etc.) actually return null if the value 
is null, rather than setting it to the default value.


  was:
See email: 
http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html

We'd need to change DataFrameReader/DataFrameWriter in Python's 
csv/json/parquet/... functions to put the actual default option values as 
function parameters, rather than setting them to None.



> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15585) Don't use null in data source options to indicate default value

2016-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15585:
---

 Summary: Don't use null in data source options to indicate default 
value
 Key: SPARK-15585
 URL: https://issues.apache.org/jira/browse/SPARK-15585
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Critical


See email: 
http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html

We'd need to change DataFrameReader/DataFrameWriter in Python's 
csv/json/parquet/... functions to put the actual default option values as 
function parameters, rather than setting them to None.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8428) TimSort Comparison method violates its general contract with CLUSTER BY

2016-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8428.

   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.2

> TimSort Comparison method violates its general contract with CLUSTER BY
> ---
>
> Key: SPARK-8428
> URL: https://issues.apache.org/jira/browse/SPARK-8428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Oracle Java 7 
>Reporter: Nathan McCarthy
>Assignee: Sameer Agarwal
> Fix For: 1.6.2, 2.0.0
>
>
> Running an SQL query that has a subquery and multiple left joins fails when 
> there is a CLUSTER BY (which implies a sortBy). This gives the following 
> stack trace:
> {code}
> Job aborted due to stage failure: Task 118 in stage 4.0 failed 4 times, most 
> recent failure: Lost task 118.3 in stage 4.0 (TID 18392, node142): 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeHi(TimSort.java:900)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:509)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:435)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:307)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:135)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.PartitionedPairBuffer.partitionedDestructiveSortedIterator(PartitionedPairBuffer.scala:70)
>   at 
> org.apache.spark.util.collection.ExternalSorter.partitionedIterator(ExternalSorter.scala:690)
>   at 
> org.apache.spark.util.collection.ExternalSorter.iterator(ExternalSorter.scala:708)
>   at 
> org.apache.spark.sql.execution.ExternalSort$$anonfun$doExecute$6$$anonfun$apply$7.apply(basicOperators.scala:222)
>   at 
> org.apache.spark.sql.execution.ExternalSort$$anonfun$doExecute$6$$anonfun$apply$7.apply(basicOperators.scala:218)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> {code}
> The query looks like:
> {code}
>  val df = sqlContext.sql("""SELECT CID
> |,PW_END_DATE
> |,PROD_NBR_KEY
> |,SUM(CASE WHEN SUBST_IDX = 1 THEN L13W_SALE END) AS SUB1_L13W_SALE
> |FROM
> |(SELECT  BASE.CID
> |,BASE.PW_END_DATE
> |,BASE.PROD_NBR_KEY
> |,SUBN.SUBST_IDX
> |,CASE WHEN IDX.PW_END_DATE BETWEEN DATE_SUB(BASE.PW_END_DATE, 13*7 - 1) 
> AND BASE.PW_END_DATE THEN IDX.TOT_AMT_INCLD_GST END AS L13W_SALE
> |FROM TESTBASE BASE
> |LEFT JOIN TABLX SUBN
> |ON BASE.PROD_NBR_KEY = SUBN.PROD_NBR_KEY AND SUBN.SUBST_IDX <= 3
> |LEFT JOIN TABLEF IDX
> |ON BASE.CRN = IDX.CRN
> |AND SUBN.CROSS_PROD_NBR = IDX.PROD_NBR_KEY
> |) SUBSPREM
> | GROUP BY CRN, PW_END_DATE, PROD_NBR_KEY""".stripMargin)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-05-26 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303103#comment-15303103
 ] 

Ryan Blue commented on SPARK-13723:
---

I'm porting our changes forward to the 2.0.0 preview so I opened a PR that 
implements this. I highly recommend changing this default.

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified while 
> dynamic allocation is enabled. Currently, if --num-executors is specified, 
> dynamic allocation is disabled and a static number of executors is used.
> I would rather see the default behavior changed in the 2.x line: if the dynamic 
> allocation config is on, then --num-executors sets both the maximum and the 
> initial number of executors. I think this would allow users to easily cap their 
> usage while still allowing executors to be freed up. It would also let users 
> doing ML start out with a given number of executors, and if they are actually 
> caching the data those executors wouldn't be freed up, so you would get very 
> similar behavior to having dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things in spark-shell. It also 
> has a big effect when people are doing MapReduce/ETL-type workloads. The 
> problem is that people are used to specifying --num-executors, so if we turn 
> dynamic allocation on by default in a cluster config it just gets overridden.
> We should also update the spark-submit --help description for --num-executors
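
To make the proposal concrete, a hypothetical illustration (the config keys 
below exist today, but the mapping shown is the proposed behavior, not current 
Spark semantics; app.jar is a placeholder):
{noformat}
spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --num-executors 50 \
  app.jar

# Under the proposal this would behave as if the user had also set:
#   spark.dynamicAllocation.initialExecutors=50
#   spark.dynamicAllocation.maxExecutors=50
{noformat}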



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303102#comment-15303102
 ] 

Apache Spark commented on SPARK-13723:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/13338

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified while 
> dynamic allocation is enabled. Currently, if --num-executors is specified, 
> dynamic allocation is disabled and a static number of executors is used.
> I would rather see the default behavior changed in the 2.x line: if the dynamic 
> allocation config is on, then --num-executors sets both the maximum and the 
> initial number of executors. I think this would allow users to easily cap their 
> usage while still allowing executors to be freed up. It would also let users 
> doing ML start out with a given number of executors, and if they are actually 
> caching the data those executors wouldn't be freed up, so you would get very 
> similar behavior to having dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things in spark-shell. It also 
> has a big effect when people are doing MapReduce/ETL-type workloads. The 
> problem is that people are used to specifying --num-executors, so if we turn 
> dynamic allocation on by default in a cluster config it just gets overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13723:


Assignee: (was: Apache Spark)

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified while 
> dynamic allocation is enabled. Currently, if --num-executors is specified, 
> dynamic allocation is disabled and a static number of executors is used.
> I would rather see the default behavior changed in the 2.x line: if the dynamic 
> allocation config is on, then --num-executors sets both the maximum and the 
> initial number of executors. I think this would allow users to easily cap their 
> usage while still allowing executors to be freed up. It would also let users 
> doing ML start out with a given number of executors, and if they are actually 
> caching the data those executors wouldn't be freed up, so you would get very 
> similar behavior to having dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things in spark-shell. It also 
> has a big effect when people are doing MapReduce/ETL-type workloads. The 
> problem is that people are used to specifying --num-executors, so if we turn 
> dynamic allocation on by default in a cluster config it just gets overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13723:


Assignee: Apache Spark

> YARN - Change behavior of --num-executors when 
> spark.dynamicAllocation.enabled true
> ---
>
> Key: SPARK-13723
> URL: https://issues.apache.org/jira/browse/SPARK-13723
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Minor
>
> I think we should change the behavior when --num-executors is specified while 
> dynamic allocation is enabled. Currently, if --num-executors is specified, 
> dynamic allocation is disabled and a static number of executors is used.
> I would rather see the default behavior changed in the 2.x line: if the dynamic 
> allocation config is on, then --num-executors sets both the maximum and the 
> initial number of executors. I think this would allow users to easily cap their 
> usage while still allowing executors to be freed up. It would also let users 
> doing ML start out with a given number of executors, and if they are actually 
> caching the data those executors wouldn't be freed up, so you would get very 
> similar behavior to having dynamic allocation off.
> Part of the reason for this is that using a static number generally wastes 
> resources, especially with people doing ad hoc things in spark-shell. It also 
> has a big effect when people are doing MapReduce/ETL-type workloads. The 
> problem is that people are used to specifying --num-executors, so if we turn 
> dynamic allocation on by default in a cluster config it just gets overridden.
> We should also update the spark-submit --help description for --num-executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13108) Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)

2016-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-13108.
---
Resolution: Later

Closing as later. For more information, see 
https://github.com/apache/spark/pull/11016

> Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
> -
>
> Key: SPARK-13108
> URL: https://issues.apache.org/jira/browse/SPARK-13108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This library uses Hadoop's 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
>  which uses 
> [{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
> According to 
> [MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
>  does not guarantee all encoding types but officially supports only UTF-8 (as 
> commented in 
> [{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
> According to 
> [MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
> it still looks fine with most encodings, though not with UTF-16/32.
> In more detail,
> I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into 
> `cars_utf-16.csv` as below:
> {code}
> iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
> {code}
> and run the codes below:
> {code}
> val cars = "cars_utf-16.csv"
> sqlContext.read
>   .format("csv")
>   .option("charset", "utf-16")
>   .option("delimiter", 'þ')
>   .load(cars)
>   .show()
> {code}
> This produces the wrong results below:
> {code}
> +----+-----+-----+--------------------+------+
> |year| make|model|             comment|blank�|
> +----+-----+-----+--------------------+------+
> |2012|Tesla|    S|          No comment|     �|
> |   �| null| null|                null|  null|
> |1997| Ford| E350|Go get one now th...|     �|
> |2015|Chevy|Volt�|                null|  null|
> |   �| null| null|                null|  null|
> +----+-----+-----+--------------------+------+
> {code}
> Instead of the correct results below:
> {code}
> +----+-----+-----+--------------------+-----+
> |year| make|model|             comment|blank|
> +----+-----+-----+--------------------+-----+
> |2012|Tesla|    S|          No comment|     |
> |1997| Ford| E350|Go get one now th...|     |
> |2015|Chevy| Volt|                null| null|
> +----+-----+-----+--------------------+-----+
> {code}
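
Until non-UTF-8-compatible encodings are supported, one possible workaround 
(assuming iconv is available, as in the reproduction above) is to transcode the 
input to UTF-8 before reading it, and then drop the non-UTF-8 charset option:
{code}
iconv -f utf-16 -t utf-8 < cars_utf-16.csv > cars_utf-8.csv
{code}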



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15529) Replace SQLContext and HiveContext with SparkSession in Test

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15529:


Assignee: (was: Apache Spark)

> Replace SQLContext and HiveContext with SparkSession in Test
> 
>
> Key: SPARK-15529
> URL: https://issues.apache.org/jira/browse/SPARK-15529
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Use the latest SparkSession to replace the existing SQLContext and 
> HiveContext in test cases.
> No change will be made in the following suites:
> {{listTablesSuite}} is to test the APIs of {{SQLContext}}.
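
For context, the replacement pattern looks roughly like the following sketch 
(not the exact test-suite code):
{code}
import org.apache.spark.sql.SparkSession

// Before: suites created a SQLContext/HiveContext directly.
// After: they obtain everything through SparkSession.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("test")
  .getOrCreate()

val df = spark.range(10).toDF("id")
// spark.sqlContext is still available where an actual SQLContext is required.
{code}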



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15529) Replace SQLContext and HiveContext with SparkSession in Test

2016-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15529:


Assignee: Apache Spark

> Replace SQLContext and HiveContext with SparkSession in Test
> 
>
> Key: SPARK-15529
> URL: https://issues.apache.org/jira/browse/SPARK-15529
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Use the latest SparkSession to replace the existing SQLContext and 
> HiveContext in test cases.
> No change will be made in the following suites:
> {{listTablesSuite}} is to test the APIs of {{SQLContext}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15529) Replace SQLContext and HiveContext with SparkSession in Test

2016-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303086#comment-15303086
 ] 

Apache Spark commented on SPARK-15529:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13337

> Replace SQLContext and HiveContext with SparkSession in Test
> 
>
> Key: SPARK-15529
> URL: https://issues.apache.org/jira/browse/SPARK-15529
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Use the latest SparkSession to replace the existing SQLContext and 
> HiveContext in test cases.
> No change will be made in the following suites:
> {{listTablesSuite}} is to test the APIs of {{SQLContext}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15584:
-

 Summary: Abstract duplicate code: "spark.sql.sources." properties
 Key: SPARK-15584
 URL: https://issues.apache.org/jira/browse/SPARK-15584
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
etc. everywhere. If we mistype something then things will silently fail. This 
is pretty brittle. It would be better if we had static variables that we can 
reuse.
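
An illustrative sketch of the kind of refactoring this suggests; the object and 
field names are hypothetical, not the actual Spark internals:
{code}
// Central place for the "spark.sql.sources." property keys, so a typo becomes
// a compile error instead of a silent failure at runtime.
object DataSourcePropertyKeys {
  private val Prefix = "spark.sql.sources."
  val Provider: String = Prefix + "provider"
  val NumParts: String = Prefix + "numParts"
}

// Call sites then reference DataSourcePropertyKeys.Provider instead of
// repeating the raw string literal.
{code}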



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15584:
--
Issue Type: Improvement  (was: Bug)

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303054#comment-15303054
 ] 

Andrew Or commented on SPARK-15584:
---

[~dongjoon] would you like to work on this?

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15584:
--
Assignee: (was: Andrew Or)

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15584:
--
Priority: Minor  (was: Major)

> Abstract duplicate code: "spark.sql.sources." properties
> 
>
> Key: SPARK-15584
> URL: https://issues.apache.org/jira/browse/SPARK-15584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Priority: Minor
>
> Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" 
> etc. everywhere. If we mistype something then things will silently fail. This 
> is pretty brittle. It would be better if we had static variables that we can 
> reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures

2016-05-26 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303030#comment-15303030
 ] 

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 5/26/16 10:08 PM:


Nope. Well, I'm trying to understand how/if that's possible, if a solution 
exists or if it doesn't fit the picture altogether. If Java 8 Lambdas work, 
then closures, which are based on the same SAM principle, should also work, no? 
SPARK-2171 advertises that Groovy closures work, but most probably the tests 
were done in local[] mode, where the classes do exist.

I see the Groovy script executing the first stage (cache) which only invokes 
Java-code, but when the second step needs executing (with the flatMap closure 
written in Groovy) ... that doesn't translate to something the executors can 
understand/find.

Solutions of sending the text/byte-code around to make the inner classes of the 
script visible to the executors at run-time are also what I'm thinking, but I 
see them as work-arounds rather than first-class citizens of the framework. I'd 
like to help build this support if it does not exist in Spark, or at least 
document how it's possible to make them work.

For this, however, I need some guidance on where to look and what to hack at to try 
to make it work :) ...


was (Author: antauri):
Nope. Well, I'm trying to understand how/if that's possible, if a solution 
exists or if it doesn't fit the picture altogether. If Java 8 Lambdas work, 
then closures, which are based on the same SAM principle, should also work, no? 

I see the Groovy script executing the first stage (cache) which only invokes 
Java-code, but when the second step needs executing (with the flatMap closure 
written in Groovy) ... that doesn't translate to something the executors can 
understand/find.

Solutions of sending the text/byte-code around to make the inner classes of the 
script visible to the executors at run-time are also what I'm thinking, but I 
see them as work-arounds rather than first-class citizens of the framework. I'd 
like to help build this support if it does not exist in Spark, or at least 
document how it's possible to make them work.

For this, however, I need some guidance on where to look and what to hack at to try 
to make it work :) ...

> Support for Groovy closures
> ---
>
> Key: SPARK-15582
> URL: https://issues.apache.org/jira/browse/SPARK-15582
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Environment: 6 node Debian 8 based Spark cluster
>Reporter: Catalin Alexandru Zamfir
>
> After fixing SPARK-13599 and running one of our jobs against this fix for 
> Groovy dependencies (which indeed it fixed), we see the Spark executors stuck 
> at a ClassNotFound exception when running as a Script (via 
> GroovyShell.evaluate (scriptText)). It seems Spark cannot de-serialize the 
> closure, or the closure is not received by the executor.
> {noformat}
> sparkContext.binaryFiles (ourPath).flatMap ({ onePathEntry -> code-block } as 
> FlatMapFunction).count ();
> { onePathEntry -> code-block } denotes a Groovy closure.
> {noformat}
> There is a groovy-spark example @ 
> https://github.com/bunions1/groovy-spark-example ... However the above uses a 
> modified Groovy. If my understanding is correct, Groovy compiles to JVM 
> byte-code, which should make it easy for Spark to pick up and use the closures.
> The above example code fails with this stack-trace:
> {noformat}
> Caused by: java.lang.ClassNotFoundException: Script1$_run_closure1
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at 

[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures

2016-05-26 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303030#comment-15303030
 ] 

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 5/26/16 10:04 PM:


Nope. Well, I'm trying to understand how/if that's possible, if a solution 
exists or if it doesn't fit the picture altogether. If Java 8 Lambdas work, 
then closures, which are based on the same SAM principle, should also work, no? 

I see the Groovy script executing the first stage (cache) which only invokes 
Java-code, but when the second step needs executing (with the flatMap closure 
written in Groovy) ... that doesn't translate to something the executors can 
understand/find.

Solutions of sending the text/byte-code around to make the inner classes of the 
script visible to the executors at run-time are also what I'm thinking, but I 
see them as work-arounds rather than first-class citizens of the framework. I'd 
like to help build this support if it does not exist in Spark, or at least 
document how it's possible to make them work.

For this, however, I need some guidance on where to look and what to hack at to try 
to make it work :) ...


was (Author: antauri):
Nope. Well, I'm trying to understand how/if that's possible, if a solution 
exists or if it doesn't fit the picture altogether. If Java 8 Lambdas work, 
then closures, which are based on the same SAM principle, should also work, no? 

I see the Groovy script executing the first stage (cache) which only invokes 
Java-code, but when the second step needs executing (with the flatMap closure 
written in Groovy) ... that doesn't translate to something the executors can 
understand/find.

Solutions of sending the text/byte-code around to make the inner classes of the 
script visible to the executors at run-time are also what I'm thinking, but I 
see them as work-arounds rather than first-class citizens of the framework. I'd 
like to help build this support if it does not exist in Spark, or at least 
document how it's possible to make them work.

> Support for Groovy closures
> ---
>
> Key: SPARK-15582
> URL: https://issues.apache.org/jira/browse/SPARK-15582
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Environment: 6 node Debian 8 based Spark cluster
>Reporter: Catalin Alexandru Zamfir
>
> After fixing SPARK-13599 and running one of our jobs against this fix for 
> Groovy dependencies (which indeed it fixed), we see the Spark executors stuck 
> at a ClassNotFound exception when running as a Script (via 
> GroovyShell.evaluate (scriptText)). It seems Spark cannot de-serialize the 
> closure, or the closure is not received by the executor.
> {noformat}
> sparkContext.binaryFiles (ourPath).flatMap ({ onePathEntry -> code-block } as 
> FlatMapFunction).count ();
> { onePathEntry -> code-block } denotes a Groovy closure.
> {noformat}
> There is a groovy-spark example @ 
> https://github.com/bunions1/groovy-spark-example ... However the above uses a 
> modified Groovy. If my understanding is correct, Groovy compiles to JVM 
> byte-code, which should make it easy for Spark to pick up and use the closures.
> The above example code fails with this stack-trace:
> {noformat}
> Caused by: java.lang.ClassNotFoundException: Script1$_run_closure1
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> 

[jira] [Commented] (SPARK-15582) Support for Groovy closures

2016-05-26 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303030#comment-15303030
 ] 

Catalin Alexandru Zamfir commented on SPARK-15582:
--

Nope. Well, I'm trying to understand how/if that's possible, if a solution 
exists or if it doesn't fit the picture altogether. If Java 8 Lambdas work, 
then closures, which are based on the same SAM principle, should also work, no? 

I see the Groovy script executing the first stage (cache) which only invokes 
Java-code, but when the second step needs executing (with the flatMap closure 
written in Groovy) ... that doesn't translate to something the executors can 
understand/find.

Solutions of sending the text/byte-code around to make the inner classes of the 
script visible to the executors at run-time are also what I'm thinking, but I 
see them as work-arounds rather than first-class citizens of the framework. I'd 
like to help build this support if it does not exist in Spark, or at least 
document how it's possible to make them work.

> Support for Groovy closures
> ---
>
> Key: SPARK-15582
> URL: https://issues.apache.org/jira/browse/SPARK-15582
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Environment: 6 node Debian 8 based Spark cluster
>Reporter: Catalin Alexandru Zamfir
>
> After fixing SPARK-13599 and running one of our jobs against this fix for 
> Groovy dependencies (which indeed it fixed), we see the Spark executors stuck 
> at a ClassNotFound exception when running as a Script (via 
> GroovyShell.evaluate (scriptText)). It seems Spark cannot de-serialize the 
> closure, or the closure is not received by the executor.
> {noformat}
> sparkContext.binaryFiles (ourPath).flatMap ({ onePathEntry -> code-block } as 
> FlatMapFunction).count ();
> { onePathEntry -> code-block } denotes a Groovy closure.
> {noformat}
> There is a groovy-spark example @ 
> https://github.com/bunions1/groovy-spark-example ... However the above uses a 
> modified Groovy. If my understanding is correct, Groovy compiles to JVM 
> byte-code, which should make it easy for Spark to pick up and use the closures.
> The above example code fails with this stack-trace:
> {noformat}
> Caused by: java.lang.ClassNotFoundException: Script1$_run_closure1
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> 

[jira] [Updated] (SPARK-15532) SQLContext/HiveContext's public constructors should use SparkSession.builder.getOrCreate

2016-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15532:
-
Description: SQLContext/HiveContext's public constructors should use 
SparkSession.builder.getOrCreate and we can remove SQLContext's isRootContext.  
(was: 
https://github.com/apache/spark/commit/f2ee0ed4b7ecb2855cc4928a9613a07d45446f4e#diff-131c27c6a1f59770d738b11f2a4755ecL87
 removed SQLConf.ALLOW_MULTIPLE_CONTEXTS. The check associated with this flag 
is useful when a deployment of Spark does not allow users to manually create 
SQLContext. So, let's add this flag and its associated check back.)

> SQLContext/HiveContext's public constructors should use 
> SparkSession.builder.getOrCreate
> --
>
> Key: SPARK-15532
> URL: https://issues.apache.org/jira/browse/SPARK-15532
> Project: Spark
>  Issue Type: Bug
>Reporter: Yin Huai
>
> SQLContext/HiveContext's public constructors should use 
> SparkSession.builder.getOrCreate and we can remove SQLContext's isRootContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


