[jira] [Created] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model

2016-08-29 Thread Aseem Bansal (JIRA)
Aseem Bansal created SPARK-17307:


 Summary: Document what all access is needed on S3 bucket when 
trying to save a model
 Key: SPARK-17307
 URL: https://issues.apache.org/jira/browse/SPARK-17307
 Project: Spark
  Issue Type: Documentation
Reporter: Aseem Bansal


I ran into this lack of documentation while trying to save a model to S3. 
Initially I thought write access alone should be enough. Then I found that 
delete access is also needed, to remove temporary files. After requesting 
delete access and trying again, I now get the error

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: 
org.jets3t.service.S3ServiceException: S3 PUT failed for '/dev-qa_%24folder%24' 
XML Error Message

The code below can be used to reproduce this error:

{code}
SparkSession sparkSession = SparkSession
    .builder()
    .appName("my app")
    .master("local")
    .getOrCreate();

JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());

// Placeholders only: substitute real AWS credentials here.
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<access-key-id>");
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<secret-access-key>");

// Create a PipelineModel (e.g. by fitting a Pipeline), then try to save it to S3.
pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest");
{code}

This back and forth could be avoided if the documentation clearly stated what 
access Spark needs in order to write to S3. It would also be great to explain 
why each of those permissions is needed.






[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings

2016-08-29 Thread Teng Yutong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447793#comment-15447793
 ] 

Teng Yutong commented on SPARK-17290:
-

So this is a known issue... Sorry for the duplication. Should I just close this one?

> Spark CSVInferSchema does not always respect nullValue settings
> ---
>
> Key: SPARK-17290
> URL: https://issues.apache.org/jira/browse/SPARK-17290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Teng Yutong
>
> When loading a CSV-formatted data file into a table that has a boolean type 
> column, if the boolean value is not given and the nullValue option has been set, 
> CSVInferSchema will fail to parse the data.
> e.g.: 
> table schema: create table test(id varchar(10), flag boolean) USING 
> com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue '')
> csv data example:
> aa,
> bb,true
> cc,false
> After some investigation, I found that CSVInferSchema will not check whether 
> the current string matches the nullValue if the target data type is 
> Boolean, Timestamp, or Date.
> I am wondering whether this logic is intentional or not.
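For reference, a minimal reproduction sketch of the behaviour described above, using the Spark 2.0 DataFrameReader instead of the SQL USING syntax (the session setup and the "test.csv" path are assumptions, not part of the original report):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("csv-nullValue-repro")
  .master("local")
  .getOrCreate()

// Same shape as the table above: a string id and a boolean flag.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("flag", BooleanType)))

// "test.csv" contains the three rows shown above (aa, / bb,true / cc,false).
val df = spark.read
  .schema(schema)
  .option("header", "false")
  .option("nullValue", "")
  .csv("test.csv")

// Expected: the empty flag field for "aa" is treated as null.
// As reported above, parsing instead fails because the empty string is not
// compared against nullValue for Boolean (and Timestamp/Date) columns.
df.show()
{code}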






[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-29 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447791#comment-15447791
 ] 

Sun Rui commented on SPARK-13525:
-

Yes, if Spark fails to launch the R worker process, it should throw an 
IOException. So throwing a SocketTimeoutException implies that the R worker 
process was launched successfully. But your experiment shows that daemon.R 
never seems to get a chance to be executed by Rscript. So I guess something 
goes wrong after Rscript is launched but before it begins to interpret the R 
script in daemon.R.

Could you write a shell wrapper for Rscript and set the configuration option to 
point to it? The wrapper should write some debug information to a temp file and 
then launch Rscript. Remember to redirect the stdout and stderr of Rscript for 
debugging.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary and filter methods are not overridden by Spark, so I 
> need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> 

[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings

2016-08-29 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447775#comment-15447775
 ] 

Liwei Lin commented on SPARK-17290:
---

Oh [~hyukjin.kwon] you are so devoted :-D

> Spark CSVInferSchema does not always respect nullValue settings
> ---
>
> Key: SPARK-17290
> URL: https://issues.apache.org/jira/browse/SPARK-17290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Teng Yutong
>
> When loading a CSV-formatted data file into a table that has a boolean type 
> column, if the boolean value is not given and the nullValue option has been set, 
> CSVInferSchema will fail to parse the data.
> e.g.: 
> table schema: create table test(id varchar(10), flag boolean) USING 
> com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue '')
> csv data example:
> aa,
> bb,true
> cc,false
> After some investigation, I found that CSVInferSchema will not check whether 
> the current string matches the nullValue if the target data type is 
> Boolean, Timestamp, or Date.
> I am wondering whether this logic is intentional or not.






[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings

2016-08-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447758#comment-15447758
 ] 

Hyukjin Kwon commented on SPARK-17290:
--

BTW, there is a related PR here, https://github.com/apache/spark/pull/14118

> Spark CSVInferSchema does not always respect nullValue settings
> ---
>
> Key: SPARK-17290
> URL: https://issues.apache.org/jira/browse/SPARK-17290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Teng Yutong
>
> When loading a CSV-formatted data file into a table that has a boolean type 
> column, if the boolean value is not given and the nullValue option has been set, 
> CSVInferSchema will fail to parse the data.
> e.g.: 
> table schema: create table test(id varchar(10), flag boolean) USING 
> com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue '')
> csv data example:
> aa,
> bb,true
> cc,false
> After some investigation, I found that CSVInferSchema will not check whether 
> the current string matches the nullValue if the target data type is 
> Boolean, Timestamp, or Date.
> I am wondering whether this logic is intentional or not.






[jira] [Commented] (SPARK-17290) Spark CSVInferSchema does not always respect nullValue settings

2016-08-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447755#comment-15447755
 ] 

Hyukjin Kwon commented on SPARK-17290:
--

This should be a duplicate of SPARK-16462, SPARK-16460, SPARK-15144 and 
SPARK-16903

> Spark CSVInferSchema does not always respect nullValue settings
> ---
>
> Key: SPARK-17290
> URL: https://issues.apache.org/jira/browse/SPARK-17290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Teng Yutong
>
> When loading a CSV-formatted data file into a table that has a boolean type 
> column, if the boolean value is not given and the nullValue option has been set, 
> CSVInferSchema will fail to parse the data.
> e.g.: 
> table schema: create table test(id varchar(10), flag boolean) USING 
> com.databricks.spark.csv OPTIONS (path "test.csv", header "false", nullValue '')
> csv data example:
> aa,
> bb,true
> cc,false
> After some investigation, I found that CSVInferSchema will not check whether 
> the current string matches the nullValue if the target data type is 
> Boolean, Timestamp, or Date.
> I am wondering whether this logic is intentional or not.






[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage

2016-08-29 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447710#comment-15447710
 ] 

Sun Rui commented on SPARK-13573:
-

[~chipsenkbeil], we have made public the methods for creating Java objects and 
invoking object methods: sparkR.callJMethod(), sparkR.callJStatic(), and 
sparkR.newJObject(). Please refer to SPARK-16581 and 
https://github.com/apache/spark/blob/master/R/pkg/R/jvm.R 

> Open SparkR APIs (R package) to allow better 3rd party usage
> 
>
> Key: SPARK-13573
> URL: https://issues.apache.org/jira/browse/SPARK-13573
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Chip Senkbeil
>
> Currently, SparkR's R package does not expose enough of its APIs to be used 
> flexibly. As far as I am aware, SparkR still requires you to create a new 
> SparkContext by invoking the sparkR.init method (so you cannot connect to a 
> running one), and there is no way to invoke custom Java methods using the 
> exposed SparkR API (unlike PySpark).
> We currently maintain a fork of SparkR that is used to power the R 
> implementation of Apache Toree, which is a gateway to use Apache Spark. This 
> fork provides a connect method (to use an existing Spark Context), exposes 
> needed methods like invokeJava (to be able to communicate with our JVM to 
> retrieve code to run, etc), and uses reflection to access 
> org.apache.spark.api.r.RBackend.
> Here is the documentation I recorded regarding changes we need to enable 
> SparkR as an option for Apache Toree: 
> https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources






[jira] [Resolved] (SPARK-17301) Remove unused classTag field from AtomicType base class

2016-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17301.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Remove unused classTag field from AtomicType base class
> ---
>
> Key: SPARK-17301
> URL: https://issues.apache.org/jira/browse/SPARK-17301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which 
> is causing unnecessary slowness in deserialization because it needs to grab 
> ScalaReflectionLock and create a new runtime reflection mirror. Removing this 
> unused code gives a small but measurable performance boost in SQL task 
> deserialization.






[jira] [Assigned] (SPARK-3162) Train DecisionTree locally when possible

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3162:
---

Assignee: Apache Spark

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.
> However, at deeper levels of the tree, only a small subset of the training data
> may be matched to any given node. If a node's training data can fit in one
> machine's memory, it may be more efficient to shuffle the data and do local
> training for the rest of the subtree rooted at that node.
> Note: Local training may become possible at different levels in different
> branches of the tree. There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained
> locally. This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with
> distributed training of the other branches.






[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447662#comment-15447662
 ] 

Apache Spark commented on SPARK-3162:
-

User 'smurching' has created a pull request for this issue:
https://github.com/apache/spark/pull/14872

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.
> However, at deeper levels of the tree, only a small subset of the training data
> may be matched to any given node. If a node's training data can fit in one
> machine's memory, it may be more efficient to shuffle the data and do local
> training for the rest of the subtree rooted at that node.
> Note: Local training may become possible at different levels in different
> branches of the tree. There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained
> locally. This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with
> distributed training of the other branches.






[jira] [Assigned] (SPARK-3162) Train DecisionTree locally when possible

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3162:
---

Assignee: (was: Apache Spark)

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.
> However, at deeper levels of the tree, only a small subset of the training data
> may be matched to any given node. If a node's training data can fit in one
> machine's memory, it may be more efficient to shuffle the data and do local
> training for the rest of the subtree rooted at that node.
> Note: Local training may become possible at different levels in different
> branches of the tree. There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained
> locally. This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with
> distributed training of the other branches.






[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447642#comment-15447642
 ] 

Srinath commented on SPARK-17298:
-

I've updated the description. Hopefully it is clearer.
Note that before this change, even with spark.sql.crossJoin.enabled = false,
case 1.a may sometimes NOT throw an error (i.e. it executes successfully), 
depending on the physical plan chosen.
With the proposed change, it would always throw an error.

> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration (spark.sql.crossJoin.enabled = false).
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.
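As an illustration of the proposed behaviour, here is a sketch assuming a spark-shell session where {{spark}} is the SparkSession and spark.sql.crossJoin.enabled is left at its default of false (the crossJoin DataFrame method is the API proposed by this issue, not an existing call):

{code}
val left  = spark.range(3).toDF("a")
val right = spark.range(3).toDF("b")
left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")

// No join condition relating l and r, i.e. a cartesian product.
// Under the proposal this fails with an error asking for an explicit CROSS JOIN:
// spark.sql("SELECT * FROM l JOIN r").show()

// The explicit syntax remains allowed:
spark.sql("SELECT * FROM l CROSS JOIN r").show()

// Proposed DataFrame equivalent:
// left.crossJoin(right).show()
{code}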






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration (spark.sql.crossJoin.enabled = false).
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration with spark.sql.crossJoin.enabled = false.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration (spark.sql.crossJoin.enabled = false).
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
cross_join_.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 

Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration 
flag will disable this check and allow cartesian products without an explicit 
cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default cross_join_.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Description: 
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
configuration with spark.sql.crossJoin.enabled = false.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 
Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.

  was:
Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations under the default 
cross_join_.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. 

Turning on the spark.sql.crossJoin.enabled configuration flag will disable this 
check and allow cartesian products without an explicit cross join.


> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations under the 
> default configuration with spark.sql.crossJoin.enabled = false.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. 
> Turning on the spark.sql.crossJoin.enabled configuration flag will disable 
> this check and allow cartesian products without an explicit cross join.






[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products by default

2016-08-29 Thread Srinath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinath updated SPARK-17298:

Summary: Require explicit CROSS join for cartesian products by default  
(was: Require explicit CROSS join for cartesian products)

> Require explicit CROSS join for cartesian products by default
> -
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.






[jira] [Updated] (SPARK-17306) Memory leak in QuantileSummaries

2016-08-29 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17306:
---
Component/s: SQL

> Memory leak in QuantileSummaries
> 
>
> Key: SPARK-17306
> URL: https://issues.apache.org/jira/browse/SPARK-17306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> compressThreshold is not referenced anywhere:
> {code}
> class QuantileSummaries(
> val compressThreshold: Int,
> val relativeError: Double,
> val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty,
> private[stat] var count: Long = 0L,
> val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends 
> Serializable
> {code}
> And it causes a memory leak: QuantileSummaries can take unbounded memory.
> {code}
> val summary = new QuantileSummaries(1, relativeError = 0.001)
> // Results in creating an array of size 1 !!! 
> (1 to 1).foreach(summary.insert(_))
> {code}






[jira] [Created] (SPARK-17306) Memory leak in QuantileSummaries

2016-08-29 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17306:
--

 Summary: Memory leak in QuantileSummaries
 Key: SPARK-17306
 URL: https://issues.apache.org/jira/browse/SPARK-17306
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


compressThreshold is not referenced anywhere:

{code}
class QuantileSummaries(
val compressThreshold: Int,
val relativeError: Double,
val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty,
private[stat] var count: Long = 0L,
val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends 
Serializable
{code}

And it causes a memory leak: QuantileSummaries can take unbounded memory.
{code}
val summary = new QuantileSummaries(1, relativeError = 0.001)
// Results in creating an array of size 1 !!! 
(1 to 1).foreach(summary.insert(_))
{code}
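For context, a hedged sketch of the invariant a compressThreshold is usually meant to enforce (this is not the Spark implementation; the BoundedSummary class is purely illustrative):

{code}
import scala.collection.mutable.ArrayBuffer

// Illustration only: insert() keeps the head buffer bounded by flushing it
// into the compressed representation once it reaches compressThreshold.
class BoundedSummary(val compressThreshold: Int) extends Serializable {
  private val headSampled = ArrayBuffer.empty[Double]
  private var count: Long = 0L

  def insert(x: Double): Unit = {
    headSampled += x
    count += 1
    if (headSampled.size >= compressThreshold) compress()
  }

  private def compress(): Unit = {
    // merge headSampled into the compressed summary here, then release it
    headSampled.clear()
  }
}

// With the threshold honoured, memory stays bounded regardless of how many
// values are inserted.
val summary = new BoundedSummary(compressThreshold = 1000)
(1 to 1000000).foreach(i => summary.insert(i.toDouble))
{code}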






[jira] [Commented] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447518#comment-15447518
 ] 

Apache Spark commented on SPARK-17304:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14871

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time. 
> This method was introduced in SPARK-15865, so this is a performance 
> regression in 2.1.0-SNAPSHOT.






[jira] [Assigned] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17304:


Assignee: Apache Spark  (was: Josh Rosen)

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time. 
> This method was introduced in SPARK-15865, so this is a performance 
> regression in 2.1.0-SNAPSHOT.






[jira] [Assigned] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17304:


Assignee: Josh Rosen  (was: Apache Spark)

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time. 
> This method was introduced in SPARK-15865, so this is a performance 
> regression in 2.1.0-SNAPSHOT.






[jira] [Created] (SPARK-17305) Cannot save ML PipelineModel in pyspark, PipelineModel.params still return null values

2016-08-29 Thread Hechao Sun (JIRA)
Hechao Sun created SPARK-17305:
--

 Summary: Cannot save ML PipelineModel in pyspark,  
PipelineModel.params still return null values
 Key: SPARK-17305
 URL: https://issues.apache.org/jira/browse/SPARK-17305
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
 Environment: Python 2.7 Anaconda2 (64-bit) IDE
Spark standalone mode
Reporter: Hechao Sun


I used the pyspark.ml module to run standalone ML tasks, but when I tried to save 
the PipelineModel, it gave me the following error messages:

Py4JJavaError: An error occurred while calling o8753.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2275.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2275.0 
(TID 7942, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:815)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:798)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:731)
at 
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:225)
at 
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:209)
at 
org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:305)
at 
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:294)
at 
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:326)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:393)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:802)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1199)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1904)
at 

[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17304:
---
Description: 
If you run

{code}
sc.parallelize(1 to 10, 10).map(identity).count()
{code}

then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
performance hotspot in the scheduler, accounting for over half of the time. 
This method was introduced in SPARK-15865, so this is a performance regression 
in 2.1.0-SNAPSHOT.

  was:
If you run

{code}
sc.parallelize(1 to 10, 10).map(identity).count()
{code}

then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
performance hotspot in the scheduler, accounting for over half of the time.


> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time. 
> This method was introduced in SPARK-15865, so this is a performance 
> regression in 2.1.0-SNAPSHOT.






[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17304:
---
Target Version/s: 2.1.0

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time.






[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17304:
---
Affects Version/s: 2.1.0

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time.






[jira] [Updated] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17304:
---
  Assignee: Josh Rosen
Issue Type: Bug  (was: Improvement)

> TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler 
> benchmark
> -
>
> Key: SPARK-17304
> URL: https://issues.apache.org/jira/browse/SPARK-17304
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> If you run
> {code}
> sc.parallelize(1 to 10, 10).map(identity).count()
> {code}
> then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
> performance hotspot in the scheduler, accounting for over half of the time.






[jira] [Created] (SPARK-17304) TaskSetManager.abortIfCompletelyBlacklisted is a perf. hotspot in scheduler benchmark

2016-08-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17304:
--

 Summary: TaskSetManager.abortIfCompletelyBlacklisted is a perf. 
hotspot in scheduler benchmark
 Key: SPARK-17304
 URL: https://issues.apache.org/jira/browse/SPARK-17304
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Josh Rosen
Priority: Minor


If you run

{code}
sc.parallelize(1 to 10, 10).map(identity).count()
{code}

then {{TaskSetManager.abortIfCompletelyBlacklisted()}} is the number-one 
performance hotspot in the scheduler, accounting for over half of the time.






[jira] [Commented] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447344#comment-15447344
 ] 

Apache Spark commented on SPARK-17303:
--

User 'frreiss' has created a pull request for this issue:
https://github.com/apache/spark/pull/14870

> dev/run-tests fails if spark-warehouse directory exists
> ---
>
> Key: SPARK-17303
> URL: https://issues.apache.org/jira/browse/SPARK-17303
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Priority: Minor
>
> The script dev/run-tests, which is intended for verifying the correctness of 
> pull requests, runs Apache RAT to check for missing Apache license headers. 
> Later, the script does a full compile/package/test sequence.
> The script as currently written works fine the first time. But the second 
> time it runs, the Apache RAT checks fail due to the presence of the directory 
> spark-warehouse, which the script indirectly creates during its regression 
> test run.






[jira] [Assigned] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17303:


Assignee: (was: Apache Spark)

> dev/run-tests fails if spark-warehouse directory exists
> ---
>
> Key: SPARK-17303
> URL: https://issues.apache.org/jira/browse/SPARK-17303
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Priority: Minor
>
> The script dev/run-tests, which is intended for verifying the correctness of 
> pull requests, runs Apache RAT to check for missing Apache license headers. 
> Later, the script does a full compile/package/test sequence.
> The script as currently written works fine the first time. But the second 
> time it runs, the Apache RAT checks fail due to the presence of the directory 
> spark-warehouse, which the script indirectly creates during its regression 
> test run.






[jira] [Assigned] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17303:


Assignee: Apache Spark

> dev/run-tests fails if spark-warehouse directory exists
> ---
>
> Key: SPARK-17303
> URL: https://issues.apache.org/jira/browse/SPARK-17303
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Assignee: Apache Spark
>Priority: Minor
>
> The script dev/run-tests, which is intended for verifying the correctness of 
> pull requests, runs Apache RAT to check for missing Apache license headers. 
> Later, the script does a full compile/package/test sequence.
> The script as currently written works fine the first time. But the second 
> time it runs, the Apache RAT checks fail due to the presence of the directory 
> spark-warehouse, which the script indirectly creates during its regression 
> test run.






[jira] [Created] (SPARK-17303) dev/run-tests fails if spark-warehouse directory exists

2016-08-29 Thread Frederick Reiss (JIRA)
Frederick Reiss created SPARK-17303:
---

 Summary: dev/run-tests fails if spark-warehouse directory exists
 Key: SPARK-17303
 URL: https://issues.apache.org/jira/browse/SPARK-17303
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Frederick Reiss
Priority: Minor


The script dev/run-tests, which is intended for verifying the correctness of 
pull requests, runs Apache RAT to check for missing Apache license headers. 
Later, the script does a full compile/package/test sequence.

The script as currently written works fine the first time. But the second time 
it runs, the Apache RAT checks fail due to the presence of the directory 
spark-warehouse, which the script indirectly creates during its regression test 
run.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447285#comment-15447285
 ] 

Gang Wu commented on SPARK-17243:
-

Yup you're right. I finally got some app_ids that were not in the summary page 
but their urls can be accessed. Our cluster has 100K+ app_ids so it took me a 
long time to figure it out. Thanks for your help!

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
> history summary..." indefinitely and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to the 
> /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447266#comment-15447266
 ] 

Alex Bozarth commented on SPARK-17243:
--

That's odd. How long did you wait before accessing the app URL? The history 
server still needs to propagate after starting, and that can take a long time. 
I was testing with a limit of 50 and an app in the thousands, and it took about 
5 minutes to propagate before I could see it.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
> history summary..." indefinitely and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to the 
> /api/v1/applications endpoint of the history server and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of application histories it runs fine.






[jira] [Created] (SPARK-17302) Cannot set non-Spark SQL session variables in hive-site.xml, spark-defaults.conf, or using --conf

2016-08-29 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-17302:
-

 Summary: Cannot set non-Spark SQL session variables in 
hive-site.xml, spark-defaults.conf, or using --conf
 Key: SPARK-17302
 URL: https://issues.apache.org/jira/browse/SPARK-17302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Ryan Blue


When the configuration handling changed for 2.0 with the new SparkSession 
structure, Spark stopped using Hive's internal HiveConf for session state and 
now uses HiveSessionState and an associated SQLConf. Session options like 
hive.exec.compress.output and hive.exec.dynamic.partition.mode are now pulled 
from this SQLConf. This SQLConf doesn't include session properties from 
hive-site.xml (including hive.exec.compress.output), and it no longer contains 
Spark-specific overrides from spark-defaults.conf that used the 
spark.hadoop.hive... pattern.

Also, setting these variables on the command-line no longer works because 
settings must start with "spark.".

Is there a recommended way to set Hive session properties?
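For anyone hitting the same thing, here is a minimal workaround sketch (an assumption, not a documented recommendation): setting the Hive session options at runtime on the SparkSession, which still reaches the session SQLConf. The property values below are only examples.

{code}
import org.apache.spark.sql.SparkSession

// Workaround sketch: set Hive session options programmatically instead of via
// hive-site.xml or --conf. Not an official recommendation.
val spark = SparkSession.builder()
  .appName("hive-session-props")
  .enableHiveSupport()
  .getOrCreate()

// Option A: write into the session's runtime conf (backed by the SQLConf)
spark.conf.set("hive.exec.compress.output", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// Option B: a SQL SET command, which updates the same session configuration
spark.sql("SET hive.exec.compress.output=true")
{code}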



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17301) Remove unused classTag field from AtomicType base class

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17301:


Assignee: Apache Spark  (was: Josh Rosen)

> Remove unused classTag field from AtomicType base class
> ---
>
> Key: SPARK-17301
> URL: https://issues.apache.org/jira/browse/SPARK-17301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which 
> is causing unnecessary slowness in deserialization because it needs to grab 
> ScalaReflectionLock and create a new runtime reflection mirror. Removing this 
> unused code gives a small but measurable performance boost in SQL task 
> deserialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17301) Remove unused classTag field from AtomicType base class

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447191#comment-15447191
 ] 

Apache Spark commented on SPARK-17301:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14869

> Remove unused classTag field from AtomicType base class
> ---
>
> Key: SPARK-17301
> URL: https://issues.apache.org/jira/browse/SPARK-17301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which 
> is causing unnecessary slowness in deserialization because it needs to grab 
> ScalaReflectionLock and create a new runtime reflection mirror. Removing this 
> unused code gives a small but measurable performance boost in SQL task 
> deserialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17301) Remove unused classTag field from AtomicType base class

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17301:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove unused classTag field from AtomicType base class
> ---
>
> Key: SPARK-17301
> URL: https://issues.apache.org/jira/browse/SPARK-17301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which 
> is causing unnecessary slowness in deserialization because it needs to grab 
> ScalaReflectionLock and create a new runtime reflection mirror. Removing this 
> unused code gives a small but measurable performance boost in SQL task 
> deserialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17301) Remove unused classTag field from AtomicType base class

2016-08-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17301:
--

 Summary: Remove unused classTag field from AtomicType base class
 Key: SPARK-17301
 URL: https://issues.apache.org/jira/browse/SPARK-17301
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Minor


There's an unused {{classTag}} {{val}} in the {{AtomicType}} base class which 
is causing unnecessary slowness in deserialization because it needs to grab 
ScalaReflectionLock and create a new runtime reflection mirror. Removing this 
unused code gives a small but measurable performance boost in SQL task 
deserialization.
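For context, a rough, self-contained illustration (not the actual Spark source) of the kind of field in question: a {{ClassTag}} derived through runtime reflection, which has to build a reflection mirror (and, in Spark, take ScalaReflectionLock) when evaluated.

{code}
import scala.reflect.ClassTag
import scala.reflect.runtime.universe._

// Sketch only: mirrors the shape of the unused field, not Spark's real classes.
abstract class AtomicTypeSketch {
  type InternalType
  val tag: TypeTag[InternalType]

  // The costly part: materializing a ClassTag via a runtime reflection mirror.
  @transient lazy val classTag: ClassTag[InternalType] = {
    val mirror = runtimeMirror(getClass.getClassLoader)
    ClassTag[InternalType](mirror.runtimeClass(tag.tpe))
  }
}

object IntTypeSketch extends AtomicTypeSketch {
  type InternalType = Int
  val tag: TypeTag[InternalType] = typeTag[Int]
}
{code}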



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447172#comment-15447172
 ] 

Gang Wu commented on SPARK-17243:
-

I pulled in the latest change. I can get the full application list from the REST 
endpoint /api/v1/applications (without the limit parameter). However, the web UI 
reports that the app_id is not found when I navigate to a specific app_id. I can 
access the same application using the Spark 1.5 history server.
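For reference, a minimal sketch of hitting the REST endpoint in question directly; the host/port and the {{limit}} query parameter (from the PR under discussion) are assumptions, not released behaviour.

{code}
import scala.io.Source

// Sketch: fetch the application list straight from the history server REST API.
// localhost:18080 and the limit parameter are assumptions for illustration.
object HistoryRestCheck {
  def main(args: Array[String]): Unit = {
    val base = "http://localhost:18080/api/v1"

    // What historypage.js requests today: the full application list.
    val all = Source.fromURL(s"$base/applications").mkString
    println(s"full payload: ${all.length} bytes")

    // The capped variant, if/when the limit parameter from the PR is available.
    val capped = Source.fromURL(s"$base/applications?limit=50").mkString
    println(s"capped payload: ${capped.length} bytes")
  }
}
{code}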

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
> history summary..." and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server REST API and gets back a 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17300) ClosedChannelException caused by missing block manager when speculative tasks are killed

2016-08-29 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-17300:
-

 Summary: ClosedChannelException caused by missing block manager 
when speculative tasks are killed
 Key: SPARK-17300
 URL: https://issues.apache.org/jira/browse/SPARK-17300
 Project: Spark
  Issue Type: Bug
Reporter: Ryan Blue


We recently backported SPARK-10530 to our Spark build, which kills unnecessary 
duplicate/speculative tasks when one completes (either a speculative task or 
the original). In large jobs with 500+ executors, this caused some executors to 
die and resulted in the same error that was fixed by SPARK-15262: 
ClosedChannelException when trying to connect to the block manager on affected 
hosts.

{code}
java.nio.channels.ClosedChannelException
at 
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
at 
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at 
io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
at 
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447112#comment-15447112
 ] 

Alex Bozarth commented on SPARK-17243:
--

[~wgtmac] I'm not sure which version of the PR you tested. In my initial commit 
the issue you saw still existed, but I updated it EOD Friday to a version that 
only restricts the summary display, leaving all the applications available via 
their direct URLs as you would expect.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
> history summary..." and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server REST API and gets back a 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14662) LinearRegressionModel uses only default parameters if yStd is 0

2016-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14662.
---
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.0.0

This has now been solved in [SPARK-15339]

> LinearRegressionModel uses only default parameters if yStd is 0
> ---
>
> Key: SPARK-14662
> URL: https://issues.apache.org/jira/browse/SPARK-14662
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Louis Traynard
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the (rare) case when yStd is 0 in LinearRegression, parameters are not 
> copied immediately to the created LinearRegressionModel instance. But they 
> should be (as in all other cases), because currently only the default column 
> names are used in and returned by model.findSummaryModelAndPredictionCol(), 
> which leads to schema validation errors.
> The fix is two lines and should not be hard to check & apply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447099#comment-15447099
 ] 

Gang Wu commented on SPARK-17243:
-

I've tested this PR. It does reduce the size of the application metadata list. I 
think the intent is to restrict only the summary page; applications that are 
dropped from the summary web UI should still be available via their direct URLs, 
like http://x.x.x.x:18080/history/application_id/jobs. However, those dropped 
applications cannot be accessed. This may heavily reduce the usability of the 
history server.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying "Loading 
> history summary..." and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the /api/v1/applications endpoint of the history server REST API and gets back a 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application histories it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447015#comment-15447015
 ] 

Sean Owen commented on SPARK-17298:
---

Agreed, though spark.sql.crossJoin.enabled is false by default, so the queries 
result in an error right now. This disagrees with 
https://issues.apache.org/jira/browse/SPARK-17298?focusedCommentId=15446920=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15446920

This change allows those queries to work where they didn't before. My point is 
that it's inaccurate to say the change is to require this new syntax, because, 
as before, cartesian products can also be allowed by changing the flag.

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446988#comment-15446988
 ] 

Sameer Agarwal commented on SPARK-17298:


Sean, if I understand correctly, here are the new semantics Srinath is 
proposing:

1. Case 1: spark.sql.crossJoin.enabled = false
(a) select * from A inner join B *throws an error*
(b) select * from A cross join B *doesn't throw an error*
2. Case 2: spark.sql.crossJoin.enabled = true
(a) select * from A inner join B *doesn't throw an error*
(b) select * from A cross join B *doesn't throw an error*

1(a) and 2(a) conform to the existing semantics in Spark. This PR proposes 
1(b) and 2(b).

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17296:


Assignee: Apache Spark

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Assignee: Apache Spark
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17296:


Assignee: (was: Apache Spark)

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446979#comment-15446979
 ] 

Apache Spark commented on SPARK-17296:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14867

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446946#comment-15446946
 ] 

Sean Owen commented on SPARK-17298:
---

This should be an error unless you set the property to allow cartesian joins. 
At least, it was for me when I tried it this week on 1.6. It's possible 
something changed, in which case disregard this if you've tested it.

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446930#comment-15446930
 ] 

Sean Owen commented on SPARK-17299:
---

That's probably the intent, yeah. If the other engines treat TRIM as space-only, 
then this is a bug. I can see it is indeed implemented internally with 
String.trim().
CC [~chenghao] for 
https://github.com/apache/spark/commit/0b0b9ceaf73de472198c9804fb7ae61fa2a2e097

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446920#comment-15446920
 ] 

Srinath commented on SPARK-17298:
-

So if I do the following:

create temporary view nt1 as select * from values
  ("one", 1),
  ("two", 2),
  ("three", 3)
  as nt1(k, v1);

create temporary view nt2 as select * from values
  ("one", 1),
  ("two", 22),
  ("one", 5)
  as nt2(k, v2);

SELECT * FROM nt1, nt2; -- or
select * FROM nt1 inner join nt2;

The SELECT queries do not in fact result in an error. The proposed change would 
have them return an error

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-29 Thread Jeremy Beard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446902#comment-15446902
 ] 

Jeremy Beard commented on SPARK-17299:
--

What is the priority for compatibility with other SQL dialects? TRIM in (at 
least) Hive, Impala, Oracle, and Teradata is just for spaces.

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16581) Making JVM backend calling functions public

2016-08-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-16581.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14775
[https://github.com/apache/spark/pull/14775]

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
> Fix For: 2.0.1, 2.1.0
>
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it would be good to expose some of the R -> JVM functions 
> we have. 
> As part of this we could also rename and reformat the functions to make them 
> more user-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public

2016-08-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-16581:
-

Assignee: Shivaram Venkataraman

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.0.1, 2.1.0
>
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it would be good to expose some of the R -> JVM functions 
> we have. 
> As part of this we could also rename and reformat the functions to make them 
> more user-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17299:
--
   Priority: Minor  (was: Major)
Component/s: Documentation

I'm almost certain its intent is to match the behavior of String.trim() in 
Java. Unless there's a reason to believe it should behave otherwise I think the 
docs should be fixed to resolve this.

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-29 Thread Jeremy Beard (JIRA)
Jeremy Beard created SPARK-17299:


 Summary: TRIM/LTRIM/RTRIM strips characters other than spaces
 Key: SPARK-17299
 URL: https://issues.apache.org/jira/browse/SPARK-17299
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jeremy Beard


TRIM/LTRIM/RTRIM docs state that they only strip spaces:

http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)

But the implementation strips all characters of ASCII value 20 or less:

https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470
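For a quick illustration of the reported behaviour (this only demonstrates the JVM semantics that the linked implementation mirrors):

{code}
// java.lang.String.trim strips every leading/trailing character with code point
// <= U+0020 (tabs, newlines, control characters), not only the space character.
object TrimDemo extends App {
  val padded = "\t\n  hello world  \n\t"
  println("[" + padded.trim + "]")   // prints [hello world]
}
{code}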



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446740#comment-15446740
 ] 

Sean Owen commented on SPARK-17298:
---

This already results in an error. You mean that it will not result in an error 
if you specify it explicitly in the query, right?

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17298:


Assignee: Apache Spark

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Assignee: Apache Spark
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17298:


Assignee: (was: Apache Spark)

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446719#comment-15446719
 ] 

Apache Spark commented on SPARK-17298:
--

User 'srinathshankar' has created a pull request for this issue:
https://github.com/apache/spark/pull/14866

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446682#comment-15446682
 ] 

Srinath commented on SPARK-17298:
-

You are correct that with this change, queries of the form
{noformat}
select * from A inner join B
{noformat}
will now throw an error where previously they would not. 
The reason for this suggestion is that users may often forget to specify join 
conditions altogether, leading to incorrect, long-running queries. Requiring 
explicit cross joins helps clarify intent.

Turning on the spark.sql.crossJoin.enabled flag will revert to previous 
behavior.
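To make the intent concrete, a sketch using the nt1/nt2 example from the earlier comment (the {{crossJoin}} Dataset method is the API proposed in this ticket, so treat it as an assumption until the change lands):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val nt1 = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("k", "v1")
val nt2 = Seq(("one", 1), ("two", 22), ("one", 5)).toDF("k", "v2")

nt1.join(nt2, Seq("k"))   // keyed join: unaffected by this proposal
nt1.join(nt2)             // no join condition: rejected under the proposal unless spark.sql.crossJoin.enabled=true
nt1.crossJoin(nt2)        // proposed explicit form: always allowed
{code}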

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16578) Configurable hostname for RBackend

2016-08-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-16578:
--
Issue Type: New Feature  (was: Sub-task)
Parent: (was: SPARK-15799)

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check whether we can support this mode of execution and what the 
> pros and cons of it are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17063) MSCK REPAIR TABLE is super slow with Hive metastore

2016-08-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17063.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14607
[https://github.com/apache/spark/pull/14607]

> MSCK REPAIR TABLE is super slow with Hive metastore
> ---
>
> Key: SPARK-17063
> URL: https://issues.apache.org/jira/browse/SPARK-17063
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
>
> When repairing a table with thousands of partitions, it can take hundreds of 
> seconds: the Hive metastore can only add a few partitions per second, because 
> it lists all the files for each partition to gather the fast stats 
> (number of files, total size of files).
> We could improve this by listing the files in Spark in parallel, then sending 
> the fast stats to the Hive metastore to avoid this sequential listing.
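A rough sketch of the parallel-listing idea described above, just to make its shape concrete (the partition paths and the stats tuple are illustrative, not the actual implementation in the PR):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fast-stats-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Illustrative partition locations; in practice these come from the table's partition spec.
val partitionDirs = Seq(
  "hdfs:///warehouse/t/ds=2016-08-01",
  "hdfs:///warehouse/t/ds=2016-08-02")

// Compute (path, numFiles, totalSize) on the executors instead of one-by-one in the metastore.
val fastStats = sc.parallelize(partitionDirs, partitionDirs.size).map { dir =>
  val path = new Path(dir)
  val fs = path.getFileSystem(new Configuration())
  val files = fs.listStatus(path).filter(_.isFile)
  (dir, files.length, files.map(_.getLen).sum)
}.collect()

// These fast stats would then be attached to the add-partition calls sent to the
// Hive metastore, so it does not have to re-list every directory itself.
{code}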



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException

2016-08-29 Thread Tomer Kaftan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1544#comment-1544
 ] 

Tomer Kaftan commented on SPARK-17110:
--

Hi Miao, with the fully default configuration and 2 slaves, all that is needed 
is to set the number of worker cores per slave to 1.

That is, putting the following in /spark/conf/spark-env.sh
{code}
SPARK_WORKER_CORES=1
{code}


More concretely (if you’re looking for exact steps), the way I start up the 
cluster & reproduce this example is as follows.

I use the Spark EC2 scripts from the PR here:
https://github.com/amplab/spark-ec2/pull/46

I launch the cluster on 2 r3.xlarge machines (but any machine should work, 
though you may need to change a later sed command):

{code}
./spark-ec2 -k ec2_ssh_key -i path_to_key_here -s 2 -t r3.xlarge launch 
temp-cluster --spot-price=1.00 --spark-version=2.0.0 --region=us-west-2 
--hadoop-major-version=yarn
{code}

I update the number of worker cores and launch the pyspark shell:
{code}
sed -i'f' 's/SPARK_WORKER_CORES=4/SPARK_WORKER_CORES=1/g' 
/root/spark/conf/spark-env.sh
~/spark-ec2/copy-dir ~/spark/conf/
~/spark/sbin/stop-all.sh
~/spark/sbin/start-all.sh
~/spark/bin/pyspark
{code}

And then I run the example I included at the start:
{code}
x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
x.count()

import time
def waitMap(x):
time.sleep(x)
return x

x.map(waitMap).count()
{code}

> Pyspark with locality ANY throw java.io.StreamCorruptedException
> 
>
> Key: SPARK-17110
> URL: https://issues.apache.org/jira/browse/SPARK-17110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Cluster of 2 AWS r3.xlarge slaves launched via ec2 
> scripts, Spark 2.0.0, hadoop: yarn, pyspark shell
>Reporter: Tomer Kaftan
>Priority: Critical
>
> In Pyspark 2.0.0, any task that accesses cached data non-locally throws a 
> StreamCorruptedException like the stacktrace below:
> {noformat}
> WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): 
> java.io.StreamCorruptedException: invalid stream header: 12010A80
> at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
> at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The simplest way I have found to reproduce this is by running the following 
> code in the pyspark shell, on a cluster of 2 slaves set to use only one 
> worker core each:
> {code}
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}
> Or by running the following via spark-submit:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}




[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446615#comment-15446615
 ] 

Apache Spark commented on SPARK-17289:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14865

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17289:


Assignee: (was: Apache Spark)

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17289:


Assignee: Apache Spark

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Apache Spark
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17298:
--
Priority: Minor  (was: Major)

Hm, aren't you suggesting that cartesian joins be _allowed_ when explicitly 
requested, regardless of the global flag? That's not the same as requiring this 
syntax, which probably isn't feasible as it would break things.

> Require explicit CROSS join for cartesian products
> --
>
> Key: SPARK-17298
> URL: https://issues.apache.org/jira/browse/SPARK-17298
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Srinath
>Priority: Minor
>
> Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame 
> API) to specify explicit cartesian products between relations.
> By cartesian product we mean a join between relations R and S where there is 
> no join condition involving columns from both R and S.
> If a cartesian product is detected in the absence of an explicit CROSS join, 
> an error must be thrown. Turning on the spark.sql.crossJoin.enabled 
> configuration flag will disable this check and allow cartesian products 
> without an explicit cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17298) Require explicit CROSS join for cartesian products

2016-08-29 Thread Srinath (JIRA)
Srinath created SPARK-17298:
---

 Summary: Require explicit CROSS join for cartesian products
 Key: SPARK-17298
 URL: https://issues.apache.org/jira/browse/SPARK-17298
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Srinath


Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) 
to specify explicit cartesian products between relations.
By cartesian product we mean a join between relations R and S where there is no 
join condition involving columns from both R and S.

If a cartesian product is detected in the absence of an explicit CROSS join, an 
error must be thrown. Turning on the spark.sql.crossJoin.enabled configuration 
flag will disable this check and allow cartesian products without an explicit 
cross join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

2016-08-29 Thread Pete Baker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pete Baker updated SPARK-17297:
---
Description: 
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}).  This is not the case for most timezones, where 
the offset from UTC changes by 1h for 6 months out of the year.  In that case, 
on the days that clocks go forward/back, one day is 23h long and one day is 25h long.

The result of this is that, around daylight saving time changes, some rows within 
1h of midnight are aggregated to the wrong day.

Options for a fix are:

# This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
# The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.
# Support for both absolute and relative window lengths may be added

  was:
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}).  This is not the case for most timezones, where 
the offset from UTC changes by 1h for 6 months out of the year.  In that case, 
on the days that clocks go forward/back, one day is 23h long and one day is 25h long.

The result of this is that, around daylight saving time changes, some rows within 
1h of midnight are aggregated to the wrong day.

Options for a fix are:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.
* Support for both absolute and relative window lengths may be added


> window function generates unexpected results due to startTime being relative 
> to UTC
> ---
>
> Key: SPARK-17297
> URL: https://issues.apache.org/jira/browse/SPARK-17297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Pete Baker
>
> In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
> slideDuration, String startTime)}} function {{startTime}} parameter behaves 
> as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect to see events from each day fall into the correct day's bucket.   This 
> doesn't happen as expected in every case, however, due to the way that this 
> feature and timestamp / timezone support interact. 
> Using a fixed UTC reference, there is an assumption that all days are the 
> same length ({{1 day === 24 h}}).  This is not the case for most timezones 
> where the offset from UTC changes by 1h for 6 months out of the year.  In 
> this case, on the days that clocks 

[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16240:
--
Shepherd: Joseph K. Bradley

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models because one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446391#comment-15446391
 ] 

Sean Owen commented on SPARK-17297:
---

I don't think there's an assumption about what a day is here, because underneath 
this is all done in terms of absolute time in microseconds (since the epoch). With 
a straight window() call you could not get aggregations whose windows vary slightly 
in length so that they start and end on the day boundaries of some calendar. I 
think you could propose a documentation update noting that, for example, "1 day" 
just means "24*60*60*1000*1000" microseconds, not a day according to a calendar.

There are other ways to get the effect you need, but they entail watching a 
somewhat larger window of time so that you always eventually find one batch 
interval that contains the whole day of data.
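
For illustration only, a simplified sketch of the fixed-length bucketing with 
{{startTime}} = 0 (my own example; it ignores the exact internal window expression):

{code}
val dayMicros = 24L * 60 * 60 * 1000 * 1000            // "1 day" as fixed microseconds
def bucketStart(tsMicros: Long): Long =
  (tsMicros / dayMicros) * dayMicros                   // floor to a UTC day boundary

// e.g. local 2016-06-15 00:30 in a UTC+1 zone is 2016-06-14 23:30 UTC,
// so it lands in the bucket that starts at 2016-06-14 00:00 UTC.
{code}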

> window function generates unexpected results due to startTime being relative 
> to UTC
> ---
>
> Key: SPARK-17297
> URL: https://issues.apache.org/jira/browse/SPARK-17297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Pete Baker
>
> In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
> slideDuration, String startTime)}} function {{startTime}} parameter behaves 
> as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect to see events from each day fall into the correct day's bucket.   This 
> doesn't happen as expected in every case, however, due to the way that this 
> feature and timestamp / timezone support interact. 
> Using a fixed UTC reference, there is an assumption that all days are the 
> same length ({{1 day === 24 h}}).  This is not the case for most timezones 
> where the offset from UTC changes by 1h for 6 months out of the year.  In 
> this case, on the days that clocks go forward/back, one day is 23h long, 1 
> day is 25h long.
> The result of this is that, for daylight savings time, some rows within 1h of 
> midnight are aggregated to the wrong day.
> Options for a fix are:
> * This is the expected behavior, and the window() function should not be used 
> for this type of aggregation with a long window length and the {{window}} 
> function documentation should be updated as such, or
> * The window function should respect timezones and work on the assumption 
> that {{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the 
> local timezone, rather than UTC.
> * Support for both absolute and relative window lengths may be added



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

2016-08-29 Thread Pete Baker (JIRA)
Pete Baker created SPARK-17297:
--

 Summary: window function generates unexpected results due to 
startTime being relative to UTC
 Key: SPARK-17297
 URL: https://issues.apache.org/jira/browse/SPARK-17297
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Pete Baker


In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}).  This is not the case for most timezones where 
the offset from UTC changes by 1h for 6 months out of the year.  In this case, 
on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.

The result of this is that, for daylight savings time, some rows within 1h of 
midnight are aggregated to the wrong day.

Either:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446371#comment-15446371
 ] 

Herman van Hovell commented on SPARK-17296:
---

I think you have found a bug in the parser. Your query produces the following 
(unresolved) LogicalPlan:
{noformat}
'Project [unresolvedalias('COUNT(1), None)]
+- 'Join Inner, ('T4.col = 'T1.col)
   :- 'Join Inner
   :  :- 'UnresolvedRelation `test`, T1
   :  +- 'Join Inner, ('T3.col = 'T1.col)
   : :- 'UnresolvedRelation `test`, T2
   : +- 'UnresolvedRelation `test`, T3
   +- 'UnresolvedRelation `test`, T4
{noformat}

Notice how the most nested Inner Join joins T2 and T3 using a join condition 
that references T1 (which is an unknown relation for that join).

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446382#comment-15446382
 ] 

Furcy Pin commented on SPARK-17296:
---

Yes. This is not critical though; a workaround is to invert the order of 
T1 and T2.
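
If I read the workaround correctly, it amounts to something like the following 
sketch (using the {{test}} table from the description):

{code}
spark.sql("""
  SELECT COUNT(1)
  FROM test T2
  CROSS JOIN test T1
  JOIN test T3
  ON T3.col = T1.col
  JOIN test T4
  ON T4.col = T1.col
""")
{code}

With T1 adjacent to the conditioned joins, the references to {{T1.col}} resolve 
even under the current parser grouping.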

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17297) window function generates unexpected results due to startTime being relative to UTC

2016-08-29 Thread Pete Baker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pete Baker updated SPARK-17297:
---
Description: 
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}).  This is not the case for most timezones where 
the offset from UTC changes by 1h for 6 months out of the year.  In this case, 
on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.

The result of this is that, for daylight savings time, some rows within 1h of 
midnight are aggregated to the wrong day.

Options for a fix are:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.
* Support for both absolute and relative window lengths may be added

  was:
In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
slideDuration, String startTime)}} function {{startTime}} parameter behaves as 
follows:

{quote}
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
start window intervals. For example, in order to have hourly tumbling windows 
that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
startTime as 15 minutes.
{quote}

Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
expect to see events from each day fall into the correct day's bucket.   This 
doesn't happen as expected in every case, however, due to the way that this 
feature and timestamp / timezone support interact. 

Using a fixed UTC reference, there is an assumption that all days are the same 
length ({{1 day === 24 h}}).  This is not the case for most timezones where 
the offset from UTC changes by 1h for 6 months out of the year.  In this case, 
on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.

The result of this is that, for daylight savings time, some rows within 1h of 
midnight are aggregated to the wrong day.

Either:

* This is the expected behavior, and the window() function should not be used 
for this type of aggregation with a long window length and the {{window}} 
function documentation should be updated as such, or
* The window function should respect timezones and work on the assumption that 
{{1 day !== 24 h}}.  The {{startTime}} should be updated to snap to the local 
timezone, rather than UTC.


> window function generates unexpected results due to startTime being relative 
> to UTC
> ---
>
> Key: SPARK-17297
> URL: https://issues.apache.org/jira/browse/SPARK-17297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Pete Baker
>
> In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String 
> slideDuration, String startTime)}} function {{startTime}} parameter behaves 
> as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect to see events from each day fall into the correct day's bucket.   This 
> doesn't happen as expected in every case, however, due to the way that this 
> feature and timestamp / timezone support interact. 
> Using a fixed UTC reference, there is an assumption that all days are the 
> same length ({{1 day === 24 h}}).  This is not the case for most timezones 
> where the offset from UTC changes by 1h for 6 months out of the year.  In 
> this case, on the days that clocks go forward/back, one day is 23h long, 1 
> day is 25h long.
> The result of this is 

[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446367#comment-15446367
 ] 

Sean Owen commented on SPARK-17296:
---

Pardon if I'm missing something, but you are not joining T3 with T1, so I don't 
think you can use T1.col in the join condition, right?

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS or one of the JOIN 
> causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446344#comment-15446344
 ] 

Takeshi Yamamuro commented on SPARK-17289:
--

Okay, I'll add tests and open a PR.

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446341#comment-15446341
 ] 

Herman van Hovell commented on SPARK-17289:
---

Looks good. Can you open a PR?

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446323#comment-15446323
 ] 

Takeshi Yamamuro commented on SPARK-17289:
--

This is probably because EnsureRequirements does not check if partial 
aggregation satisfies sort requirements.
We can fix this along the lines of:
https://github.com/apache/spark/compare/master...maropu:SPARK-17289#diff-cdb577e36041e4a27a605b6b3063fd54L167

cc: [~hvanhovell]

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15453) Improve join planning for bucketed / sorted tables

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446283#comment-15446283
 ] 

Apache Spark commented on SPARK-15453:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/14864

> Improve join planning for bucketed / sorted tables
> --
>
> Key: SPARK-15453
> URL: https://issues.apache.org/jira/browse/SPARK-15453
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
>
> The datasource API allows creation of bucketed and sorted tables, but joins 
> on such tables still do not utilize this metadata to produce an optimal query 
> plan.
> As shown below, the `Exchange` and `Sort` can be avoided if the tables are 
> known to be hashed + sorted on the relevant columns.
> {noformat}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> : :- INPUT
> : +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> : +- WholeStageCodegen
> ::  +- Project [j#20,k#21,i#22]
> :: +- Filter (isnotnull(k#21) && isnotnull(j#20))
> ::+- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> +- WholeStageCodegen
>:  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>: +- INPUT
>+- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>   +- WholeStageCodegen
>  :  +- Project [j#23,k#24,i#25]
>  : +- Filter (isnotnull(k#24) && isnotnull(j#23))
>  :+- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct
> {noformat}
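
For context, a minimal sketch of how such bucketed and sorted tables are created 
through the datasource API (assuming a SparkSession {{spark}} and a DataFrame 
{{df}} with columns {{i}}, {{j}}, {{k}}; the table names follow the plan above 
and the bucket count is arbitrary):

{code}
// Both sides are bucketed and sorted on the join keys; this is the metadata the
// join planner could use to drop the Exchange and Sort operators shown above.
df.write.bucketBy(8, "j", "k", "i").sortBy("j", "k", "i")
  .format("orc").saveAsTable("table7")
df.write.bucketBy(8, "j", "k", "i").sortBy("j", "k", "i")
  .format("orc").saveAsTable("table8")
{code}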



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16992) Pep8 code style

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446255#comment-15446255
 ] 

Apache Spark commented on SPARK-16992:
--

User 'Stibbons' has created a pull request for this issue:
https://github.com/apache/spark/pull/14863

> Pep8 code style
> ---
>
> Key: SPARK-16992
> URL: https://issues.apache.org/jira/browse/SPARK-16992
> Project: Spark
>  Issue Type: Improvement
>Reporter: Semet
>
> Add code style checks and auto-formatting to the Python code.
> Features:
> - add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
> while Python files use 4) for compatible editors (almost every editor has a 
> plugin that supports {{.editorconfig}} files)
> - use autopep8 to fix basic pep8 mistakes
> - use isort to automatically sort {{import}} statements and organise them into 
> logically linked order (see the isort docs). The most important point is that 
> it splits import statements that load more than one object into several lines, 
> and it keeps the imports sorted. In other words, for a given module import, the 
> line where it should be added is fixed. This increases the number of lines in 
> the file, but it greatly facilitates file maintenance and file merges if needed.
> - add a 'validate.sh' script in order to automate the correction (requires 
> isort and autopep8 to be installed)
> You can see a similar script in production in the 
> [Buildbot|https://github.com/buildbot/buildbot/blob/master/common/validate.sh]
>  project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-08-29 Thread Furcy Pin (JIRA)
Furcy Pin created SPARK-17296:
-

 Summary: Spark SQL: cross join + two joins = BUG
 Key: SPARK-17296
 URL: https://issues.apache.org/jira/browse/SPARK-17296
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Furcy Pin


In spark shell :

{code}
CREATE TABLE test (col INT) ;
INSERT OVERWRITE TABLE test VALUES (1), (2) ;

SELECT 
COUNT(1)
FROM test T1 
CROSS JOIN test T2
JOIN test T3
ON T3.col = T1.col
JOIN test T4
ON T4.col = T1.col
;
{code}

returns :

{code}
Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; line 
6 pos 12
{code}

Apparently, this example is minimal (removing the CROSS or one of the JOIN 
causes no issue).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17295:


Assignee: Apache Spark

> Create TestHiveSessionState use reflect logic based on the setting of 
> CATALOG_IMPLEMENTATION
> 
>
> Key: SPARK-17295
> URL: https://issues.apache.org/jira/browse/SPARK-17295
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>
> Currently we create a new `TestHiveSessionState` in `TestHive`, but in 
> `SparkSession` we create `SessionState`/`HiveSessionState` using reflection 
> based on the setting of CATALOG_IMPLEMENTATION. We should make the two 
> consistent, so that we can test the reflection logic of `SparkSession` in `TestHive`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION

2016-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446152#comment-15446152
 ] 

Apache Spark commented on SPARK-17295:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/14862

> Create TestHiveSessionState use reflect logic based on the setting of 
> CATALOG_IMPLEMENTATION
> 
>
> Key: SPARK-17295
> URL: https://issues.apache.org/jira/browse/SPARK-17295
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we create a new `TestHiveSessionState` in `TestHive`, but in 
> `SparkSession` we create `SessionState`/`HiveSessionState` using reflection 
> based on the setting of CATALOG_IMPLEMENTATION. We should make the two 
> consistent, so that we can test the reflection logic of `SparkSession` in `TestHive`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION

2016-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17295:


Assignee: (was: Apache Spark)

> Create TestHiveSessionState use reflect logic based on the setting of 
> CATALOG_IMPLEMENTATION
> 
>
> Key: SPARK-17295
> URL: https://issues.apache.org/jira/browse/SPARK-17295
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>
> Currently we create a new `TestHiveSessionState` in `TestHive`, but in 
> `SparkSession` we create `SessionState`/`HiveSessionState` using reflection 
> based on the setting of CATALOG_IMPLEMENTATION. We should make the two 
> consistent, so that we can test the reflection logic of `SparkSession` in `TestHive`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memory?

2016-08-29 Thread song fengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446142#comment-15446142
 ] 

song fengfei commented on SPARK-17291:
--

Thanks very much, this is the first time I have created an issue, and I didn't 
understand that “user@ is the place for questions” before! Thanks again.

> The shuffle data fetched based on netty were directly stored in off-memory?
> 
>
> Key: SPARK-17291
> URL: https://issues.apache.org/jira/browse/SPARK-17291
> Project: Spark
>  Issue Type: IT Help
>  Components: Shuffle
>Reporter: song fengfei
>
> If the shuffle data is stored in off-heap memory, doesn't that easily lead to OOM 
> when a map output is too big? Additionally, isn't it also outside of Spark's 
> memory management?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17295) Create TestHiveSessionState use reflect logic based on the setting of CATALOG_IMPLEMENTATION

2016-08-29 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-17295:


 Summary: Create TestHiveSessionState use reflect logic based on 
the setting of CATALOG_IMPLEMENTATION
 Key: SPARK-17295
 URL: https://issues.apache.org/jira/browse/SPARK-17295
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jiang Xingbo


Currently we create a new `TestHiveSessionState` in `TestHive`, but in 
`SparkSession` we create `SessionState`/`HiveSessionState` using reflection 
based on the setting of CATALOG_IMPLEMENTATION. We should make the two 
consistent, so that we can test the reflection logic of `SparkSession` in `TestHive`.
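
For reference, the user-facing side of that setting (a sketch assuming Spark 2.0; 
only the public builder API and the {{spark.sql.catalogImplementation}} key are used):

{code}
import org.apache.spark.sql.SparkSession

// enableHiveSupport() sets spark.sql.catalogImplementation to "hive", the
// CATALOG_IMPLEMENTATION setting that drives the reflective choice of
// SessionState vs HiveSessionState described above.
val spark = SparkSession.builder()
  .master("local")
  .enableHiveSupport()
  .getOrCreate()

println(spark.conf.get("spark.sql.catalogImplementation"))   // "hive"
{code}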



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memory?

2016-08-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446119#comment-15446119
 ] 

Sean Owen commented on SPARK-17291:
---

I replied on the other JIRA. Questions go to u...@spark.apache.org, and not to 
JIRA.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> The shuffle data fetched based on netty were directly stored in off-memory?
> 
>
> Key: SPARK-17291
> URL: https://issues.apache.org/jira/browse/SPARK-17291
> Project: Spark
>  Issue Type: IT Help
>  Components: Shuffle
>Reporter: song fengfei
>
> If the shuffle data is stored in off-heap memory, doesn't that easily lead to OOM 
> when a map output is too big? Additionally, isn't it also outside of Spark's 
> memory management?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17291) The shuffle data fetched based on netty were directly stored in off-memory?

2016-08-29 Thread song fengfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446110#comment-15446110
 ] 

song fengfei commented on SPARK-17291:
--

Yeah, they are the same, but neither was actually answered; instead they were both 
changed to resolved?
Is it not allowed to ask questions here?

> The shuffle data fetched based on netty were directly stored in off-memory?
> 
>
> Key: SPARK-17291
> URL: https://issues.apache.org/jira/browse/SPARK-17291
> Project: Spark
>  Issue Type: IT Help
>  Components: Shuffle
>Reporter: song fengfei
>
> If the shuffle data is stored in off-heap memory, doesn't that easily lead to OOM 
> when a map output is too big? Additionally, isn't it also outside of Spark's 
> memory management?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17294) Caching invalidates data on mildly wide dataframes

2016-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17294.
---
Resolution: Duplicate

Duplicate #5, popular issue

> Caching invalidates data on mildly wide dataframes
> --
>
> Key: SPARK-17294
> URL: https://issues.apache.org/jira/browse/SPARK-17294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kalle Jepsen
>
> Caching a dataframe with > 200 columns causes the data within to simply 
> vanish under certain circumstances.
> Consider the following code, where we create a one-row dataframe containing 
> the numbers from 0 to 200.
> {code}
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
> data=[rng]
> )
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> ++
> |greatest|
> ++
> | 200|
> ++
> {noformat}
> As expected column {{_201}} contains the number 200 and as expected the 
> greatest value within that single row is 200.
> Now if we introduce a {{.cache}} on {{df}}:
> {code}
> n_cols = 201
> rng = range(n_cols)
> df = spark.createDataFrame(
> data=[rng]
> ).cache()
> last = df.columns[-1]
> print(df.select(last).collect())
> df.select(F.greatest(*df.columns).alias('greatest')).show()
> {code}
> Returns:
> {noformat}
> [Row(_201=200)]
> ++
> |greatest|
> ++
> |   0|
> ++
> {noformat}
> the last column {{_201}} still seems to contain the correct value, but when I 
> try to select the greatest value within the row, 0 is returned. When I issue 
> {{.show()}} on the dataframe, all values will be zero. As soon as I limit the 
> columns on a number < 200, everything looks fine again.
> When the number of columns is < 200 from the beginning, even the cache will 
> not break things and everything works as expected.
> It doesn't matter whether the data is loaded from disk or created on the fly 
> and this happens in Spark 1.6.2 and 2.0.0 (haven't tested anything else).
> Can anyone confirm this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17294) Caching invalidates data on mildly wide dataframes

2016-08-29 Thread Kalle Jepsen (JIRA)
Kalle Jepsen created SPARK-17294:


 Summary: Caching invalidates data on mildly wide dataframes
 Key: SPARK-17294
 URL: https://issues.apache.org/jira/browse/SPARK-17294
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0, 1.6.2
Reporter: Kalle Jepsen


Caching a dataframe with > 200 columns causes the data within to simply vanish 
under certain circumstances.

Consider the following code, where we create a one-row dataframe containing the 
numbers from 0 to 200.

{code}
from pyspark.sql import functions as F  # needed for F.greatest below

n_cols = 201
rng = range(n_cols)
df = spark.createDataFrame(
    data=[rng]
)

last = df.columns[-1]
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
{code}

Returns:

{noformat}
[Row(_201=200)]

++
|greatest|
++
| 200|
++
{noformat}

As expected column {{_201}} contains the number 200 and as expected the 
greatest value within that single row is 200.

Now if we introduce a {{.cache}} on {{df}}:

{code}
n_cols = 201
rng = range(n_cols)
df = spark.createDataFrame(
    data=[rng]
).cache()

last = df.columns[-1]
print(df.select(last).collect())
df.select(F.greatest(*df.columns).alias('greatest')).show()
{code}

Returns:

{noformat}
[Row(_201=200)]

++
|greatest|
++
|   0|
++
{noformat}

the last column {{_201}} still seems to contain the correct value, but when I 
try to select the greatest value within the row, 0 is returned. When I issue 
{{.show()}} on the dataframe, all values will be zero. As soon as I limit the 
columns to a number < 200, everything looks fine again.

When the number of columns is < 200 from the beginning, even the cache will not 
break things and everything works as expected.

It doesn't matter whether the data is loaded from disk or created on the fly 
and this happens in Spark 1.6.2 and 2.0.0 (haven't tested anything else).

Can anyone confirm this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446059#comment-15446059
 ] 

Takeshi Yamamuro commented on SPARK-17289:
--

yea, I'll check this.

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446013#comment-15446013
 ] 

Herman van Hovell commented on SPARK-17289:
---

cc [~maropu]

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446015#comment-15446015
 ] 

Herman van Hovell commented on SPARK-17289:
---

[~clockfly] Are you working on this one?

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17289:
--
Priority: Blocker  (was: Major)

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Priority: Blocker
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17293) separate view handling from CreateTableCommand

2016-08-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-17293.
---
Resolution: Invalid

sorry my mistake

> separate view handling from CreateTableCommand
> --
>
> Key: SPARK-17293
> URL: https://issues.apache.org/jira/browse/SPARK-17293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17293) separate view handling from CreateTableCommand

2016-08-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17293:
---

 Summary: separate view handling from CreateTableCommand
 Key: SPARK-17293
 URL: https://issues.apache.org/jira/browse/SPARK-17293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-29 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17289:
---
Description: 
For the following query:

{code}
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
"b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
{code}

Now, the SortAggregator won't insert a Sort operator before partial aggregation, 
which breaks sort-based partial aggregation.

{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- LocalTableScan [a#5, b#6]
{code}

In Spark 2.0 branch, the plan is:
{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- *Sort [a#5 ASC], false, 0
+- LocalTableScan [a#5, b#6]
{code}

This is related to SPARK-12978

  was:
For the following query:

{code}
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
"b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
{code}

Now, the SortAggregator won't insert a Sort operator before partial aggregation, 
which breaks sort-based partial aggregation.

{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- LocalTableScan [a#5, b#6]
{code}

In Spark 2.0 branch, the plan is:
{code}
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
  +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
max#19])
 +- *Sort [a#5 ASC], false, 0
+- LocalTableScan [a#5, b#6]
{code}

This is related with SPARK-12978


> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert Sort operator before partial 
> aggregation, this will break sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-29 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15445873#comment-15445873
 ] 

Vincent commented on SPARK-17219:
-

Cool. I will refine the patch. thanks [~srowen] :)

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?
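
For reference, a minimal sketch of the kind of run that produces such splits 
(assuming Spark ML 1.6/2.0 and a DataFrame {{df}} with a Double column {{age}} in 
which nulls were replaced by {{Double.NaN}}):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

// fit() returns a Bucketizer; its splits array is where the NaN entries appear.
val bucketizer = discretizer.fit(df)
println(bucketizer.getSplits.mkString(", "))
{code}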



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


