[jira] [Resolved] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero

2016-10-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-16559.
-
Resolution: Fixed

> Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
> 
>
> Key: SPARK-16559
> URL: https://issues.apache.org/jira/browse/SPARK-16559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> Got a run-time java.lang.ArithmeticException when num of buckets is set to 
> zero.
> For example,
> {noformat}
> CREATE TABLE t USING PARQUET
> OPTIONS (PATH '${path.toString}')
> CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
> AS SELECT 1 AS a, 2 AS b
> {noformat}
> The exception we got is
> {noformat}
> ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 2)
> java.lang.ArithmeticException: / by zero
> {noformat}
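For context, the crash is easy to reproduce outside Spark: bucket assignment is ultimately a modulo (or division) by the bucket count, so zero buckets fails on the first row. The snippet below is a minimal stand-alone sketch; the hash-modulo assignment is an assumption for illustration, not the exact Spark code path.

{code}
// Hypothetical, simplified bucket assignment: anything of the form
// hash(value) % numBuckets blows up as soon as numBuckets is 0.
def bucketIdFor(value: Any, numBuckets: Int): Int =
  math.abs(value.hashCode) % numBuckets

bucketIdFor((1, 2), 4)  // fine: some id in [0, 4)
bucketIdFor((1, 2), 0)  // java.lang.ArithmeticException: / by zero
{code}

Validating the bucket count up front at analysis time would turn this executor-side crash into a clear user-facing error, which is presumably the shape of the fix.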






[jira] [Updated] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18003:
---
Description: 
RDD zipWithIndex generates wrong results when one partition contains more than 
Int.MaxValue records.

When an RDD contains a partition with more than 2147483647 records, the 
generated indices overflow.
For example, if partition-0 has more than 2147483647 records, the indices become:
0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...

Operations such as repartition or coalesce can easily produce such a large 
partition, so this bug should be fixed.


  was:
RDD zipWithIndex generates wrong results when one partition contains more than 
Int.MaxValue records.

When an RDD contains a partition with more than 2147483647 records, the 
generated indices overflow.
For example, if partition-0 has more than 2147483647 records, the indices become:
0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...



> RDD zipWithIndex generate wrong result when one partition contains more than 
> 2147483647 records.
> 
>
> Key: SPARK-18003
> URL: https://issues.apache.org/jira/browse/SPARK-18003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDD zipWithIndex generates wrong results when one partition contains more than 
> Int.MaxValue records.
> When an RDD contains a partition with more than 2147483647 records, the 
> generated indices overflow.
> For example, if partition-0 has more than 2147483647 records, the indices 
> become:
> 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...
> Operations such as repartition or coalesce can easily produce such a large 
> partition, so this bug should be fixed.
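The pattern of indices above is plain 32-bit overflow. A minimal sketch, assuming the per-partition counter behaves like Scala's {{Iterator.zipWithIndex}} (whose index is an Int); the exact Spark code path may differ:

{code}
// Int arithmetic wraps past Int.MaxValue, matching the reported sequence.
val top: Int = Int.MaxValue    //  2147483647
println(top + 1)               // -2147483648
println(top + 2)               // -2147483647

// Shape of an overflow-free counter (sketch only, not the actual patch):
def zipWithLongIndex[T](iter: Iterator[T], start: Long): Iterator[(T, Long)] = {
  var i = start - 1
  iter.map { t => i += 1; (t, i) }
}
{code}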






[jira] [Commented] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero

2016-10-18 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587713#comment-15587713
 ] 

Xiao Li commented on SPARK-16559:
-

Yeah, resolved.

> Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
> 
>
> Key: SPARK-16559
> URL: https://issues.apache.org/jira/browse/SPARK-16559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> Got a run-time java.lang.ArithmeticException when num of buckets is set to 
> zero.
> For example,
> {noformat}
> CREATE TABLE t USING PARQUET
> OPTIONS (PATH '${path.toString}')
> CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
> AS SELECT 1 AS a, 2 AS b
> {noformat}
> The exception we got is
> {noformat}
> ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 2)
> java.lang.ArithmeticException: / by zero
> {noformat}






[jira] [Assigned] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero

2016-10-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-16559:
---

Assignee: Xiao Li

> Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
> 
>
> Key: SPARK-16559
> URL: https://issues.apache.org/jira/browse/SPARK-16559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> Got a run-time java.lang.ArithmeticException when num of buckets is set to 
> zero.
> For example,
> {noformat}
> CREATE TABLE t USING PARQUET
> OPTIONS (PATH '${path.toString}')
> CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
> AS SELECT 1 AS a, 2 AS b
> {noformat}
> The exception we got is
> {noformat}
> ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 2)
> java.lang.ArithmeticException: / by zero
> {noformat}






[jira] [Commented] (SPARK-16559) Got java.lang.ArithmeticException when Num of Buckets is Set to Zero

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587711#comment-15587711
 ] 

Dongjoon Hyun commented on SPARK-16559:
---

Hi, [~smilegator].
It seems to have been resolved by your PR, hasn't it?

> Got java.lang.ArithmeticException when Num of Buckets is Set to Zero
> 
>
> Key: SPARK-16559
> URL: https://issues.apache.org/jira/browse/SPARK-16559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Minor
>
> Got a run-time java.lang.ArithmeticException when num of buckets is set to 
> zero.
> For example,
> {noformat}
> CREATE TABLE t USING PARQUET
> OPTIONS (PATH '${path.toString}')
> CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
> AS SELECT 1 AS a, 2 AS b
> {noformat}
> The exception we got is
> {noformat}
> ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 2)
> java.lang.ArithmeticException: / by zero
> {noformat}






[jira] [Assigned] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18003:


Assignee: Apache Spark

> RDD zipWithIndex generate wrong result when one partition contains more than 
> 2147483647 records.
> 
>
> Key: SPARK-18003
> URL: https://issues.apache.org/jira/browse/SPARK-18003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDD zipWithIndex generates wrong results when one partition contains more than 
> Int.MaxValue records.
> When an RDD contains a partition with more than 2147483647 records, the 
> generated indices overflow.
> For example, if partition-0 has more than 2147483647 records, the indices 
> become:
> 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...






[jira] [Assigned] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18003:


Assignee: (was: Apache Spark)

> RDD zipWithIndex generate wrong result when one partition contains more than 
> 2147483647 records.
> 
>
> Key: SPARK-18003
> URL: https://issues.apache.org/jira/browse/SPARK-18003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDD zipWithIndex generates wrong results when one partition contains more than 
> Int.MaxValue records.
> When an RDD contains a partition with more than 2147483647 records, the 
> generated indices overflow.
> For example, if partition-0 has more than 2147483647 records, the indices 
> become:
> 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...






[jira] [Commented] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587706#comment-15587706
 ] 

Apache Spark commented on SPARK-18003:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15550

> RDD zipWithIndex generate wrong result when one partition contains more than 
> 2147483647 records.
> 
>
> Key: SPARK-18003
> URL: https://issues.apache.org/jira/browse/SPARK-18003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDD zipWithIndex generates wrong results when one partition contains more than 
> Int.MaxValue records.
> When an RDD contains a partition with more than 2147483647 records, the 
> generated indices overflow.
> For example, if partition-0 has more than 2147483647 records, the indices 
> become:
> 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...






[jira] [Updated] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18003:
---
Component/s: Spark Core

> RDD zipWithIndex generate wrong result when one partition contains more than 
> 2147483647 records.
> 
>
> Key: SPARK-18003
> URL: https://issues.apache.org/jira/browse/SPARK-18003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> RDD zipWithIndex generates wrong results when one partition contains more than 
> Int.MaxValue records.
> When an RDD contains a partition with more than 2147483647 records, the 
> generated indices overflow.
> For example, if partition-0 has more than 2147483647 records, the indices 
> become:
> 0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...






[jira] [Created] (SPARK-18003) RDD zipWithIndex generate wrong result when one partition contains more than 2147483647 records.

2016-10-18 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-18003:
--

 Summary: RDD zipWithIndex generate wrong result when one partition 
contains more than 2147483647 records.
 Key: SPARK-18003
 URL: https://issues.apache.org/jira/browse/SPARK-18003
 Project: Spark
  Issue Type: Bug
Reporter: Weichen Xu


RDD zipWithIndex generates wrong results when one partition contains more than 
Int.MaxValue records.

When an RDD contains a partition with more than 2147483647 records, the 
generated indices overflow.
For example, if partition-0 has more than 2147483647 records, the indices become:
0, 1, ..., 2147483647, -2147483648, -2147483647, -2147483646, ...







[jira] [Commented] (SPARK-17630) jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available for akka

2016-10-18 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587664#comment-15587664
 ] 

Mario Briggs commented on SPARK-17630:
--

[~zsxwing] thanks much. Any pointers on how/where to add the code, or something 
existing in the code base to look at? I can then try a PR.
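Not an authoritative pointer, but the mechanism being asked about is usually a default uncaught-exception handler that escalates fatal errors to a JVM exit. A bare-bones sketch follows; the object name and exit code are made up, and where to install it on the Netty RPC threads is exactly the open question.

{code}
// Generic JVM-level sketch: halt the process on fatal errors such as
// OutOfMemoryError so a caller like Py4J sees the process die instead of hanging.
object ExitOnFatalError {
  def install(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = e match {
        case fatal: Error =>
          System.err.println(s"Fatal error in thread ${t.getName}, exiting JVM")
          fatal.printStackTrace(System.err)
          Runtime.getRuntime.halt(50)  // arbitrary, illustrative exit code
        case other =>
          other.printStackTrace(System.err)
      }
    })
  }
}
{code}

If memory serves, Spark already ships an executor-side handler along these lines ({{org.apache.spark.util.SparkUncaughtExceptionHandler}}); checking whether the Netty RPC dispatcher threads install it would be a natural place to start.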

> jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available 
> for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: SecondCodePath.txt, firstCodepath.txt
>
>
> Hi,
> I have two code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so the caller (Py4J) gets notified with a proper stack trace. 
> Attached stack-trace file (firstCodepath.txt)
> In the second code path (rpc.netty), no such handler kicks in to shut down 
> the JVM, so the caller does not get notified. 
> Attached stack-trace file (SecondCodepath.txt)
> Is it possible to have a JVM exit handler for the rpc.netty path?
>  






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587659#comment-15587659
 ] 

Jiang Xingbo commented on SPARK-17982:
--

You are right, I just made a mistake. Sorry about that!

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM:
--

[~holdenk] Finally getting time to look at this, so I am starting small: I made 
the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] 
to local[4], per the documentation here 
(http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version).
I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and 
everything appears to have worked.

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and 
everything appears to have worked as well.

See the attachments and let me know whether this is the right process for 
running single unit tests; I'll then start making changes to the other suites. 
How would you like to see the output: should I just attach the results, or open 
a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there 
flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an 
SSD the builds shouldn't take this long :(.


Let me know the next steps to get this into a PR.






was (Author: kanjilal):
[~holdenk] finally getting time to look at this, so I am starting small, I made 
the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] 
tp local[4], per the documentation here 
(http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
 I ran mvn -P hadoop2 -Dsuites= org.apache.spark.HeartbeatReceiverSuite test--- 
looks like everything worked

I then ran mvn -P hadoop2 -Dsuites= org.apache.spark.ContextCleanerSuite test-- 
looks like everything worked as well

See the attachments and let me know if this is this is the right process to run 
single unit tests, if not I'll start making changes to the other Suites , how 
would you like to see the output, should I just have attachments or just do a 
pull request from the new branch that I created?
Thanks

PS
Another question, running single unit tests like this takes forever, are there 
flags I can set to speed up the builds, even on my 15 inch macbook pro with SSD 
the builds shouldnt take this long :(.  


Let me know next steps to get this into a PR.





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM:
--

[~holdenk] Finally getting time to look at this, so I am starting small: I made 
the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] 
to local[4], per the documentation here 
(http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version).
I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and 
everything appears to have worked.

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and 
everything appears to have worked as well.

See the attachments and let me know whether this is the right process for 
running single unit tests; I'll then start making changes to the other suites. 
How would you like to see the output: should I just attach the results, or open 
a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there 
flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an 
SSD the builds shouldn't take this long :(.


Let me know the next steps to get this into a PR.






was (Author: kanjilal):
[~holdenk] finally getting time to look at this, so I am starting small, I made 
the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] 
tp local[4], per the documentation here 
(http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
 I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test--- 
looks like everything worked

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test-- 
looks like everything worked as well

See the attachments and let me know if this is not the right process to run 
single unit tests, if not I'll start making changes to the other Suites , how 
would you like to see the output, should I just have attachments or just do a 
pull request from the new branch that I created?
Thanks

PS
Another question, running single unit tests like this takes forever, are there 
flags I can set to speed up the builds, even on my 15 inch macbook pro with SSD 
the builds shouldnt take this long :(.  


Let me know next steps to get this into a PR.





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~holdenk] Finally getting time to look at this, so I am starting small: I made 
the change inside ContextCleanerSuite and HeartbeatReceiverSuite from local[2] 
to local[4], per the documentation here 
(http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version).
I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and 
everything appears to have worked.

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and 
everything appears to have worked as well.

See the attachments and let me know whether this is the right process for 
running single unit tests; I'll then start making changes to the other suites. 
How would you like to see the output: should I just attach the results, or open 
a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there 
flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an 
SSD the builds shouldn't take this long :(.


Let me know the next steps to get this into a PR.
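For reference, the change under discussion is just the master string used by the test SparkContext. A self-contained sketch (the app name is illustrative, not the actual suite code):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use the same parallelism (local[4]) that the PySpark tests use, so results
// that depend on partition ids (e.g. random number generation) line up.
val conf = new SparkConf()
  .setAppName("local4-demo")
  .setMaster("local[4]")   // previously local[2] in some Scala suites
val sc = new SparkContext(conf)
try {
  // With 4 local threads the default parallelism is 4, matching Python.
  assert(sc.parallelize(1 to 100).getNumPartitions == 4)
} finally {
  sc.stop()
}
{code}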





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587625#comment-15587625
 ] 

Apache Spark commented on SPARK-17985:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15548

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> {{SerializationUtils.clone()}} in commons-lang3 (< 3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck due to a race condition while 
> initializing an internal hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.
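For illustration, the usage pattern that could hit the pre-3.5 bug looks like the sketch below; it only shows the shape of the problem (concurrent clone() calls) and will not reliably reproduce the hang.

{code}
import org.apache.commons.lang3.SerializationUtils

case class Conf(name: String, value: Int) extends Serializable

// SerializationUtils.clone() deep-copies via Java serialization. In
// commons-lang3 < 3.5, concurrent first calls could race while initializing
// an internal lookup map and get stuck (LANG-1251); 3.5 fixes this.
val threads = (1 to 8).map { i =>
  new Thread(new Runnable {
    override def run(): Unit = {
      val copy = SerializationUtils.clone(Conf(s"t$i", i))
      require(copy == Conf(s"t$i", i))
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}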






[jira] [Commented] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587624#comment-15587624
 ] 

Apache Spark commented on SPARK-18002:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15547

> Prune unnecessary IsNotNull predicates from Filter
> --
>
> Key: SPARK-18002
> URL: https://issues.apache.org/jira/browse/SPARK-18002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the 
> predicate in IsNotNull is not nullable.






[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikat Kanjilal updated SPARK-9487:
---
Attachment: ContextCleanerSuiteResults

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Assigned] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18002:


Assignee: Apache Spark

> Prune unnecessary IsNotNull predicates from Filter
> --
>
> Key: SPARK-18002
> URL: https://issues.apache.org/jira/browse/SPARK-18002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the 
> predicate in IsNotNull is not nullable.






[jira] [Assigned] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18002:


Assignee: (was: Apache Spark)

> Prune unnecessary IsNotNull predicates from Filter
> --
>
> Key: SPARK-18002
> URL: https://issues.apache.org/jira/browse/SPARK-18002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In PruneFilters rule, we can prune unnecessary IsNotNull predicates if the 
> predicate in IsNotNull is not nullable.






[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikat Kanjilal updated SPARK-9487:
---
Attachment: HeartbeatReceiverSuiteResults

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Resolved] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18001.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Broke link to R DataFrame In sql-programming-guide 
> ---
>
> Key: SPARK-18001
> URL: https://issues.apache.org/jira/browse/SPARK-18001
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Tommy Yu
>Priority: Trivial
> Fix For: 2.0.2, 2.1.0
>
>
> In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
> "Untyped Dataset Operations (aka DataFrame Operations)"
> The link to the R docs doesn't work; it returns "The requested URL 
> /docs/latest/api/R/DataFrame.html was not found on this server."






[jira] [Updated] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18001:

Assignee: Tommy Yu

> Broke link to R DataFrame In sql-programming-guide 
> ---
>
> Key: SPARK-18001
> URL: https://issues.apache.org/jira/browse/SPARK-18001
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Tommy Yu
>Assignee: Tommy Yu
>Priority: Trivial
> Fix For: 2.0.2, 2.1.0
>
>
> In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
> "Untyped Dataset Operations (aka DataFrame Operations)"
> The link to the R docs doesn't work; it returns "The requested URL 
> /docs/latest/api/R/DataFrame.html was not found on this server."






[jira] [Created] (SPARK-18002) Prune unnecessary IsNotNull predicates from Filter

2016-10-18 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-18002:
---

 Summary: Prune unnecessary IsNotNull predicates from Filter
 Key: SPARK-18002
 URL: https://issues.apache.org/jira/browse/SPARK-18002
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


In the PruneFilters rule, we can prune unnecessary IsNotNull predicates if the 
child expression of the IsNotNull is not nullable.
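A minimal example of the kind of predicate this rule targets (a sketch of the intended behaviour, not the optimizer code itself):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("prune-demo").getOrCreate()
import spark.implicits._

// 'id' from range() is a non-nullable column, so IsNotNull(id) is always true
// and the proposed rule can drop it, leaving only (id > 5) in the Filter.
val df = spark.range(10).toDF("id")
val filtered = df.filter($"id".isNotNull && $"id" > 5)
filtered.explain(true)
{code}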






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587583#comment-15587583
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

Hi, [~tafra...@gmail.com].
I made a PR for you. It handles your case, too.
{code}
scala> sql("create table `where`(id int, name int)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")
res2: org.apache.spark.sql.DataFrame = []
{code}
I just simplified your case because it contains multiple test cases.

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587579#comment-15587579
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

Hi, [~jiangxb].
I think Spark supports that, like MySQL does 
(http://dev.mysql.com/doc/refman/5.7/en/create-view.html).
{code}
scala> sql("SELECT id2 FROM v1")
res0: org.apache.spark.sql.DataFrame = [id2: int]
{code}



> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Assigned] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17982:


Assignee: (was: Apache Spark)

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587573#comment-15587573
 ] 

Apache Spark commented on SPARK-17982:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15546

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587574#comment-15587574
 ] 

Dongjoon Hyun commented on SPARK-17892:
---

Sorry. It's my mistake.

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.0.2
>
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.






[jira] [Assigned] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17982:


Assignee: Apache Spark

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>Assignee: Apache Spark
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587567#comment-15587567
 ] 

Apache Spark commented on SPARK-17892:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15546

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.0.2
>
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.






[jira] [Assigned] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17999:


Assignee: (was: Apache Spark)

> Add getPreferredLocations for KafkaSourceRDD
> 
>
> Key: SPARK-17999
> URL: https://issues.apache.org/jira/browse/SPARK-17999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Priority: Minor
>
> The newly implemented Structured Streaming KafkaSource already calculates the 
> preferred locations for each topic partition, but doesn't expose this 
> information through RDD's {{getPreferredLocations}} method. So here we propose 
> to add this method to {{KafkaSourceRDD}}.
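For readers unfamiliar with the hook, {{getPreferredLocations}} is an overridable method on RDD that reports where each partition's data prefers to run. A generic sketch is below; the wrapper class and the {{hostFor}} function are hypothetical and are not the KafkaSourceRDD implementation.

{code}
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Pass-through RDD that only adds locality hints via getPreferredLocations.
class LocatedRDD[T: ClassTag](prev: RDD[T], hostFor: Partition => Option[String])
  extends RDD[T](prev) {

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  // The hook SPARK-17999 proposes to implement for KafkaSourceRDD: tell the
  // scheduler which host each partition prefers (e.g. a matching Kafka consumer).
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hostFor(split).toSeq
}
{code}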






[jira] [Assigned] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17999:


Assignee: Apache Spark

> Add getPreferredLocations for KafkaSourceRDD
> 
>
> Key: SPARK-17999
> URL: https://issues.apache.org/jira/browse/SPARK-17999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> The newly implemented Structured Streaming KafkaSource did calculate the 
> preferred locations for each topic partition, but didn't offer this 
> information through RDD's {{getPreferredLocations}} method. So here we propose 
> to add this method in {{KafkaSourceRDD}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17997:


Assignee: (was: Apache Spark)

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17997:


Assignee: Apache Spark

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587548#comment-15587548
 ] 

Apache Spark commented on SPARK-17997:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15544

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18001:


Assignee: Apache Spark

> Broke link to R DataFrame In sql-programming-guide 
> ---
>
> Key: SPARK-18001
> URL: https://issues.apache.org/jira/browse/SPARK-18001
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Tommy Yu
>Assignee: Apache Spark
>Priority: Trivial
>
> In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
> "Untyped Dataset Operations (aka DataFrame Operations)"
> The link to R doesn't work; it returns: 
> The requested URL /docs/latest/api/R/DataFrame.html was not found on this 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18001:


Assignee: (was: Apache Spark)

> Broke link to R DataFrame In sql-programming-guide 
> ---
>
> Key: SPARK-18001
> URL: https://issues.apache.org/jira/browse/SPARK-18001
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Tommy Yu
>Priority: Trivial
>
> In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
> "Untyped Dataset Operations (aka DataFrame Operations)"
> The link to R doesn't work; it returns: 
> The requested URL /docs/latest/api/R/DataFrame.html was not found on this 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587539#comment-15587539
 ] 

Apache Spark commented on SPARK-18001:
--

User 'Wenpei' has created a pull request for this issue:
https://github.com/apache/spark/pull/15543

> Broke link to R DataFrame In sql-programming-guide 
> ---
>
> Key: SPARK-18001
> URL: https://issues.apache.org/jira/browse/SPARK-18001
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Tommy Yu
>Priority: Trivial
>
> In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
> "Untyped Dataset Operations (aka DataFrame Operations)"
> The link to R doesn't work; it returns: 
> The requested URL /docs/latest/api/R/DataFrame.html was not found on this 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name

2016-10-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17873.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

This issue has been resolved by https://github.com/apache/spark/pull/15434. 

> ALTER TABLE ... RENAME TO ... should allow users to specify database in 
> destination table name
> --
>
> Key: SPARK-17873
> URL: https://issues.apache.org/jira/browse/SPARK-17873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587530#comment-15587530
 ] 

bo song commented on SPARK-17987:
-

There is a common case in cross validation: suppose f1 is a categorical 
predictor whose categories are (0,1,2,3,4). As you all know, cross validation 
splits the data into training and testing sets randomly; suppose the training 
data contains only (0,1,2) for f1. When the testing data asks for predictions on 
(3,4), almost all algorithms could produce null predictions for this case. 

I would like to introduce an option into Spark. Its default behavior would still 
be to throw an exception for missing/null values, but the caller could change it 
to exclude missing values explicitly, knowing the changes/risks and wanting a 
result instead of an exception.
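
For illustration, a minimal caller-side sketch of that opt-in behavior, assuming a 
{{predictions}} DataFrame with "label" and "prediction" columns:
{code}
// Drop rows whose prediction is null or NaN before evaluating, instead of failing.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.functions.{col, isnan}

val cleaned = predictions.filter(col("prediction").isNotNull && !isnan(col("prediction")))

val rmse = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
  .evaluate(cleaned)
{code}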


> ML Evaluator fails to handle null values in the dataset
> ---
>
> Key: SPARK-17987
> URL: https://issues.apache.org/jira/browse/SPARK-17987
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.2, 2.0.1
>Reporter: bo song
>
> Take the RegressionEvaluator as an example: when the predictionCol is null in 
> a row, an exception "scala.MatchError" will be thrown. A missing (null) 
> prediction is a common case; for example, when a predictor is missing or its 
> value is out of bounds, most machine learning models cannot produce a correct 
> prediction, so null predictions are returned. Evaluators 
> should handle the null values instead of throwing an exception; the common way 
> to handle missing null values is to ignore them. Besides the null value, 
> the NaN value needs to be handled correctly too. 
> The three evaluators RegressionEvaluator, BinaryClassificationEvaluator and 
> MulticlassClassificationEvaluator all have the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name

2016-10-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17873:
-
Assignee: Wenchen Fan  (was: Apache Spark)

> ALTER TABLE ... RENAME TO ... should allow users to specify database in 
> destination table name
> --
>
> Key: SPARK-17873
> URL: https://issues.apache.org/jira/browse/SPARK-17873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18001) Broke link to R DataFrame In sql-programming-guide

2016-10-18 Thread Tommy Yu (JIRA)
Tommy Yu created SPARK-18001:


 Summary: Broke link to R DataFrame In sql-programming-guide 
 Key: SPARK-18001
 URL: https://issues.apache.org/jira/browse/SPARK-18001
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Tommy Yu
Priority: Trivial


In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section 
"Untyped Dataset Operations (aka DataFrame Operations)"

The link to R doesn't work; it returns: 
The requested URL /docs/latest/api/R/DataFrame.html was not found on this 
server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17074) generate histogram information for column

2016-10-18 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17074:
-
Description: 
We support two kinds of histograms: 
-   Equi-width histogram: We have a fixed width for each column interval in 
the histogram.  The height of a histogram represents the frequency for those 
column values in a specific interval.  For this kind of histogram, its height 
varies for different column intervals. We use the equi-width histogram when the 
number of distinct values is less than 254.
-   Equi-height histogram: For this histogram, the width of column interval 
varies.  The heights of all column intervals are the same.  The equi-height 
histogram is effective in handling skewed data distribution. We use the equi-height 
histogram when the number of distinct values is equal to or greater than 
254.  

We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
(for both numeric and string types) or endpoints of equi-height histograms (for 
numeric type only). Then, if we get endpoints of an equi-height histogram, we 
need to compute ndv's between those endpoints by [SPARK-17997] to form the 
equi-height histogram.

This Jira incorporates three Jiras mentioned above to support needed 
aggregation functions. We need to resolve them before this one.

  was:
We support two kinds of histograms: 
-   Equi-width histogram: We have a fixed width for each column interval in 
the histogram.  The height of a histogram represents the frequency for those 
column values in a specific interval.  For this kind of histogram, its height 
varies for different column intervals. We use the equi-width histogram when the 
number of distinct values is less than 254.
-   Equi-height histogram: For this histogram, the width of column interval 
varies.  The heights of all column intervals are the same.  The equi-height 
histogram is effective in handling skewed data distribution. We use the equi- 
height histogram when the number of distinct values is equal to or greater than 
254.  



> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the equi-height 
> histogram when the number of distinct values is equal to or greater 
> than 254.  
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
> (for both numeric and string types) or endpoints of equi-height histograms 
> (for numeric type only). Then, if we get endpoints of an equi-height 
> histogram, we need to compute ndv's between those endpoints by [SPARK-17997] 
> to form the equi-height histogram.
> This Jira incorporates three Jiras mentioned above to support needed 
> aggregation functions. We need to resolve them before this one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17984) Add support for numa aware feature

2016-10-18 Thread quanfuwang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587472#comment-15587472
 ] 

quanfuwang commented on SPARK-17984:


Thanks. Yes, I'm considering not bringing in the dependency on numactl, but 
finding a more general way instead.

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This Jira targets adding support for a NUMA-aware feature, which can help improve 
> performance by making cores access local memory rather than remote memory. 
>  A patch is being developed, see https://github.com/apache/spark/pull/15524.
> And the whole task includes 3 subtasks and will be developed iteratively:
> Numa aware support for Yarn based deployment mode
> Numa aware support for Mesos based deployment mode
> Numa aware support for Standalone based deployment mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18000) Aggregation function for computing endpoints for numeric histograms

2016-10-18 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-18000:


 Summary: Aggregation function for computing endpoints for numeric 
histograms
 Key: SPARK-18000
 URL: https://issues.apache.org/jira/browse/SPARK-18000
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhenhua Wang


For a column of numeric type (including date and timestamp), we will generate an 
equi-width or equi-height histogram, depending on whether its ndv is larger than the 
maximum number of bins allowed in one histogram (denoted as numBins).
This agg function computes values and their frequencies using a small hashmap, 
whose size is less than or equal to "numBins", and returns an equi-width 
histogram. 
When the size of the hashmap exceeds "numBins", it clears the hashmap and utilizes 
ApproximatePercentile to return the endpoints of an equi-height histogram.
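
A minimal sketch of that switching behaviour in plain Scala (illustration only; the real 
agg function would feed ApproximatePercentile instead of buffering every value as done here):
{code}
import scala.collection.mutable

class HistogramSketch(numBins: Int) {
  private val freq = mutable.HashMap.empty[Double, Long]    // equi-width candidate
  private val values = mutable.ArrayBuffer.empty[Double]    // stand-in for the percentile sketch
  private var equiWidth = true

  def update(v: Double): Unit = {
    values += v
    if (equiWidth) {
      freq(v) = freq.getOrElse(v, 0L) + 1L
      // Too many distinct values: give up on exact counts and fall back to endpoints.
      if (freq.size > numBins) { equiWidth = false; freq.clear() }
    }
  }

  /** Left: value -> frequency (equi-width). Right: numBins + 1 endpoints (equi-height). */
  def result(): Either[Map[Double, Long], Seq[Double]] = {
    if (equiWidth) Left(freq.toMap)
    else {
      val sorted = values.sorted
      Right((0 to numBins).map(i => sorted(i * (sorted.size - 1) / numBins)))
    }
  }
}
{code}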



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD

2016-10-18 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-17999:
---

 Summary: Add getPreferredLocations for KafkaSourceRDD
 Key: SPARK-17999
 URL: https://issues.apache.org/jira/browse/SPARK-17999
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Streaming
Reporter: Saisai Shao
Priority: Minor


The newly implemented Structured Streaming KafkaSource did calculate the 
preferred locations for each topic partition, but didn't offer this information 
through RDD's {{getPreferredLocations}} method. So here we propose to add this 
method in {{KafkaSourceRDD}}.
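
A minimal sketch of the shape of such a change, with invented names rather than the actual 
{{KafkaSourceRDD}} internals:
{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that already knows which host holds its data.
case class RangePartition(partitionId: Int, preferredHost: Option[String]) extends Partition {
  override def index: Int = partitionId
}

class ExampleSourceRDD(sc: SparkContext, ranges: Seq[RangePartition])
  extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] = ranges.toArray[Partition]

  // The proposed addition: surface the already-computed location so the
  // scheduler can try to place each task on (or near) that host.
  override def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[RangePartition].preferredHost.toSeq

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty  // a real source would read the records for this range here
}
{code}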



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-18 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587461#comment-15587461
 ] 

Angus Gerry commented on SPARK-10872:
-

Will do. Thanks mate :).

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>   at 
> 

[jira] [Commented] (SPARK-17984) Add support for numa aware feature

2016-10-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587459#comment-15587459
 ] 

Saisai Shao commented on SPARK-17984:
-

NUMA should be supported by most commodity servers as well as HPC. But 
{{numactl}} may not be installed by default in most OSes. Also, other systems 
like Windows or Mac may not have equivalent tools; please take that into account.

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This Jira targets adding support for a NUMA-aware feature, which can help improve 
> performance by making cores access local memory rather than remote memory. 
>  A patch is being developed, see https://github.com/apache/spark/pull/15524.
> And the whole task includes 3 subtasks and will be developed iteratively:
> Numa aware support for Yarn based deployment mode
> Numa aware support for Mesos based deployment mode
> Numa aware support for Standalone based deployment mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17982:
--
Description: 
The following statement fails in the spark shell . 
{noformat}
scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
`gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
{noformat}
This appears to be a limitation of the create view statement .


  was:
The following statement fails in the spark shell . 

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
`gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)


This appears to be a limitation of the create view statement .



> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> 

[jira] [Created] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2016-10-18 Thread Shea Parkes (JIRA)
Shea Parkes created SPARK-17998:
---

 Summary: Reading Parquet files coalesces parts into too few 
in-memory partitions
 Key: SPARK-17998
 URL: https://issues.apache.org/jira/browse/SPARK-17998
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.1, 2.0.0
 Environment: Spark Standalone Cluster (not "local mode")
Windows 10 and Windows 7
Python 3.x
Reporter: Shea Parkes


Reading a parquet ~file into a DataFrame is resulting in far too few in-memory 
partitions.  In prior versions of Spark, the resulting DataFrame would have a 
number of partitions often equal to the number of parts in the parquet folder.

Here's a minimal reproducible sample:

{code:python}
df_first = session.range(start=1, end=1, numPartitions=13)
assert df_first.rdd.getNumPartitions() == 13
assert session._sc.defaultParallelism == 6

path_scrap = r"c:\scratch\scrap.parquet"
df_first.write.parquet(path_scrap)
df_second = session.read.parquet(path_scrap)

print(df_second.rdd.getNumPartitions())
{code}

The above shows only 7 partitions in the DataFrame that was created by reading 
the Parquet back into memory for me.  Why is it no longer just the number of 
part files in the Parquet folder?  (Which is 13 in the example above.)

I'm filing this as a bug because it has gotten so bad that we can't work with 
the underlying RDD without first repartitioning the DataFrame, which is costly 
and wasteful.  I really doubt this was the intended effect of moving to Spark 
2.0.

I've tried to research where the number of in-memory partitions is determined, 
but my Scala skills have proven inadequate.  I'd be happy to dig further if 
someone could point me in the right direction...
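
One thing that may be worth checking (not a confirmed diagnosis): in Spark 2.x the file scan 
packs part files into read partitions by size, so the partition count tends to follow the 
file-source settings and the default parallelism rather than the number of part files. A 
spark-shell sketch, reusing the path from the repro above:
{code}
spark.conf.set("spark.sql.files.maxPartitionBytes", 8L * 1024 * 1024)  // default 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", 1L * 1024 * 1024)    // default 4 MB

// Explicit workaround: force the partition count after reading.
val dfSecond = spark.read.parquet("c:/scratch/scrap.parquet").repartition(13)
{code}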



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587433#comment-15587433
 ] 

Jiang Xingbo commented on SPARK-17982:
--

Would you please give an example where it works? Thanks!

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2016-10-18 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-17997:


 Summary: Aggregation function for counting distinct values for 
multiple intervals
 Key: SPARK-17997
 URL: https://issues.apache.org/jira/browse/SPARK-17997
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhenhua Wang


This is for computing ndv's for bins in equi-height histograms. A bin consists 
of two endpoints which form an interval of values and the ndv in that interval. 
For computing histogram statistics, after getting the endpoints, we need an agg 
function to count distinct values in each interval.
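
For illustration, a minimal sketch of the computation this agg function needs, written with 
the existing DataFrame API (the column name "v" and the {{endpoints}} sequence are assumed; 
a dedicated agg function would avoid the nested CASE WHEN):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct, lit, when}

// Count distinct values of column "v" inside each interval (endpoints(i), endpoints(i + 1)].
def ndvPerInterval(df: DataFrame, endpoints: Seq[Double]): DataFrame = {
  var bucket = lit(-1)  // values outside every interval fall into bucket -1
  for (i <- 0 until endpoints.size - 1) {
    bucket = when(col("v") > endpoints(i) && col("v") <= endpoints(i + 1), i).otherwise(bucket)
  }
  df.withColumn("bucket", bucket)
    .groupBy("bucket")
    .agg(countDistinct(col("v")).as("ndv"))
}
{code}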



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17996:


Assignee: Herman van Hovell  (was: Apache Spark)

> catalog.getFunction(name) returns wrong result for a permanent function
> ---
>
> Key: SPARK-17996
> URL: https://issues.apache.org/jira/browse/SPARK-17996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> The catalog returns a wrong result if we look up a permanent function without 
> specifying the database. For example:
> {noformat}
> scala> sql("create function fn1 as 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.getFunction("fn1")
> res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', 
> className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', 
> isTemporary='true']
> {noformat}
> It should not report that this function is temporary, and it should have its database defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587425#comment-15587425
 ] 

Apache Spark commented on SPARK-17996:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15542

> catalog.getFunction(name) returns wrong result for a permanent function
> ---
>
> Key: SPARK-17996
> URL: https://issues.apache.org/jira/browse/SPARK-17996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> The catalog returns a wrong result if we look up a permanent function without 
> specifying the database. For example:
> {noformat}
> scala> sql("create function fn1 as 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.getFunction("fn1")
> res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', 
> className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', 
> isTemporary='true']
> {noformat}
> It should not report that this function is temporary, and it should have its database defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17996:


Assignee: Apache Spark  (was: Herman van Hovell)

> catalog.getFunction(name) returns wrong result for a permanent function
> ---
>
> Key: SPARK-17996
> URL: https://issues.apache.org/jira/browse/SPARK-17996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> The catalog returns a wrong result if we look up a permanent function without 
> specifying the database. For example:
> {noformat}
> scala> sql("create function fn1 as 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.getFunction("fn1")
> res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', 
> className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', 
> isTemporary='true']
> {noformat}
> It should not report that this function is temporary, and it should have its database defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587420#comment-15587420
 ] 

Jiang Xingbo edited comment on SPARK-17982 at 10/19/16 2:22 AM:


[~dongjoon] In your examples there is a misleading part:
{code}
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []
{code}
The above "(id2)" in "v1(id2)" is in fact an identifierCommentList instead of 
colTypeList, so it is not actually creating columns accord.

Perhaps we should listen to [~hvanhovell] on whether we should support specifying 
columns in CreateView? 


was (Author: jiangxb1987):
[~dongjoon] In your examples there is a misleading part:
{code}
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []
{code}
The above "(id2)" in "v1(id2)" is infact an identifierCommentList instead of 
colTypeList, so it is not actually creating columns accord.

Perhaps we should listen to [~hvanhovell] whether we should support specify 
columns in CreateView? 

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17980:

Assignee: Eric Liang

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted Hive tables (though it turns out it 
> was very difficult to refresh converted Hive tables anyway, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17980.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15521
[https://github.com/apache/spark/pull/15521]

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted Hive tables (though it turns out it 
> was very difficult to refresh converted Hive tables anyway, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587420#comment-15587420
 ] 

Jiang Xingbo commented on SPARK-17982:
--

[~dongjoon] In your examples there is a misleading part:
{code}
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []
{code}
The above "(id2)" in "v1(id2)" is infact an identifierCommentList instead of 
colTypeList, so it is not actually creating columns accord.

Perhaps we should listen to [~hvanhovell] whether we should support specify 
columns in CreateView? 

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17984) Add support for numa aware feature

2016-10-18 Thread quanfuwang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587374#comment-15587374
 ] 

quanfuwang commented on SPARK-17984:


Normal servers support NUMA, but I'm considering abstracting it away so that it does 
not bring in a dependency on numactl.
Yes, one can't always find numactl.

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This Jira targets adding support for a NUMA-aware feature, which can help improve 
> performance by making cores access local memory rather than remote memory. 
>  A patch is being developed, see https://github.com/apache/spark/pull/15524.
> And the whole task includes 3 subtasks and will be developed iteratively:
> Numa aware support for Yarn based deployment mode
> Numa aware support for Mesos based deployment mode
> Numa aware support for Standalone based deployment mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17996) catalog.getFunction(name) returns wrong result for a permanent function

2016-10-18 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-17996:
-

 Summary: catalog.getFunction(name) returns wrong result for a 
permanent function
 Key: SPARK-17996
 URL: https://issues.apache.org/jira/browse/SPARK-17996
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor


The catalog returns a wrong result if we look up a permanent function without 
specifying the database. For example:
{noformat}
scala> sql("create function fn1 as 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.catalog.getFunction("fn1")
res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', 
className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', 
isTemporary='true']
{noformat}

It should not report that this function is temporary, and it should have its database defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587348#comment-15587348
 ] 

Wenchen Fan edited comment on SPARK-17995 at 10/19/16 1:46 AM:
---

Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` 
class? e.g.
{code}
case class Join(..., newOuterJoinAttrs: Seq[Attribute]) {
  def output = joinType match {
case LeftOuterJoin => left.output ++ newOuterJoinAttrs
  }
}

object Join {
  def apply(...) = {
val newOuterJoinAttrs = joinType match {
  case LeftOuterJoin => right.output.map(_.newInstance)
}
Join(..., newOuterJoinAttrs)
  }
}
{code}


was (Author: cloud_fan):
Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` 
class? e.g.
{code}
case class Join(..., newOuterJoinAttrs: Seq[Attribute])

object Join {
  def apply(...) = {
val newOuterJoinAttrs = joinType match {
  case LeftOuterJoin => right.output.map(_.newInstance)
}
  }
}
{code}

> Use new attributes for columns from outer joins
> ---
>
> Key: SPARK-17995
> URL: https://issues.apache.org/jira/browse/SPARK-17995
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Ryan Blue
>
> Plans involving outer joins use the same attribute reference (by exprId) to 
> reference columns above the join and below the join. This is a false 
> equivalence that leads to bugs like SPARK-16181, in which attributes were 
> incorrectly replaced by the optimizer. The column has a different schema 
> above the outer join because its values may be null. The fix for that issue, 
> [PR #13884](https://github.com/apache/spark/pull/13884) has a TODO comment 
> from [~cloud_fan] to fix this by using different attributes instead of 
> needing to special-case outer joins in rules and this issue is to track that 
> improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587348#comment-15587348
 ] 

Wenchen Fan commented on SPARK-17995:
-

Can we just add a `newOuterJoinAttrs: Seq[Attribute]` parameter in the `Join` 
class? e.g.
{code}
case class Join(..., newOuterJoinAttrs: Seq[Attribute])

object Join {
  def apply(...) = {
val newOuterJoinAttrs = joinType match {
  case LeftOuterJoin => right.output.map(_.newInstance)
}
  }
}
{code}

> Use new attributes for columns from outer joins
> ---
>
> Key: SPARK-17995
> URL: https://issues.apache.org/jira/browse/SPARK-17995
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Ryan Blue
>
> Plans involving outer joins use the same attribute reference (by exprId) to 
> reference columns above the join and below the join. This is a false 
> equivalence that leads to bugs like SPARK-16181, in which attributes were 
> incorrectly replaced by the optimizer. The column has a different schema 
> above the outer join because its values may be null. The fix for that issue, 
> [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment 
> from [~cloud_fan] to fix this by using different attributes instead of 
> special-casing outer joins in rules; this issue tracks that improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587307#comment-15587307
 ] 

Apache Spark commented on SPARK-17637:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15541

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks 
> across executors, which is great as it distributes the load evenly across 
> the cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17985:


Assignee: Takuya Ueshin  (was: Apache Spark)

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck because of a race condition when 
> initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.
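A sketch (not a reproducer from the report) of the kind of concurrent usage that can hit the LANG-1251 race with commons-lang3 < 3.5; the thread count, loop count, and cloned object are arbitrary:
{code}
import org.apache.commons.lang3.SerializationUtils

// Several threads cloning serializable objects at the same time can all race
// on the primitive-type map initialization described in LANG-1251.
val threads = (1 to 8).map { _ =>
  new Thread(new Runnable {
    override def run(): Unit =
      (1 to 1000).foreach(_ => SerializationUtils.clone(Vector("a", "b", "c")))
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}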



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17985:


Assignee: Apache Spark  (was: Takuya Ueshin)

> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck because of a race condition when 
> initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587239#comment-15587239
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

I'll make a PR for this today.

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587222#comment-15587222
 ] 

Dongjoon Hyun edited comment on SPARK-17982 at 10/19/16 12:55 AM:
--

Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`.
The following is the simplified version of your case, isn't it?
{code}
scala> spark.version
res0: String = 2.1.0-SNAPSHOT

scala> sql("CREATE TABLE tbl(id INT)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v3(id2) AS SELECT id FROM tbl limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
{code}


was (Author: dongjoon):
Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`.
The following is the simplified version of your case, isn't it?
{code}
scala> spark.version
res0: String = 2.1.0-SNAPSHOT

scala> sql("CREATE TABLE tbl(id INT)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v2(id2) AS SELECT id FROM tbl limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
{code}

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587222#comment-15587222
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

Sorry, [~tafra...@gmail.com]. Now, I understand what you meant by `limit`.
The following is the simplified version of your case, isn't it?
{code}
scala> spark.version
res0: String = 2.1.0-SNAPSHOT

scala> sql("CREATE TABLE tbl(id INT)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v2 AS SELECT id FROM tbl limit 2")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE VIEW v2(id2) AS SELECT id FROM tbl limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
{code}

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-10-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14393:
-
Affects Version/s: 2.0.0
   2.0.1

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.0.1
>Reporter: Jason Piper
>  Labels: correctness
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-10-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14393:
--
Labels: correctness  (was: )

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>  Labels: correctness
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587124#comment-15587124
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

I'll investigate it for you, [~tafra...@gmail.com].

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587106#comment-15587106
 ] 

Apache Spark commented on SPARK-17994:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15539

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]
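A rough sketch of the proposed cache shape, with hypothetical class and method names (not Spark's actual implementation):
{code}
import scala.collection.concurrent.TrieMap
import org.apache.hadoop.fs.{FileStatus, Path}

// One instance per table: a Map[Path, Array[FileStatus]] that is filled in
// incrementally as partitions are listed, and dropped wholesale on refresh.
class TableFileStatusCache {
  private val cache = TrieMap.empty[Path, Array[FileStatus]]

  def putPartition(path: Path, files: Array[FileStatus]): Unit =
    cache.put(path, files)

  def getPartition(path: Path): Option[Array[FileStatus]] =
    cache.get(path)

  // Invalidation hook, e.g. what refreshTable()/refreshByPath() would call.
  def invalidateAll(): Unit = cache.clear()
}
{code}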



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17994:


Assignee: (was: Apache Spark)

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17994:


Assignee: Apache Spark

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17993:


Assignee: Apache Spark

> Spark spews a slew of harmless but annoying warning messages from Parquet 
> when reading parquet files written by older versions of Parquet-mr
> 
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Apache Spark
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.
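Until the redirection is restored, a possible user-side workaround (a sketch, not the actual Spark fix) is to silence the offending logger directly; the timestamped WARNING lines above are java.util.logging output from Parquet-mr:
{code}
import java.util.logging.{Level, Logger}

// Keep a reference to the logger: java.util.logging holds loggers weakly, so
// an unreferenced logger could be garbage-collected and the level reset.
val parquetLogger = Logger.getLogger("org.apache.parquet")
parquetLogger.setLevel(Level.SEVERE)
{code}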



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17993:


Assignee: (was: Apache Spark)

> Spark spews a slew of harmless but annoying warning messages from Parquet 
> when reading parquet files written by older versions of Parquet-mr
> 
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586987#comment-15586987
 ] 

Apache Spark commented on SPARK-17993:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/15538

> Spark spews a slew of harmless but annoying warning messages from Parquet 
> when reading parquet files written by older versions of Parquet-mr
> 
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-17630) jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available for akka

2016-10-18 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586979#comment-15586979
 ] 

Shixiong Zhu commented on SPARK-17630:
--

Yeah, I think we can set up SparkUncaughtExceptionHandler for R and Python 
users.
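For illustration, a minimal sketch of the general JVM mechanism being discussed (not Spark's internal SparkUncaughtExceptionHandler): a default uncaught-exception handler that halts the JVM on fatal errors so callers such as py4j see the failure instead of a hung process:
{code}
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, throwable: Throwable): Unit =
    throwable match {
      // Fatal error: stop the JVM immediately so the caller is notified.
      case _: OutOfMemoryError => Runtime.getRuntime.halt(1)
      case t => System.err.println(s"Uncaught exception in ${thread.getName}: $t")
    }
})
{code}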

> jvm-exit-on-fatal-error handler for spark.rpc.netty like there is available 
> for akka
> 
>
> Key: SPARK-17630
> URL: https://issues.apache.org/jira/browse/SPARK-17630
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Mario Briggs
> Attachments: SecondCodePath.txt, firstCodepath.txt
>
>
> Hi,
> I have two code paths in my app that result in a JVM OOM. 
> In the first code path, 'akka.jvm-exit-on-fatal-error' kicks in and shuts 
> down the JVM, so that the caller (py4J) gets notified with a proper stack 
> trace. Attached stack-trace file (firstCodepath.txt)
> In the second code path (rpc.netty), no such handler kicks in to shut down 
> the JVM, so the caller does not get notified. 
> Attached stack-trace file (SecondCodepath.txt)
> Is it possible to have a JVM exit handler for the rpc.netty path?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586972#comment-15586972
 ] 

Ryan Blue commented on SPARK-17995:
---

[~cloud_fan] and [~yhuai], I'd like to help fix this, but I'm not sure of the 
best way.

I started to write an analyzer rule that uses transformUp on the initial 
logical plan, before unresolved aliases are resolved. That rule would find 
outer joins and generate a map from the attributes to replace to the new 
attributes above the join, with the schema updated to be nullable and with a 
new exprId. The attributes to replace come from the output of the outer join.

Where I ran into trouble was in replacing the attributes in the logical plan 
above the join. I don't think it is a good idea to have cases in the rule for 
every possible plan, so I think we need a method to substitute attributes that 
is implemented by nodes in the plan. That sounds like a larger patch than I 
originally thought, so I wanted to make sure I'm going down the right path 
before I put up a PR for it. What do you think?
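To make the first half of that rule concrete, here is a rough sketch under the assumptions above (hypothetical rule name; it only generates replacement attributes for the nullable side of a left outer join and leaves the hard part, rewriting the parent plan, unimplemented):
{code}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.LeftOuter
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CollectOuterJoinRewrites extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // New exprId and nullable schema for columns from the nullable side.
    val replacements: Map[Attribute, Attribute] = plan.collect {
      case Join(_, right, LeftOuter, _) =>
        right.output.map(a => a -> a.newInstance().withNullability(true))
    }.flatten.toMap

    // Substituting `replacements` into every operator above each join is the
    // open question in this comment; the plan is returned unchanged here.
    plan
  }
}
{code}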

> Use new attributes for columns from outer joins
> ---
>
> Key: SPARK-17995
> URL: https://issues.apache.org/jira/browse/SPARK-17995
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
>Reporter: Ryan Blue
>
> Plans involving outer joins use the same attribute reference (by exprId) to 
> reference columns above the join and below the join. This is a false 
> equivalence that leads to bugs like SPARK-16181, in which attributes were 
> incorrectly replaced by the optimizer. The column has a different schema 
> above the outer join because its values may be null. The fix for that issue, 
> [PR #13884](https://github.com/apache/spark/pull/13884), has a TODO comment 
> from [~cloud_fan] to fix this by using different attributes instead of 
> special-casing outer joins in rules; this issue tracks that improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17995) Use new attributes for columns from outer joins

2016-10-18 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-17995:
-

 Summary: Use new attributes for columns from outer joins
 Key: SPARK-17995
 URL: https://issues.apache.org/jira/browse/SPARK-17995
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0, 1.6.2, 2.1.0
Reporter: Ryan Blue


Plans involving outer joins use the same attribute reference (by exprId) to 
reference columns above the join and below the join. This is a false 
equivalence that leads to bugs like SPARK-16181, in which attributes were 
incorrectly replaced by the optimizer. The column has a different schema above 
the outer join because its values may be null. The fix for that issue, [PR 
#13884](https://github.com/apache/spark/pull/13884), has a TODO comment from 
[~cloud_fan] to fix this by using different attributes instead of 
special-casing outer joins in rules; this issue tracks that improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586918#comment-15586918
 ] 

Franck Tago commented on SPARK-17982:
-

Updated the title of the Jira.

I looked at the views.scala file and I want to know whether setting the flag 
spark.sql.nativeView.canonical to false is an acceptable workaround.

I tested it and it works, but the question is whether that is an acceptable 
workaround.
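For clarity, the workaround being asked about amounts to the following (a sketch; spark.sql.nativeView.canonical is an internal flag, so whether relying on it is acceptable is exactly the open question here):
{code}
// Disable canonical native views before issuing the CREATE VIEW statement.
spark.conf.set("spark.sql.nativeView.canonical", "false")
{code}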

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-17982:

Summary: Spark 2.0.0  CREATE VIEW statement fails :: 
java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is 
possible there is a bug in Spark.  (was: Spark 2.0.0  CREATE VIEW statement 
fails when select statement contains limit clause)

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell.
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-10-18 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586897#comment-15586897
 ] 

Michael Allman commented on SPARK-17993:


cc [~ekhliang]

I think I have a fix for this. I'm going to submit a PR shortly.

> Spark spews a slew of harmless but annoying warning messages from Parquet 
> when reading parquet files written by older versions of Parquet-mr
> 
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> It looks like https://github.com/apache/spark/pull/14690 broke parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow-up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.
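As a stopgap until the fix lands, one user-side option is to raise the level of the java.util.logging loggers the Parquet reader emits these warnings through. This is only a sketch of that workaround, not the fix proposed here; the logger names are assumptions based on the packages in the stack trace above.

{code}
// Hedged user-side sketch: silence the JUL loggers Parquet 1.6-era readers warn through.
// The logger names are assumptions inferred from the stack trace, not a documented API.
// Keep references to the Logger objects so the LogManager does not discard the settings.
import java.util.logging.{Level, Logger}

val parquetLoggers: Seq[Logger] =
  Seq("org.apache.parquet", "org.apache.parquet.CorruptStatistics")
    .map(name => Logger.getLogger(name))

parquetLoggers.foreach(_.setLevel(Level.SEVERE)) // drop WARNING and below, keep real errors
{code}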



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-10-18 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586892#comment-15586892
 ] 

Michael Allman commented on SPARK-17992:


cc [~ekhliang] [~cloud_fan]

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a PostgreSQL metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark already does for Hive 0.12, 
> which does not support this feature at all.
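A rough sketch of the suggested catch-and-fall-back behaviour follows. The function and parameter names are hypothetical placeholders, not Spark's actual HiveClientImpl internals.

{code}
// Hedged sketch of the fallback suggested above: try metastore-side filter pushdown,
// and if the metastore rejects the filter, fetch all partitions and let Spark prune
// client-side. Names here are illustrative, not the real HiveClientImpl methods.
import scala.util.control.NonFatal

def getPartitionsByFilterSketch[P](
    tryPushdown: () => Seq[P],       // e.g. Hive's getPartitionsByFilter
    loadAllPartitions: () => Seq[P]  // e.g. Hive's getAllPartitions
  ): Seq[P] = {
  try {
    tryPushdown()
  } catch {
    case NonFatal(e) if e.getClass.getName ==
        "org.apache.hadoop.hive.metastore.api.MetaException" =>
      // Metastore could not evaluate the filter (hive.metastore.try.direct.sql=false);
      // fall back to returning every partition.
      loadAllPartitions()
  }
}
{code}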



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-18 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586888#comment-15586888
 ] 

Michael Allman commented on SPARK-17990:


CC [~ekhliang] [~cloud_fan]

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=6
> drwxr-xr-x  

[jira] [Commented] (SPARK-17711) Compress rolled executor logs

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586842#comment-15586842
 ] 

Apache Spark commented on SPARK-17711:
--

User 'loneknightpy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15537

> Compress rolled executor logs
> -
>
> Key: SPARK-17711
> URL: https://issues.apache.org/jira/browse/SPARK-17711
> Project: Spark
>  Issue Type: New Feature
>Reporter: Yu Peng
> Fix For: 2.0.2, 2.2.0
>
>
> Currently, rolled executor logs are not compressed. If an executor produces 
> a lot of logs, they can consume all of its disk space and fail the task. With 
> this feature, the executor compresses the rolled stderr/stdout logs, much as 
> log4j's rolling appenders do, to reduce disk usage. 
> [~mengxr]
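For reference, rolled executor logs are controlled by the documented spark.executor.logs.rolling.* settings; the sketch below shows how a job might opt in. The name of the compression switch is an assumption about what this feature adds, so confirm it against the merged pull request.

{code}
// Sketch of size-based log rolling plus the compression switch this ticket adds.
// The existing spark.executor.logs.rolling.* keys are documented; the exact name of
// the compression flag below is an assumption -- verify it against the merged PR.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.maxSize", (128 * 1024 * 1024).toString) // 128 MB per file
  .set("spark.executor.logs.rolling.maxRetainedFiles", "10")
  .set("spark.executor.logs.rolling.enableCompression", "true") // assumed key for this feature
{code}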



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586829#comment-15586829
 ] 

Dongjoon Hyun commented on SPARK-17982:
---

Hi, [~tafra...@gmail.com].
Could you update your JIRA title?
As you see in [~jiangxb]'s example, `limit` is not a problem.

> Spark 2.0.0  CREATE VIEW statement fails when select statement contains limit 
> clause
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the CREATE VIEW statement.
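Until this is resolved, one workaround worth trying is to keep the LIMIT out of the view's own SQL, for example by materializing the limited result first and defining the view over that. This is only a hedged sketch (table and view names below are made up), not a statement about the root cause, and materializing does change the semantics from a live view to a snapshot.

{code}
// Hedged workaround sketch: materialize the limited rows, then define the view
// over the materialized table so the view SQL no longer carries a LIMIT clause.
// The names where_top2 and sparkshell_2_view are hypothetical.
spark.sql("CREATE TABLE default.where_top2 AS SELECT `where`.id, `where`.name FROM default.`where` LIMIT 2")
spark.sql("CREATE VIEW default.sparkshell_2_view (WHERE_ID, WHERE_NAME) AS SELECT id, name FROM default.where_top2")
{code}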



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12149) Executor UI improvement suggestions - Color UI

2016-10-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586725#comment-15586725
 ] 

Reynold Xin edited comment on SPARK-12149 at 10/18/16 9:31 PM:
---

So I looked at the UI today and I have to say the color code was *extremely 
confusing*. On a normal cluster I saw a bunch of reds, which is usually 
reserved for errors.

(I'm going to leave a comment on the JIRA ticket too)


was (Author: rxin):
So I looked at the UI today and I have to say the color code is extremely 
confusing. On a normal cluster I saw a bunch of reds, which is usually reserved 
for errors.

(I'm going to leave a comment on the JIRA ticket too)

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed)).
> If dark blue, then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time
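A sketch of the shading rules described above, in plain Scala rather than the Web UI's actual templating code; the thresholds and scaling factors are illustrative only.

{code}
// Illustrative sketch of the proposed cell shading, not the Web UI's real code.
// Failed tasks shade red, active tasks shade green, completed tasks shade blue
// on a log scale, as the description suggests.
def cellColor(failed: Int, active: Int, completed: Int): String = {
  if (failed > 0) {
    val intensity = math.min(255, 64 + failed * 16)                 // more failures, deeper red
    s"rgb($intensity, 0, 0)"
  } else if (active > 0) {
    val intensity = math.min(255, 64 + active * 16)                 // more active tasks, deeper green
    s"rgb(0, $intensity, 0)"
  } else if (completed > 0) {
    val shade = math.min(255, (math.log1p(completed) * 32).toInt)   // log(# completed) shades of blue
    s"rgb(0, 0, $shade)"
  } else {
    "transparent"
  }
}
{code}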



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12149) Executor UI improvement suggestions - Color UI

2016-10-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586725#comment-15586725
 ] 

Reynold Xin commented on SPARK-12149:
-

So I looked at the UI today and I have to say the color code is extremely 
confusing. On a normal cluster I saw a bunch of reds, which is usually reserved 
for errors.

(I'm going to leave a comment on the JIRA ticket too)

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed)
> if dark blue then write the value in white (same for the RED and GREEN above
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15708) Tasks table in Detailed Stage page shows ip instead of hostname under Executor ID/Host

2016-10-18 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586703#comment-15586703
 ] 

Alex Bozarth commented on SPARK-15708:
--

I have not seen this recently, so it was possibly fixed incidentally. I had this 
on my back burner and did the above research before seeing it had been closed, 
so I wanted to share it for future reference.

> Tasks table in Detailed Stage page shows ip instead of hostname under 
> Executor ID/Host
> --
>
> Key: SPARK-15708
> URL: https://issues.apache.org/jira/browse/SPARK-15708
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
>
> If you go to the detailed Stage page in Spark 2.0, the Executor ID/Host 
> column of the Tasks table shows the host as an IP address rather than a 
> fully qualified hostname.
> The table above it (Aggregated Metrics by Executor) shows the "Address" as 
> the full hostname.
> I'm running Spark on YARN on the latest branch-2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13275:


Assignee: (was: Apache Spark)

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Priority: Minor
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586680#comment-15586680
 ] 

Apache Spark commented on SPARK-13275:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/15536

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Priority: Minor
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13275:


Assignee: Apache Spark

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Assignee: Apache Spark
>Priority: Minor
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15769) Add Encoder for input type to Aggregator

2016-10-18 Thread Koert Kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586654#comment-15586654
 ] 

Koert Kuipers commented on SPARK-15769:
---

What I really want is to be able to use Aggregator in DataFrame, without
having to resort to taking apart Row. This way we can develop all general
aggregation algorithms in Aggregator and use them across the board in
spark-sql.

To be specific: in RelationalGroupedDataset I want to be able to apply an
Aggregator to input specified by the column names, and the input gets
converted to the input type of the Aggregator, just like we do in UDFs and
such.

An example (taken from my pull request) where ComplexResultAgg is an
Aggregator[(String, Long), (Long, Long), (Long, Long)]:
val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k")
df3.groupBy("i").agg(ComplexResultAgg("i", "k"))

This applies the Aggregator to columns "i" and "k".

I found creating an inputDeserializer from the encoder to be the easiest way
to make this all work, plus my pull request removes a lot of ad hoc stuff (all
the withInputType stuff), indicating to me it's cleaner. I also like how this
catches mistakes earlier (because you need an implicit encoder) versus
storing TypeTags etc. and creating deserializer/converter expressions at
runtime. But yeah, maybe I am misunderstanding encoders and input data type
is all we need.





> Add Encoder for input type to Aggregator
> 
>
> Key: SPARK-15769
> URL: https://issues.apache.org/jira/browse/SPARK-15769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> Currently org.apache.spark.sql.expressions.Aggregator has Encoders for its 
> buffer and output type, but not for its input type. The thought is that the 
> input type is known from the Dataset it operates on and hence can be inserted 
> later.
> However i think there are compelling reasons to have Aggregator carry an 
> Encoder for its input type:
> * Generally transformations on Dataset only require the Encoder for the 
> result type since the input type is exactly known and its Encoder is already 
> available within the Dataset. However this is not the case for an Aggregator: 
> an Aggregator is defined independently of a Dataset, and i think it should be 
> generally desirable that an Aggregator work on any type that can safely be 
> cast to the Aggregator's input type (for example an Aggregator that has Long 
> as input should work on a Dataset of Ints).
> * Aggregators should also work on DataFrames, because it's a much nicer API to 
> use than UserDefinedAggregateFunction. And when operating on DataFrames you 
> should not have to use Row objects, which means your input type is not equal 
> to the type of the Dataset you operate on (so the Encoder of the Dataset that 
> is operated on should not be used as input Encoder for the Aggregator).
> * Adding an input Encoder is not a big burden, since it can typically be 
> created implicitly
> * It removes TypedColumn.withInputType and its usage in Dataset, 
> KeyValueGroupedDataset and RelationalGroupedDataset, which always felt 
> somewhat ad-hoc to me
> * Once an Aggregator has an Encoder for its input type, it is a small change 
> to make the Aggregator also work on a subset of columns in a DataFrame, which 
> facilitates Aggregator re-use since you don't have to write a custom 
> Aggregator to extract the columns from a specific DataFrame. This also 
> enables a usage that is more typical within a DataFrame context, very similar 
> to how a UserDefinedAggregateFunction is used.
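For contrast with the proposal, here is a minimal sketch of how an Aggregator is wired up today: encoders are supplied for the buffer and output types, while the input encoder comes from the typed Dataset the aggregator is applied to. SumOfK is a made-up example, not code from the pull request.

{code}
// Minimal sketch of the current Aggregator API (buffer and output encoders only);
// the input encoder is taken from the Dataset, which is what this ticket wants to change.
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

object SumOfK extends Aggregator[(String, Long), Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: (String, Long)): Long = b + a._2
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").appName("aggregator-sketch").getOrCreate()
import spark.implicits._

// Works on a typed Dataset today; applying it to DataFrame columns is the proposal.
val ds = Seq(("a", 1L), ("a", 3L), ("b", 3L)).toDS()
ds.groupByKey(_._1).agg(SumOfK.toColumn).show()
{code}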



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17841) Kafka 0.10 commitQueue needs to be drained

2016-10-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17841.
-
   Resolution: Fixed
 Assignee: Cody Koeninger
Fix Version/s: 2.1.0
   2.0.2

> Kafka 0.10 commitQueue needs to be drained
> --
>
> Key: SPARK-17841
> URL: https://issues.apache.org/jira/browse/SPARK-17841
> Project: Spark
>  Issue Type: Bug
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
> Fix For: 2.0.2, 2.1.0
>
>
> The current implementation just iterates over the commit queue instead of 
> polling and removing entries, so the queue is never drained.
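In other words, queued offset commits should be removed as they are processed. A generic sketch of the poll-based drain pattern (not the actual DirectKafkaInputDStream code) looks like this:

{code}
// Generic drain pattern: poll() both reads and removes each element, whereas
// iterating over the queue leaves every entry in place. Not Spark's actual code.
import java.util.concurrent.ConcurrentLinkedQueue

def drain[A](queue: ConcurrentLinkedQueue[A])(process: A => Unit): Unit = {
  var elem = queue.poll()
  while (elem != null) {
    process(elem)
    elem = queue.poll()
  }
}
{code}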



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586644#comment-15586644
 ] 

Reynold Xin edited comment on SPARK-17985 at 10/18/16 8:59 PM:
---

The patch was reverted due to build failures in Hadoop 2.2:

Actually I'm seeing the following exceptions locally as well as on Jenkins for 
Hadoop 2.2:

{noformat}
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: 
value read is not a member of object org.apache.commons.io.IOUtils
[error]   var numBytes = IOUtils.read(gzInputStream, buf)
[error]  ^
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: 
value read is not a member of object org.apache.commons.io.IOUtils
[error] numBytes = IOUtils.read(gzInputStream, buf)
[error]^
{noformat}

I'm going to revert the patch for now.



was (Author: rxin):
The patch was reverted due to build failures in Hadoop 2.2:

Actually I'm seeing the following exceptions locally as well as on Jenkins for 
Hadoop 2.2:

```
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: 
value read is not a member of object org.apache.commons.io.IOUtils
[error]   var numBytes = IOUtils.read(gzInputStream, buf)
[error]  ^
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: 
value read is not a member of object org.apache.commons.io.IOUtils
[error] numBytes = IOUtils.read(gzInputStream, buf)
[error]^
```

I'm going to revert the patch for now.


> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck due to a race condition when 
> initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.
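For context, a sketch of the usage pattern affected by LANG-1251: several threads cloning serializable objects concurrently, which could hang on commons-lang3 versions before 3.5. The Payload class is just an illustrative stand-in, and this is not a guaranteed reproducer.

{code}
// Hedged illustration of the affected pattern: concurrent SerializationUtils.clone
// calls race on the library's internal class cache in commons-lang3 < 3.5 (LANG-1251).
import org.apache.commons.lang3.SerializationUtils

case class Payload(id: Int, words: Vector[String])

val threads = (1 to 8).map { i =>
  new Thread(new Runnable {
    override def run(): Unit = {
      val copy = SerializationUtils.clone(Payload(i, Vector("a", "b")))
      assert(copy == Payload(i, Vector("a", "b")))
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}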



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17985) Bump commons-lang3 version to 3.5.

2016-10-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-17985:
-

The patch was reverted due to build failures in Hadoop 2.2:

Actually I'm seeing the following exceptions locally as well as on Jenkins for 
Hadoop 2.2:

```
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: 
value read is not a member of object org.apache.commons.io.IOUtils
[error]   var numBytes = IOUtils.read(gzInputStream, buf)
[error]  ^
[error] 
/scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: 
value read is not a member of object org.apache.commons.io.IOUtils
[error] numBytes = IOUtils.read(gzInputStream, buf)
[error]^
```

I'm going to revert the patch for now.


> Bump commons-lang3 version to 3.5.
> --
>
> Key: SPARK-17985
> URL: https://issues.apache.org/jira/browse/SPARK-17985
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> {{SerializationUtils.clone()}} of commons-lang3 (<3.5) has a bug that breaks 
> thread safety: it sometimes gets stuck due to a race condition when 
> initializing a hash map.
> See https://issues.apache.org/jira/browse/LANG-1251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17731) Metrics for Structured Streaming

2016-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586621#comment-15586621
 ] 

Apache Spark commented on SPARK-17731:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15535

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.2, 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


