[jira] [Assigned] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17783:


Assignee: (was: Apache Spark)

> Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP 
> Table for JDBC
> ---
>
> Key: SPARK-17783
> URL: https://issues.apache.org/jira/browse/SPARK-17783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Priority: Critical
>
> We should never expose the credentials in the EXPLAIN and DESC 
> FORMATTED/EXTENDED commands. However, the commands below expose the credentials. 
> {noformat}
> CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateDataSourceTableCommand CatalogTable(
>   Table: `tab1`
>   Created: Tue Oct 04 21:39:44 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Provider: org.apache.spark.sql.jdbc
>   Storage(Properties: 
> [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, 
> dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false
> {noformat}
> {noformat}
> DESC FORMATTED tab1
> {noformat}
> {noformat}
> ...
> |# Storage Information   |                                                                   |   |
> |Compressed:             |No                                                                 |   |
> |Storage Desc Parameters:|                                                                   |   |
> |  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
> |  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
> |  dbtable               |TEST.PEOPLE                                                        |   |
> |  user                  |testUser                                                           |   |
> |  password              |testPass                                                           |   |
> +------------------------+-------------------------------------------------------------------+---+
> {noformat}
> {noformat}
> DESC EXTENDED tab1
> {noformat}
> {noformat}
> ...
>   Storage(Properties: 
> [path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
> url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
> user=testUser, password=testPass]))|   |
> {noformat}
> {noformat}
> CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url 
> -> jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> 
> TEST.PEOPLE, user -> testUser, password -> testPass)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547664#comment-15547664
 ] 

Apache Spark commented on SPARK-17783:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15358

> Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP 
> Table for JDBC
> ---
>
> Key: SPARK-17783
> URL: https://issues.apache.org/jira/browse/SPARK-17783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Priority: Critical
>
> We should never expose the credentials in the EXPLAIN and DESC 
> FORMATTED/EXTENDED commands. However, the commands below expose the credentials. 
> {noformat}
> CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateDataSourceTableCommand CatalogTable(
>   Table: `tab1`
>   Created: Tue Oct 04 21:39:44 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Provider: org.apache.spark.sql.jdbc
>   Storage(Properties: 
> [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, 
> dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false
> {noformat}
> {noformat}
> DESC FORMATTED tab1
> {noformat}
> {noformat}
> ...
> |# Storage Information   |                                                                   |   |
> |Compressed:             |No                                                                 |   |
> |Storage Desc Parameters:|                                                                   |   |
> |  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
> |  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
> |  dbtable               |TEST.PEOPLE                                                        |   |
> |  user                  |testUser                                                           |   |
> |  password              |testPass                                                           |   |
> +------------------------+-------------------------------------------------------------------+---+
> {noformat}
> {noformat}
> DESC EXTENDED tab1
> {noformat}
> {noformat}
> ...
>   Storage(Properties: 
> [path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
> url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
> user=testUser, password=testPass]))|   |
> {noformat}
> {noformat}
> CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url 
> -> jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> 
> TEST.PEOPLE, user -> testUser, password -> testPass)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17783:


Assignee: Apache Spark

> Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP 
> Table for JDBC
> ---
>
> Key: SPARK-17783
> URL: https://issues.apache.org/jira/browse/SPARK-17783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> We should never expose the credentials in the EXPLAIN and DESC 
> FORMATTED/EXTENDED commands. However, the commands below expose the credentials. 
> {noformat}
> CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateDataSourceTableCommand CatalogTable(
>   Table: `tab1`
>   Created: Tue Oct 04 21:39:44 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Provider: org.apache.spark.sql.jdbc
>   Storage(Properties: 
> [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, 
> dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false
> {noformat}
> {noformat}
> DESC FORMATTED tab1
> {noformat}
> {noformat}
> ...
> |# Storage Information   |                                                                   |   |
> |Compressed:             |No                                                                 |   |
> |Storage Desc Parameters:|                                                                   |   |
> |  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
> |  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
> |  dbtable               |TEST.PEOPLE                                                        |   |
> |  user                  |testUser                                                           |   |
> |  password              |testPass                                                           |   |
> +------------------------+-------------------------------------------------------------------+---+
> {noformat}
> {noformat}
> DESC EXTENDED tab1
> {noformat}
> {noformat}
> ...
>   Storage(Properties: 
> [path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
> url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
> user=testUser, password=testPass]))|   |
> {noformat}
> {noformat}
> CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url 
> -> jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> 
> TEST.PEOPLE, user -> testUser, password -> testPass)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-10-04 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17783:
---

 Summary: Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a 
PERSISTENT/TEMP Table for JDBC
 Key: SPARK-17783
 URL: https://issues.apache.org/jira/browse/SPARK-17783
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.1.0
Reporter: Xiao Li
Priority: Critical


We should never expose the credentials in the EXPLAIN and DESC 
FORMATTED/EXTENDED commands.

{noformat}
CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
{noformat}

{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateDataSourceTableCommand CatalogTable(
Table: `tab1`
Created: Tue Oct 04 21:39:44 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Storage(Properties: 
[url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass])), false
{noformat}
{noformat}
DESC FORMATTED tab1
{noformat}
{noformat}
...
|# Storage Information   |                                                                   |   |
|Compressed:             |No                                                                 |   |
|Storage Desc Parameters:|                                                                   |   |
|  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
|  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
|  dbtable               |TEST.PEOPLE                                                        |   |
|  user                  |testUser                                                           |   |
|  password              |testPass                                                           |   |
+------------------------+-------------------------------------------------------------------+---+
{noformat}


{noformat}
DESC EXTENDED tab1
{noformat}
{noformat}
...
Storage(Properties: 
[path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass]))|   |
{noformat}

{noformat}
CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
{noformat}
{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url -> 
jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> TEST.PEOPLE, 
user -> testUser, password -> testPass)
{noformat}
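
A minimal sketch of the kind of redaction this calls for: mask sensitive keys before the storage properties are rendered in EXPLAIN/DESC output. The helper name and key list below are illustrative only, not the actual Spark patch.

{code}
object CredentialRedaction {
  // Keys whose values should never be printed verbatim (illustrative list).
  private val sensitiveKeys = Set("password", "user", "url")

  // Returns a copy of the storage properties with sensitive values masked.
  // A real fix would also scrub credentials embedded inside the JDBC URL itself.
  def redact(properties: Map[String, String]): Map[String, String] =
    properties.map { case (key, value) =>
      if (sensitiveKeys.contains(key.toLowerCase)) key -> "###" else key -> value
    }

  def main(args: Array[String]): Unit = {
    val props = Map(
      "url" -> "jdbc:h2:mem:testdb0;user=testUser;password=testPass",
      "dbtable" -> "TEST.PEOPLE",
      "user" -> "testUser",
      "password" -> "testPass")
    // Prints the properties with url, user, and password replaced by "###".
    println(redact(props))
  }
}
{code}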




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-10-04 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17783:

Description: 
We should never expose the credentials in the EXPLAIN and DESC 
FORMATTED/EXTENDED commands. However, the commands below expose the credentials. 

{noformat}
CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
{noformat}

{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateDataSourceTableCommand CatalogTable(
Table: `tab1`
Created: Tue Oct 04 21:39:44 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Storage(Properties: 
[url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass])), false
{noformat}
{noformat}
DESC FORMATTED tab1
{noformat}
{noformat}
...
|# Storage Information   |                                                                   |   |
|Compressed:             |No                                                                 |   |
|Storage Desc Parameters:|                                                                   |   |
|  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
|  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
|  dbtable               |TEST.PEOPLE                                                        |   |
|  user                  |testUser                                                           |   |
|  password              |testPass                                                           |   |
+------------------------+-------------------------------------------------------------------+---+
{noformat}


{noformat}
DESC EXTENDED tab1
{noformat}
{noformat}
...
Storage(Properties: 
[path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass]))|   |
{noformat}

{noformat}
CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
{noformat}
{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url -> 
jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> TEST.PEOPLE, 
user -> testUser, password -> testPass)
{noformat}


  was:
We should never expose the credentials in the EXPLAIN and DESC 
FORMATTED/EXTENDED commands.

{noformat}
CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
{noformat}

{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateDataSourceTableCommand CatalogTable(
Table: `tab1`
Created: Tue Oct 04 21:39:44 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Storage(Properties: 
[url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass])), false
{noformat}
{noformat}
DESC FORMATTED tab1
{noformat}
{noformat}
...
|# Storage Information   |                                                                   |   |
|Compressed:             |No                                                                 |   |
|Storage Desc Parameters:|                                                                   |   |
|  path                  |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
|  url                   |jdbc:h2:mem:testdb0;user=testUser;password=testPass               |   |
|  dbtable               |TEST.PEOPLE                                                        |   |
|  user                  |testUser                                                           |   |
|  password              |testPass                                                           |   |
+------------------------+-------------------------------------------------------------------+---+
{noformat}


{noformat}
DESC EXTENDED tab1
{noformat}
{noformat}
...
Storage(Properties: 
[path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
user=testUser, password=testPass]))|   |
{noformat}

{noformat}
CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
{noformat}
{noformat}
== Physical Plan ==
ExecutedCommand
   +- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url -> 
jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> TEST.PEOPLE, 
user -> testUser, password -> testPass)
{noformat}



> Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP 
> Table for JDBC
> ---
>
> 

[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547617#comment-15547617
 ] 

Dongjoon Hyun commented on SPARK-17328:
---

Thank you, [~hvanhovell]. Sorry for missing your comment.
A few minutes ago, I made a PR for this. It fixes two kinds of error in 
DESCRIBE and EXPLAIN.

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547587#comment-15547587
 ] 

Apache Spark commented on SPARK-17328:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15357

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17328:


Assignee: Apache Spark

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17328:


Assignee: (was: Apache Spark)

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17782) Kafka 010 test is flaky

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17782:


Assignee: Apache Spark

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17782) Kafka 010 test is flaky

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17782:


Assignee: (was: Apache Spark)

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Herman van Hovell
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17782) Kafka 010 test is flaky

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547466#comment-15547466
 ] 

Apache Spark commented on SPARK-17782:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15355

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Herman van Hovell
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17782) Kafka 010 test is flaky

2016-10-04 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-17782:
-

 Summary: Kafka 010 test is flaky
 Key: SPARK-17782
 URL: https://issues.apache.org/jira/browse/SPARK-17782
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Herman van Hovell


The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
We should disable it, and figure out how we can improve it.
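
For reference, ScalaTest can disable a single test by swapping {{test}} for {{ignore}}, which keeps the test compiled but reports it as ignored. The suite below is a placeholder sketch, not the actual DirectKafkaStreamSuite code.

{code}
import org.scalatest.FunSuite

class ExampleStreamSuite extends FunSuite {
  // Temporarily disabled because it is flaky (see SPARK-17782).
  ignore("pattern based subscription") {
    // ... original test body stays here untouched ...
  }

  test("some other stable test") {
    assert(1 + 1 === 2)
  }
}
{code}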



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547434#comment-15547434
 ] 

Apache Spark commented on SPARK-17764:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15354

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> After SPARK-17699, Spark now supports {{from_json}}. It would be nice to 
> have {{to_json}} too, in particular for writing out DataFrames with data 
> sources that do not support nested structured types.
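
A sketch of how the proposed function might be used, mirroring the existing {{from_json}}. The exact signature is an assumption; this is the feature being requested, not an API shipped at the time of this ticket.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object ToJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("to_json sketch").getOrCreate()
    import spark.implicits._

    // Build a DataFrame with a nested struct column, then serialize it to a JSON
    // string column so it can be written by flat-format data sources (e.g. CSV).
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
      .select(struct($"id", $"name").as("record"))

    df.select(to_json($"record").as("json")).show(truncate = false)
    spark.stop()
  }
}
{code}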



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17764:


Assignee: (was: Apache Spark)

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> After SPARK-17699, Spark now supports {{from_json}}. It would be nice to 
> have {{to_json}} too, in particular for writing out DataFrames with data 
> sources that do not support nested structured types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17764:


Assignee: Apache Spark

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> After SPARK-17699, Spark now supports {{from_json}}. It would be nice to 
> have {{to_json}} too, in particular for writing out DataFrames with data 
> sources that do not support nested structured types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column

2016-10-04 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547429#comment-15547429
 ] 

Oscar D. Lara Yejas commented on SPARK-17774:
-

[~shivaram]: I concur with Shivaram. Besides, I already implemented method 
head() in my PR 11336:

https://github.com/apache/spark/pull/11336

If you wanted to implement method head() alone, you'll still need to do all 
changes I did for PR 11336 except for the 5 lines of code of method collect(). 
If that's the case, I'd rather suggest to merge PR 11336.

[~falaki]: In the corner cases where there's no parent DataFrame, we can return 
an empty value as opposed to throwing an error. This behavior is already 
implemented in PR 11336. Also, though R doesn't have method collect(), I think 
it's still useful to turn a Column into an R vector. Perhaps a function called 
as.vector()?

Thanks folks!





> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on a SparkDataFrame they get an error, which is not a good experience.
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17774) Add support for head on DataFrame Column

2016-10-04 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547429#comment-15547429
 ] 

Oscar D. Lara Yejas edited comment on SPARK-17774 at 10/5/16 2:48 AM:
--

I concur with [~shivaram]. Besides, I already implemented method head() in my 
PR 11336:

https://github.com/apache/spark/pull/11336

If you wanted to implement method head() alone, you'll still need to do all 
changes I did for PR 11336 except for the 5 lines of code of method collect(). 
If that's the case, I'd rather suggest to merge PR 11336.

[~falaki]: In the corner cases where there's no parent DataFrame, we can return 
an empty value as opposed to throwing an error. This behavior is already 
implemented in PR 11336. Also, though R doesn't have method collect(), I think 
it's still useful to turn a Column into an R vector. Perhaps a function called 
as.vector()?

Thanks folks!






was (Author: olarayej):
[~shivaram]: I concur with Shivaram. Besides, I already implemented method 
head() in my PR 11336:

https://github.com/apache/spark/pull/11336

If you wanted to implement method head() alone, you'll still need to do all 
changes I did for PR 11336 except for the 5 lines of code of method collect(). 
If that's the case, I'd rather suggest to merge PR 11336.

[~falaki]: In the corner cases where there's no parent DataFrame, we can return 
an empty value as opposed to throwing an error. This behavior is already 
implemented in PR 11336. Also, though R doesn't have method collect(), I think 
it's still useful to turn a Column into an R vector. Perhaps a function called 
as.vector()?

Thanks folks!





> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on a SparkDataFrame they get an error, which is not a good experience.
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17495) Hive hash implementation

2016-10-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17495:
--
Issue Type: Improvement  (was: Bug)

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the hash used by Hive. For queries which use bucketing, this leads to different 
> results if one tries the same query on both engines. We want users to 
> have backward compatibility, so that one can switch parts of applications 
> across the engines without observing regressions.
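
As context, Spark already exposes its Murmur3-based hash as a DataFrame function, which is a quick way to inspect the values that drive Spark-side bucketing. The Hive-compatible hash this ticket adds is not shown here.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.hash

object HashInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("hash inspection").getOrCreate()
    import spark.implicits._

    // hash() is Spark's built-in Murmur3-based hash expression. Hive computes bucket
    // hashes differently, which is why bucketed tables written by the two engines
    // do not line up without a Hive-compatible hash implementation.
    spark.range(5).select($"id", hash($"id").as("murmur3_hash")).show()
    spark.stop()
  }
}
{code}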



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17495) Hive hash implementation

2016-10-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17495.
---
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.1.0

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the hash used by Hive. For queries which use bucketing, this leads to different 
> results if one tries the same query on both engines. We want users to 
> have backward compatibility, so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547358#comment-15547358
 ] 

Herman van Hovell commented on SPARK-17728:
---

The second one does not trigger the behavior because it is turned into a 
LocalRelation; local relations are evaluated during optimization, before we 
collapse projections.

The following in-memory DataFrame should trigger the same behavior:
{noformat}
spark.range(1, 10).withColumn("expensive_udf_result", 
fUdf($"id")).withColumn("b", $"expensive_udf_result" + 100)
{noformat}
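
A self-contained version of that snippet, including the cache() workaround from the issue description. Here {{fUdf}} is a stand-in for the reporter's expensive UDF.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfReevaluationWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf re-evaluation").getOrCreate()
    import spark.implicits._

    // Stand-in for an expensive UDF (sleep simulates the cost per invocation).
    val fUdf = udf { id: Long => Thread.sleep(10); id * 2 }

    // In-memory frame from the comment above: deriving columns from the UDF result
    // can cause the UDF to be evaluated more than once per row.
    val withUdf = spark.range(1, 10).withColumn("expensive_udf_result", fUdf($"id"))

    // Workaround described in the issue: materialize the UDF output once, so later
    // columns are derived from the cached values instead of re-running the UDF.
    withUdf.cache().count()
    withUdf.withColumn("b", $"expensive_udf_result" + 100).show()
    spark.stop()
  }
}
{code}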

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17728) UDFs are run too many times

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547358#comment-15547358
 ] 

Herman van Hovell edited comment on SPARK-17728 at 10/5/16 1:56 AM:


The second one does not trigger the behavior because it is turned into a 
LocalRelation; local relations are evaluated during optimization, before we 
collapse projections.

The following in-memory DataFrame should trigger the same behavior:
{noformat}
spark.range(1, 10)
 .withColumn("expensive_udf_result", fUdf($"id"))
 .withColumn("b", $"expensive_udf_result" + 100)
{noformat}


was (Author: hvanhovell):
The second one does not trigger the behavior because this is turned into a 
LocalRelation, these are evaluated during optimization and before we collapse 
projections.

The following in memory dataframe should trigger the same behavior:
{noformat}
spark.range(1, 10).withColumn("expensive_udf_result", 
fUdf($"id")).withColumn("b", $"expensive_udf_result" + 100)
{noformat}

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547352#comment-15547352
 ] 

Herman van Hovell commented on SPARK-17728:
---

There are different evaluation paths in Spark SQL:
- Interpreted. Expressions are evaluated using an eval(...) method. Plans are 
evaluated using iterators (volcano model). This is what I mean by the 
completely interpreted path.
- Expression code-generated. All expressions are evaluated using a 
code-generated function. Plans are evaluated using iterators.
- Whole-stage code-generated. All expressions and most plans are evaluated 
using code generation.

I think you are using whole-stage code generation. This does not support common 
subexpression elimination.
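
To see which path a query takes, operators prefixed with {{*}} in the explain() output run via whole-stage code generation, and the standard {{spark.sql.codegen.wholeStage}} setting switches it off for comparison. A small spark-shell sketch (assumes the predefined {{spark}} session):

{code}
// Operators prefixed with '*' in the printed plan run inside generated code.
spark.range(1, 10).selectExpr("id * 2 AS doubled").explain()

// Disable whole-stage codegen for this session and plan the same query again.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.range(1, 10).selectExpr("id * 2 AS doubled").explain()
{code}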



> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547300#comment-15547300
 ] 

Herman van Hovell commented on SPARK-17758:
---

That is correct. First keeps track of the fact that its value has been set. 
Last does not.

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>  Labels: correctness
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null.
> My input data contains 3 rows. 
> The application resulted in 2 stages. 
> The first stage consisted of 3 tasks:
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition. 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}
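
A minimal sketch that mirrors the shape of the report, forcing an empty partition before aggregating with last(). Whether it reproduces the NULL depends on the plan Spark picks and on how the rows land in partitions, so treat it as a starting point rather than a guaranteed repro.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.last

object LastOnEmptyPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("last on empty partition").getOrCreate()
    import spark.implicits._

    // 3 rows spread over 4 partitions, so at least one partition is empty,
    // mirroring the 2/1/0 row layout described in the report.
    val df = spark.range(0, 3).repartition(4)

    // Per the report, the partial_last computed on an empty partition can
    // surface as NULL in the final merged result.
    df.agg(last($"id").as("last_id")).show()
    spark.stop()
  }
}
{code}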



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547292#comment-15547292
 ] 

Franck Tago commented on SPARK-17758:
-

My first impression after taking a look at First.scala leads me to conclude 
that the issue depicted here should not happen for the First aggregate 
function call.

Is that correct?

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>  Labels: correctness
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null.
> My input data contains 3 rows. 
> The application resulted in 2 stages. 
> The first stage consisted of 3 tasks:
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition. 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{code}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{code}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) {
  return(x$date)
})
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Affects Version/s: 2.0.0

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Affects Version/s: (was: 2.0.1)

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) { return(x$date) })
{{code}}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {{code}}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function( x ) { return(x$date) })
> {{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17781:
---
Description: 
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) {
  return(x$date)
})
{{code}}

  was:
When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function( x ) { return(x$date) })
{{code}}


> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {{code}}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function( x ) {
>   return(x$date)
> })
> {{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17775) pyspark: take(num) failed, but collect() worked for big dataset

2016-10-04 Thread Rick Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547257#comment-15547257
 ] 

Rick Lin commented on SPARK-17775:
--

Yes, I will try a new version.

> pyspark: take(num) failed, but collect() worked for big dataset
> ---
>
> Key: SPARK-17775
> URL: https://issues.apache.org/jira/browse/SPARK-17775
> Project: Spark
>  Issue Type: Bug
> Environment: Spark:1.6.1
> Python 2.7.12 :: Anaconda 4.1.1 (64-bit)
> Windows 7
> One machine
>Reporter: Rick Lin
>
> Hi, all:
> I ran a dataset with 39,501 rows drawn from a PostgreSQL table in 
> pyspark.
> The code was:
> cur1.execute("select id from users")
> users = cur1.fetchall()
> users_rdd = sc.parallelize(users)
> users_rdd.take(1)
> where the error message was:
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> However, when I changed take(1) to collect(), it worked: 
> [[25],
>  [1439],
> ...
> ]
> When I ran the same code on a small dataset, both take(1) and collect() 
> worked.
> I don't know why this happens or how to fix it for a big dataset.
> Could you help me deal with this problem?
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17774) Add support for head on DataFrame Column

2016-10-04 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547232#comment-15547232
 ] 

Hossein Falaki commented on SPARK-17774:


Putting implementation aside, throwing an error for {{head(df$col)}} is a bad 
user experience. For the corner cases where a user calls {{head}} on a column 
that does not belong to any DataFrame, we can throw an appropriate error. Before I 
submit a PR, I would like to get consensus here. 

Also, I think {{head}} is more important than {{collect}} because, unlike {{head}}, 
{{collect}} is not an existing R function.

CC [~marmbrus] [~sunrui] [~felixcheung] [~olarayej]

> Add support for head on DataFrame Column
> 
>
> Key: SPARK-17774
> URL: https://issues.apache.org/jira/browse/SPARK-17774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> There was a lot of discussion on SPARK-9325. To summarize the conversation on 
> that ticket regarding {{collect}}:
> * Pro: Ease of use and maximum compatibility with existing R API
> * Con: We do not want to increase maintenance cost by opening arbitrary API. 
> With Spark's DataFrame API {{collect}} does not work on {{Column}} and there 
> is no need for it to work in R.
> This ticket is strictly about {{head}}. I propose supporting {{head}} on 
> {{Column}} because:
> 1. R users are already used to calling {{head(iris$Sepal.Length)}}. When they 
> do that on a SparkDataFrame they get an error. Not a good experience.
> 2. Adding support for it does not require any change to the backend. It can 
> be trivially done in R code. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2016-10-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547216#comment-15547216
 ] 

Yin Huai commented on SPARK-10063:
--

[~ste...@apache.org] I took a quick look at hadoop 1 
(https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L111)
 and hadoop 2 
(https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L326).
 It seems that Hadoop 1 actually uses algorithm 2. Is my understanding correct?

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, and then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.
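
As an aside for readers following the committer discussion: with the standard 
FileOutputCommitter (as opposed to DirectParquetOutputCommitter), the algorithm 
version is driven by a Hadoop property. A minimal sketch, assuming Hadoop 2.7+ 
where that property exists; the app name and output path are illustrative only, 
and this is not code from the ticket.

{code}
import org.apache.spark.sql.SparkSession

// Algorithm version 1 commits task output into a job temporary directory and
// only promotes it at job commit; version 2 moves task output to the final
// location already at task commit.
val spark = SparkSession.builder()
  .appName("committer-config-sketch")  // illustrative name
  // "spark.hadoop."-prefixed keys are copied into the Hadoop Configuration.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()

spark.range(100).write.mode("overwrite").parquet("/tmp/committer-test")
{code}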



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17724) Unevaluated new lines in tooltip in DAG Visualization of a job

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17724:


Assignee: (was: Apache Spark)

> Unevaluated new lines in tooltip in DAG Visualization of a job
> --
>
> Key: SPARK-17724
> URL: https://issues.apache.org/jira/browse/SPARK-17724
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: 
> spark-webui-job-details-dagvisualization-newlines-broken.png
>
>
> The tooltips in DAG Visualization for a job show new lines verbatim 
> (unevaluated).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17724) Unevaluated new lines in tooltip in DAG Visualization of a job

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17724:


Assignee: Apache Spark

> Unevaluated new lines in tooltip in DAG Visualization of a job
> --
>
> Key: SPARK-17724
> URL: https://issues.apache.org/jira/browse/SPARK-17724
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
> Attachments: 
> spark-webui-job-details-dagvisualization-newlines-broken.png
>
>
> The tooltips in DAG Visualization for a job show new lines verbatim 
> (unevaluated).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17724) Unevaluated new lines in tooltip in DAG Visualization of a job

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547204#comment-15547204
 ] 

Apache Spark commented on SPARK-17724:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/15353

> Unevaluated new lines in tooltip in DAG Visualization of a job
> --
>
> Key: SPARK-17724
> URL: https://issues.apache.org/jira/browse/SPARK-17724
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: 
> spark-webui-job-details-dagvisualization-newlines-broken.png
>
>
> The tooltips in DAG Visualization for a job show new lines verbatim 
> (unevaluated).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10367) Support Parquet logical type INTERVAL

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10367:


Assignee: Apache Spark

> Support Parquet logical type INTERVAL
> -
>
> Key: SPARK-10367
> URL: https://issues.apache.org/jira/browse/SPARK-10367
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10367) Support Parquet logical type INTERVAL

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547193#comment-15547193
 ] 

Apache Spark commented on SPARK-10367:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15334

> Support Parquet logical type INTERVAL
> -
>
> Key: SPARK-10367
> URL: https://issues.apache.org/jira/browse/SPARK-10367
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10367) Support Parquet logical type INTERVAL

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10367:


Assignee: (was: Apache Spark)

> Support Parquet logical type INTERVAL
> -
>
> Key: SPARK-10367
> URL: https://issues.apache.org/jira/browse/SPARK-10367
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10364:


Assignee: (was: Apache Spark)

> Support Parquet logical type TIMESTAMP_MILLIS
> -
>
> Key: SPARK-10364
> URL: https://issues.apache.org/jira/browse/SPARK-10364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we 
> should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. 
> But unfortunately parquet-mr hasn't supported it yet.
> For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet 
> values and pad a 0 microsecond part to read values.
> For the write path, currently we are writing timestamps as {{INT96}}, similar 
> to Impala and Hive. One alternative is that, we can have a separate SQL 
> option to let users be able to write Spark SQL timestamp values as 
> {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be 
> truncated.
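
For readers less familiar with the precision gap described above, the padding and 
truncation are just a factor-of-1000 conversion. An illustrative Scala sketch, not 
Spark's actual reader or writer code:

{code}
// Read path: pad a Parquet TIMESTAMP_MILLIS value out to the microsecond
// precision Spark SQL uses internally. Write path: truncate back to millis.
def millisToMicros(millis: Long): Long = millis * 1000L
def microsToMillis(micros: Long): Long = micros / 1000L

val parquetMillis = 1475625584123L                 // an example epoch-millisecond value
val catalystMicros = millisToMicros(parquetMillis) // 1475625584123000L
assert(microsToMillis(catalystMicros) == parquetMillis)
{code}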



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547189#comment-15547189
 ] 

Apache Spark commented on SPARK-10364:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15332

> Support Parquet logical type TIMESTAMP_MILLIS
> -
>
> Key: SPARK-10364
> URL: https://issues.apache.org/jira/browse/SPARK-10364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we 
> should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. 
> But unfortunately parquet-mr hasn't supported it yet.
> For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet 
> values and pad a 0 microsecond part to read values.
> For the write path, currently we are writing timestamps as {{INT96}}, similar 
> to Impala and Hive. One alternative is that, we can have a separate SQL 
> option to let users be able to write Spark SQL timestamp values as 
> {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be 
> truncated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10364:


Assignee: Apache Spark

> Support Parquet logical type TIMESTAMP_MILLIS
> -
>
> Key: SPARK-10364
> URL: https://issues.apache.org/jira/browse/SPARK-10364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we 
> should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. 
> But unfortunately parquet-mr hasn't supported it yet.
> For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet 
> values and pad a 0 microsecond part to read values.
> For the write path, currently we are writing timestamps as {{INT96}}, similar 
> to Impala and Hive. One alternative is that, we can have a separate SQL 
> option to let users be able to write Spark SQL timestamp values as 
> {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be 
> truncated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547107#comment-15547107
 ] 

Herman van Hovell commented on SPARK-17328:
---

Go ahead. The actual fix is a one-liner: just add {{TABLE?}} to the 
{{#describeTable}} rule.

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547102#comment-15547102
 ] 

Dongjoon Hyun commented on SPARK-17328:
---

Thank YOU, [~ja...@japila.pl]!

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547096#comment-15547096
 ] 

Jacek Laskowski commented on SPARK-17328:
-

Sure! Go for it! Thanks.

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel

2016-10-04 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547073#comment-15547073
 ] 

Mridul Muralidharan commented on SPARK-17064:
-


I agree, interrupting a thread can (and usually does) have unintended side 
effects.
On the plus side, our current model does mean that any actual processing of a tuple 
will cause the task to get killed ...

Perhaps we can check this state in the shuffle fetch as well, to further avoid 
expensive prefix operations (data fetch, etc.)?
This will of course not catch the case of user code doing something extremely 
long-running and expensive while processing a single tuple, but then there is 
no guarantee thread interruption will work in that case anyway.

> Reconsider spark.job.interruptOnCancel
> --
>
> Key: SPARK-17064
> URL: https://issues.apache.org/jira/browse/SPARK-17064
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks.  
> This has been recognized for a very long time (see, e.g., the ancient TODO 
> comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've 
> never had more than an incomplete solution.  Killing running Tasks at the 
> Executor level has been implemented by interrupting the threads running the 
> Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since 
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3
>  addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting 
> Task threads in this way has only been possible if interruptThread is true, 
> and that typically comes from the setting of the interruptOnCancel property 
> in the JobGroup, which in turn typically comes from the setting of 
> spark.job.interruptOnCancel.  Because of concerns over 
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes 
> being marked dead when a Task thread is interrupted, the default value of the 
> boolean has been "false" -- i.e. by default we do not interrupt Tasks already 
> running on Executor even when the Task has been canceled in the DAGScheduler, 
> or the Stage has been aborted, or the Job has been killed, etc.
> There are several issues resulting from this current state of affairs, and 
> they each probably need to spawn their own JIRA issue and PR once we decide 
> on an overall strategy here.  Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS 
> versions that Spark now supports so that we can set the default value of 
> spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it 
> still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still 
> need protection similar to what the current default value of 
> spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing 
> Tasks, there is still the question of whether we _should_ end executing 
> Tasks.  While that is likely a good thing to do in cases where individual 
> Tasks are lightweight in terms of resource usage, at least in some cases not 
> all running Tasks should be ended: https://github.com/apache/spark/pull/12436 
>  That means that we probably need to continue to make allowing Task 
> interruption configurable at the Job or JobGroup level (and we need better 
> documentation explaining how and when to allow interruption or not.)
> * There is one place in the current code 
> (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to 
> "true".  This should be fixed, and similar misuses of killTask be denied in 
> pull requests until this issue is adequately resolved.   
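
For context, the per-JobGroup flag the description refers to is already exposed on 
SparkContext. A minimal sketch of its use, assuming nothing about how the open 
questions above get resolved (the group name and workload are made up):

{code}
import org.apache.spark.SparkContext

def runCancellable(sc: SparkContext): Unit = {
  // The third argument is what eventually becomes interruptThread when the
  // group is cancelled.
  sc.setJobGroup("etl-group", "long-running ETL", interruptOnCancel = true)
  val job = sc.parallelize(1 to 1000000).map { i => Thread.sleep(1); i }.countAsync()
  // Because interruptOnCancel was set, cancelling the group interrupts the
  // threads of already-running tasks instead of merely flagging them as killed.
  sc.cancelJobGroup("etl-group")
}
{code}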



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17328) NPE with EXPLAIN DESCRIBE TABLE

2016-10-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547065#comment-15547065
 ] 

Dongjoon Hyun commented on SPARK-17328:
---

Hi, [~ja...@japila.pl] and [~hvanhovell].
May I create a PR for this?

> NPE with EXPLAIN DESCRIBE TABLE
> ---
>
> Key: SPARK-17328
> URL: https://issues.apache.org/jira/browse/SPARK-17328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's build:
> {code}
> scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
> INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:88)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:88)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> while the following executes fine:
> {code}
> scala> sql("describe table x").explain
> INFO SparkSqlParser: Parsing command: describe table x
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe table x
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:569)
>   ... 48 elided
> {code}
> I think it's related to the condition in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L262.
> If guided I'd like to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17452) Spark 2.0.0 is not supporting the "partition" keyword on a "describe" statement when using Hive Support

2016-10-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17452.
-
Resolution: Duplicate

Hi, [~hvivani]. 
This was resolved in SPARK-17612.

> Spark 2.0.0 is not supporting the "partition" keyword on a "describe" 
> statement when using Hive Support
> ---
>
> Key: SPARK-17452
> URL: https://issues.apache.org/jira/browse/SPARK-17452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Amazon EMR 5.0.0
>Reporter: Hernan Vivani
>
> Changes introduced in Spark 2 do not support the "partition" keyword on a 
> "describe" statement.
> EMR 5 (Spark 2.0):
> ==
> scala> import org.apache.spark.sql.SparkSession
> scala> val 
> sess=SparkSession.builder().appName("test").enableHiveSupport().getOrCreate()
> scala> sess.sql("describe formatted page_view partition (dt='2008-06-08', 
> country='AR')").show 
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> describe formatted page_view partition (dt='2008-06-08', country='AR')
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided
> Same statement is working fine on Spark 1.6.2 and Spark 1.5.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-04 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17781:
--

 Summary: datetime is serialized as double inside dapply()
 Key: SPARK-17781
 URL: https://issues.apache.org/jira/browse/SPARK-17781
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


When we ship a SparkDataFrame to workers for dapply family functions, inside 
the worker DateTime objects are serialized as double.

To reproduce:
{{code}}
df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
dapplyCollect(df, function(x) { return(x$date) })
{{code}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17775) pyspark: take(num) failed, but collect() worked for big dataset

2016-10-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547004#comment-15547004
 ] 

Dongjoon Hyun commented on SPARK-17775:
---

Hi, [~Ricklin].
I'm not sure what the problem is, but could you try with the latest versions 
of Spark, e.g., 1.6.2, 2.0.0, or 2.0.1?
You can try them in Databricks Community Edition, 
https://community.cloud.databricks.com .

> pyspark: take(num) failed, but collect() worked for big dataset
> ---
>
> Key: SPARK-17775
> URL: https://issues.apache.org/jira/browse/SPARK-17775
> Project: Spark
>  Issue Type: Bug
> Environment: Spark:1.6.1
> Python 2.7.12 :: Anaconda 4.1.1 (64-bit)
> Windows 7
> One machine
>Reporter: Rick Lin
>
> Hi, all:
> I ran a dataset with 39,501 rows drawn from a PostgreSQL table in 
> pyspark.
> The code was:
> cur1.execute("select id from users")
> users = cur1.fetchall()
> users_rdd = sc.parallelize(users)
> users_rdd.take(1)
> where the error message was:
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> However, when I changed take(1) to collect(), it worked: 
> [[25],
>  [1439],
> ...
> ]
> When I ran the same code on a small dataset, both take(1) and collect() 
> worked.
> I don't know why this happens or how to fix it for a big dataset.
> Could you help me deal with this problem?
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17780:
-
Component/s: SQL

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546959#comment-15546959
 ] 

Apache Spark commented on SPARK-17780:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15352

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17780:


Assignee: (was: Apache Spark)

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17779) starting dse start-spark-sql-thriftserver

2016-10-04 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546977#comment-15546977
 ] 

Dongjoon Hyun commented on SPARK-17779:
---

Hi, [~epedregosa].

Did you try the following command from the manual, too?

{code}
dse spark-sql-thriftserver start
{code}

BTW, a question about the usage of DSE does not really belong in the Apache 
Spark issue tracker. :)

> starting dse start-spark-sql-thriftserver 
> --
>
> Key: SPARK-17779
> URL: https://issues.apache.org/jira/browse/SPARK-17779
> Project: Spark
>  Issue Type: Question
>  Components: Build
> Environment: Oracle Linux Server release 6.5
> DSE 4.5.9
>Reporter: Eric Pedregosa
>Priority: Critical
>
> Hi,
> I am trying to install Spark SQL Thrift server against DSE 4.5.9.  I'm trying 
> to follow this link  
> -http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/spark/sparkSqlThriftServer.html,
> I have trouble starting the command 'dse start-spark-sql-thriftserver'.  It seems 
> like start-spark-sql-thriftserver is not an available option for the dse 
> command.   Did I miss any step in installing Spark for DSE?  
> [openplatform@ushapls00056la /]$ dse start-spark-sql-thriftserver
> /usr/bin/dse: [-a  -b  -v] | cassandra [options] 
> | cassandra-stop [options] | hadoop [options] | hive [options] | beeline | 
> pig [options] | sqoop [options] | mahout [options] | spark [options] | 
> spark-with-cc [options] | spark-class [options] | spark-submit [options] | 
> spark-class-with-cc [options] | spark-schema [options] | pyspark [options] | 
> shark [options] | hive-schema [options] | esri-import [options] | 
> hive-metastore-migrate [options]
> Thanks.
> Eric



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17780:


Assignee: Apache Spark

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17780:


 Summary: Report NoClassDefFoundError in StreamExecution
 Key: SPARK-17780
 URL: https://issues.apache.org/jira/browse/SPARK-17780
 Project: Spark
  Issue Type: Test
Reporter: Shixiong Zhu


When using an incompatible source for structured streaming, it may throw 
NoClassDefFoundError. It's better to catch and report to the user.
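
A sketch of the general approach only, not the change in the pull request referenced 
earlier; the wrapper name and message are illustrative. The point is simply to catch 
the linkage error near source resolution and rethrow something actionable:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.DataStreamReader

// Illustrative only: surface a NoClassDefFoundError from an incompatible
// streaming source as a clearer error instead of a raw linkage error.
def startSafely(reader: DataStreamReader, path: String): DataFrame = {
  try {
    reader.load(path)
  } catch {
    case e: NoClassDefFoundError =>
      throw new IllegalStateException(
        "The streaming source seems to be compiled against an incompatible " +
          "Spark version; check the data source package on the classpath.", e)
  }
}
{code}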



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17780:
-
Issue Type: Improvement  (was: Test)

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17780) Report NoClassDefFoundError in StreamExecution

2016-10-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17780:
-
Affects Version/s: 2.0.1

> Report NoClassDefFoundError in StreamExecution
> --
>
> Key: SPARK-17780
> URL: https://issues.apache.org/jira/browse/SPARK-17780
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> When using an incompatible source for structured streaming, it may throw 
> NoClassDefFoundError. It's better to catch and report to the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546911#comment-15546911
 ] 

Herman van Hovell commented on SPARK-17074:
---

I would prefer modifying (extending) QuantileSummaries over two table scans. It 
depends on how hard this is to do, though; we shouldn't block progress on the 
rest of the CBO effort too much.

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the 
> equi-height histogram when the number of distinct values is equal to or greater 
> than 254.  
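
As a rough illustration of the QuantileSummaries-based option mentioned in the 
comment above, equi-height bucket boundaries can be derived from approxQuantile; 
the helper function, its parameters, and the relative error are illustrative only, 
not from any PR:

{code}
import org.apache.spark.sql.DataFrame

// Interior boundaries at probabilities 1/n, 2/n, ..., (n-1)/n give buckets
// holding roughly the same number of rows (equi-height).
def equiHeightBoundaries(df: DataFrame, col: String, numBuckets: Int): Array[Double] = {
  val probs = (1 until numBuckets).map(_.toDouble / numBuckets).toArray
  df.stat.approxQuantile(col, probs, 0.01)  // 0.01 = relative error, illustrative
}
{code}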



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546904#comment-15546904
 ] 

Apache Spark commented on SPARK-17612:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15351

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17608) Long type has incorrect serialization/deserialization

2016-10-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546889#comment-15546889
 ] 

Shivaram Venkataraman commented on SPARK-17608:
---

I think the loss of precision is orthogonal to the problem of maintaining the 
same schema as we go from R -> JVM -> R. In this case, for long data, we need 
some way to look at the schema and then indicate that the doubles need to be 
sent with "type = long" in serialize.R; conversely, in SerDe.scala we need 
to know that while reading longs we will be getting doubles. Would this solve 
your problem, [~iamthomaspowell]?
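
To make the tagging idea concrete, here is a toy round trip, not SparkR's actual 
serialize.R or SerDe.scala code: the value always travels as a double, and the 
declared schema type ("bigint") tells the JVM side to narrow it back to a Long.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// R numerics always travel as doubles, tagged with a type byte.
def writeRNumeric(out: DataOutputStream, value: Double): Unit = {
  out.writeByte('d')
  out.writeDouble(value)
}

// The declared column type is bigint, so the reader narrows the double.
def readAsLong(in: DataInputStream): Long = {
  require(in.readByte() == 'd'.toByte, "expected a double-tagged value from R")
  in.readDouble().toLong
}

val buf = new ByteArrayOutputStream()
writeRNumeric(new DataOutputStream(buf), 1234567890123.0)
val back = readAsLong(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
// back == 1234567890123L; exact for integers up to 2^53.
{code}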

> Long type has incorrect serialization/deserialization
> -
>
> Key: SPARK-17608
> URL: https://issues.apache.org/jira/browse/SPARK-17608
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>
> Am hitting issues when using {{dapply}} on a data frame that contains a 
> {{bigint}} in its schema. When this is converted to a SparkR data frame a 
> "bigint" gets converted to a R {{numeric}} type: 
> https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25.
> However, the R {{numeric}} type gets converted to 
> {{org.apache.spark.sql.types.DoubleType}}: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97.
> The two directions therefore aren't compatible. If I use the same schema when 
> using dapply (and just an identity function) I will get type collisions 
> because the output type is a double but the schema expects a bigint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17634) Spark job hangs when using dapply

2016-10-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546871#comment-15546871
 ] 

Shivaram Venkataraman commented on SPARK-17634:
---

[~iamthomaspowell] Thanks for the performance profiling and it would be great 
to have a PR with the code change you have above. I think there are a couple of 
issues that I would like to understand better:

(a) The base cost of 2 seconds for any num rows < 1: does this come from 
launching the R process, or is there some other bottleneck here? Given that 
readTypedObject and readType are called a linear number of times, I think this 
comes down to memory allocation or something like that.

(b) Reads slowing down as num rows increases: does this problem get addressed 
by your change? Other than the memory allocation bit, I think this could also 
happen if the OS is out of memory and the R process is thrashing or 
something like that.

> Spark job hangs when using dapply
> -
>
> Key: SPARK-17634
> URL: https://issues.apache.org/jira/browse/SPARK-17634
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>Priority: Critical
>
> I'm running into an issue when using dapply on YARN. I have a data frame 
> backed by around 200 Parquet files totalling roughly 2GB. When I load this in 
> with the new partition coalescing it ends up having around 20 partitions, so 
> each one is roughly 100MB. The data frame itself has 4 columns of integers 
> and doubles. If I run a count over this, things work fine.
> However, if I add a {{dapply}} between the read and the {{count}} that just 
> uses an identity function, the tasks hang and make no progress. Both the R 
> and Java processes are running on the Spark nodes and are listening on the 
> {{SPARKR_WORKER_PORT}}.
> {{result <- dapply(df, function(x){x}, SparkR::schema(df))}}
> I took a jstack of the Java process and see that it is just listening on the 
> socket but never seems to make any progress. It is harder to tell what the R 
> process is doing.
> {code}
> Thread 112823: (state = IN_NATIVE)
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], 
> int, int, int) @bci=0 (Interpreted frame)
>  - java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, 
> int, int) @bci=8, line=116 (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=170 
> (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 
> (Interpreted frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Interpreted frame)
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>  - org.apache.spark.api.r.RRunner.org$apache$spark$api$r$RRunner$$read() 
> @bci=4, line=212 (Interpreted frame)
>  - 
> org.apache.spark.api.r.RRunner$$anon$1.<init>(org.apache.spark.api.r.RRunner) 
> @bci=25, line=96 (Interpreted frame)
>  - org.apache.spark.api.r.RRunner.compute(scala.collection.Iterator, int) 
> @bci=109, line=87 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(scala.collection.Iterator)
>  @bci=322, line=59 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(java.lang.Object)
>  @bci=5, line=29 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(scala.collection.Iterator)
>  @bci=59, line=178 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(java.lang.Object)
>  @bci=5, line=175 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(org.apache.spark.TaskContext,
>  int, scala.collection.Iterator) @bci=8, line=784 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(java.lang.Object,
>  java.lang.Object, java.lang.Object) @bci=13, line=784 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=27, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 

[jira] [Commented] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546799#comment-15546799
 ] 

Ameen Tayyebi commented on SPARK-17777:
---

Thanks for the comment. Can you please clarify why you think this should
not be possible?

Interestingly, this works in local mode but hangs in YARN mode. I'd
expect the same API calls to behave identically (semantically) irrespective
of which mode you run Spark on. If this is not supported, shouldn't we
expect a hang in all cases?

On Oct 4, 2016 2:50 PM, "Sean Owen (JIRA)"  wrote:


[ https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546284#comment-15546284 ]

Sean Owen commented on SPARK-17777:
---

I think you've demonstrated that this doesn't work, and I wouldn't say
that's supposed to work. The reference you give isn't the same thing in
that it doesn't lead to this hang / infinite loop. So I think the answer
is, you can't do that.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-17777
> URL: https://issues.apache.org/jira/browse/SPARK-17777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
> Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> I've attached repro.scala which can simply be pasted in spark-shell to 
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? Well, 
> we have an RDD that is composed of several thousands of Parquet files. To 
> compute the partitioning strategy for this RDD, we create an RDD to read all 
> file sizes from S3 in parallel, so that we can quickly determine the proper 
> partitions. We do this to avoid executing this serially from the master node 
> which can result in significant slowness in the execution. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, 
> s3.getObjectSummary)).collect()
> A similar logic is used in DataFrame by Spark itself:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>  
> Thanks,
> -Ameen
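
To make the reported pattern concrete, here is a minimal, hypothetical sketch of an RDD whose getPartitions launches its own Spark job via sc.parallelize(...).collect(). The class name, the SimplePartition helper, and the size lookup are placeholders (the attached repro.scala and the real S3 call are not reproduced here); the only point is that computing this RDD's partitions schedules a second job from inside partition planning, which is the "recursive" step under discussion.

{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Placeholder partition type used by the sketch below.
case class SimplePartition(index: Int) extends Partition

// Hypothetical RDD mirroring the described pseudo-code: getPartitions itself
// runs a job (sc.parallelize(...).collect()) to size up the input files.
class FileBackedRDD(sc: SparkContext, filePaths: Seq[String])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    // Stand-in for the S3 object-size lookup in the reporter's pseudo-code.
    val splitInfo = sc.parallelize(filePaths)
      .map(path => (path, path.length.toLong))
      .collect()
    splitInfo.indices.map(i => SimplePartition(i): Partition).toArray
  }

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.single(filePaths(split.index))
}
{code}

Whether this nesting is supposed to be supported at all is exactly the question in this ticket; the sketch only shows where the nested job submission happens.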



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-10-04 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546792#comment-15546792
 ] 

Dhruve Ashar commented on SPARK-17417:
--

[~srowen] As far as I understand the checkpointing mechanism in Spark core, recovery of an 
RDD from a checkpoint is limited to a single application attempt. The Spark Streaming docs 
mention that it can recover metadata/RDDs from checkpointed data across 
application attempts. Please correct me if I have missed something here. With 
this understanding, it wouldn't be necessary to parse the code for the old 
format, as recovery would be done with the same Spark jar that was used to 
launch the application.

Also, why is it that we are not cleaning up the checkpointed directory on 
sc.close?

> Fix # of partitions for RDD while checkpointing - Currently limited by 
> 10000(%05d)
> --
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes the # of partitions to be less than 100000 and uses 
> %05d padding. 
> If we exceed this number, the sort logic in ReliableCheckpointRDD gets messed 
> up and fails. This is because part-files are sorted and compared as strings. 
> This leads the filename order to be part-10000, part-100000, ... instead of 
> part-10000, part-10001, ..., part-100000, and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions or
> - Sort the part files extracting a sub-portion as string and then verify the 
> RDD
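
The string-versus-numeric ordering problem can be reproduced without Spark at all. Here is a small standalone Scala sketch (the file names are synthetic) showing that once an index outgrows the %05d padding, lexicographic sorting interleaves the 6-digit names with the 5-digit ones, while sorting on the parsed index restores the intended order:

{code}
object PartFileSortSketch {
  def main(args: Array[String]): Unit = {
    // With %05d padding, partition 100000 is the first name with 6 digits.
    val names = Seq(99998, 99999, 100000, 100001).map(i => f"part-$i%05d")

    // Lexicographic (string) sort: the 6-digit names jump ahead of part-99998.
    println(names.sorted)
    // -> List(part-100000, part-100001, part-99998, part-99999)

    // Sorting on the parsed partition index gives the intended order.
    println(names.sortBy(_.stripPrefix("part-").toInt))
    // -> List(part-99998, part-99999, part-100000, part-100001)
  }
}
{code}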



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-10-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15390.

Resolution: Fixed

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.1
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 

[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-10-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15390:
---
Fix Version/s: 2.0.1

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.1
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 

[jira] [Comment Edited] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-10-04 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546675#comment-15546675
 ] 

Davies Liu edited comment on SPARK-15390 at 10/4/16 9:11 PM:
-

@Iulian Dragos I think this is a different issue, fixed by 
https://github.com/apache/spark/pull/14373 in 2.0.1.


was (Author: davies):
@Iulian Dragos I think this is a different issue, fixed by 
https://github.com/apache/spark/pull/14373 and 
https://github.com/apache/spark/pull/14464/files in 2.0.1.

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.1
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> 

[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-10-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15390:
---
Fix Version/s: (was: 2.0.0)

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.1
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 

[jira] [Commented] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-10-04 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546675#comment-15546675
 ] 

Davies Liu commented on SPARK-15390:


@Iulian Dragos I think this is a different issue, fixed by 
https://github.com/apache/spark/pull/14373 and 
https://github.com/apache/spark/pull/14464/files in 2.0.1.

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> 

[jira] [Created] (SPARK-17779) starting dse start-spark-sql-thriftserver

2016-10-04 Thread Eric Pedregosa (JIRA)
Eric Pedregosa created SPARK-17779:
--

 Summary: starting dse start-spark-sql-thriftserver 
 Key: SPARK-17779
 URL: https://issues.apache.org/jira/browse/SPARK-17779
 Project: Spark
  Issue Type: Question
  Components: Build
 Environment: Oracle Linux Server release 6.5
DSE 4.5.9
Reporter: Eric Pedregosa
Priority: Critical


Hi,

I am trying to install the Spark SQL Thrift server against DSE 4.5.9, following this link: 
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/spark/sparkSqlThriftServer.html

I have trouble running the command 'dse start-spark-sql-thriftserver'. It seems 
start-spark-sql-thriftserver is not an available option for the dse command. 
Did I miss a step in installing Spark for DSE?

[openplatform@ushapls00056la /]$ dse start-spark-sql-thriftserver
/usr/bin/dse: [-a  -b  -v] | cassandra [options] | 
cassandra-stop [options] | hadoop [options] | hive [options] | beeline | pig 
[options] | sqoop [options] | mahout [options] | spark [options] | 
spark-with-cc [options] | spark-class [options] | spark-submit [options] | 
spark-class-with-cc [options] | spark-schema [options] | pyspark [options] | 
shark [options] | hive-schema [options] | esri-import [options] | 
hive-metastore-migrate [options]

Thanks.
Eric





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17778) Mock SparkContext to reduce memory usage of BlockManagerSuite

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546618#comment-15546618
 ] 

Apache Spark commented on SPARK-17778:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15350

> Mock SparkContext to reduce memory usage of BlockManagerSuite
> -
>
> Key: SPARK-17778
> URL: https://issues.apache.org/jira/browse/SPARK-17778
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17778) Mock SparkContext to reduce memory usage of BlockManagerSuite

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17778:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Mock SparkContext to reduce memory usage of BlockManagerSuite
> -
>
> Key: SPARK-17778
> URL: https://issues.apache.org/jira/browse/SPARK-17778
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17778) Mock SparkContext to reduce memory usage of BlockManagerSuite

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17778:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Mock SparkContext to reduce memory usage of BlockManagerSuite
> -
>
> Key: SPARK-17778
> URL: https://issues.apache.org/jira/browse/SPARK-17778
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17778) Mock SparkContext to reduce memory usage of BlockManagerSuite

2016-10-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17778:


 Summary: Mock SparkContext to reduce memory usage of 
BlockManagerSuite
 Key: SPARK-17778
 URL: https://issues.apache.org/jira/browse/SPARK-17778
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.0.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Brian Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546365#comment-15546365
 ] 

Brian Cho commented on SPARK-16827:
---

Sure, having the actual spill metrics is something we're interested in as well. 
I'd like to work on it, but I might not get to it immediately.

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: stage 2 
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even though the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17239) User guide for multiclass logistic regression in spark.ml

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546331#comment-15546331
 ] 

Apache Spark commented on SPARK-17239:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15349

> User guide for multiclass logistic regression in spark.ml
> -
>
> Key: SPARK-17239
> URL: https://issues.apache.org/jira/browse/SPARK-17239
> Project: Spark
>  Issue Type: Documentation
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17239) User guide for multiclass logistic regression in spark.ml

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17239:


Assignee: Apache Spark  (was: Seth Hendrickson)

> User guide for multiclass logistic regression in spark.ml
> -
>
> Key: SPARK-17239
> URL: https://issues.apache.org/jira/browse/SPARK-17239
> Project: Spark
>  Issue Type: Documentation
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17239) User guide for multiclass logistic regression in spark.ml

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17239:


Assignee: Seth Hendrickson  (was: Apache Spark)

> User guide for multiclass logistic regression in spark.ml
> -
>
> Key: SPARK-17239
> URL: https://issues.apache.org/jira/browse/SPARK-17239
> Project: Spark
>  Issue Type: Documentation
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546330#comment-15546330
 ] 

Reynold Xin commented on SPARK-16827:
-

We should probably separate the on-disk spill from the shuffle size. Would you 
have time to work on that?


> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: stage 2 
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even though the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17758:
--
Labels: correctness  (was: )

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>  Labels: correctness
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null. 
> My input data contains 3 rows. 
> The application resulted in 2 stages. 
> The first stage consisted of 3 tasks. 
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition. 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}
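
For context, here is a minimal sketch of the setup described above, with hypothetical data and column names: three rows repartitioned so that at least one partition is empty, then aggregated with last(). Depending on which partition ends up empty and how the partial results are merged, this may or may not surface the NULL the reporter observed on the affected version; it is only meant to show the shape of the repro, not to assert the outcome.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{last, max}

object LastOnEmptyPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("last-on-empty-partition")
      .getOrCreate()
    import spark.implicits._

    // Three rows spread over four partitions, so at least one partition is empty.
    val df = Seq((1, "a"), (2, "b"), (3, "c"))
      .toDF("field1", "field")
      .repartition(4)

    // The reporter's claim is that partial_last on the empty partition yields
    // null, which then survives the final merge and is returned as the result.
    df.agg(max($"field1"), last($"field")).show()

    spark.stop()
  }
}
{code}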



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17734) inner equi-join shorthand that returns Datasets, like DataFrame already has

2016-10-04 Thread Leif Warner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546303#comment-15546303
 ] 

Leif Warner commented on SPARK-17734:
-

That works; I was just hoping for a helper method that does essentially that, 
similar to how you can say `table1.join(table2, "value")` right now and get a 
DataFrame, just to be more concise and less error-prone.

> inner equi-join shorthand that returns Datasets, like DataFrame already has
> ---
>
> Key: SPARK-17734
> URL: https://issues.apache.org/jira/browse/SPARK-17734
> Project: Spark
>  Issue Type: Wish
>Reporter: Leif Warner
>Priority: Minor
>
> There's an existing ".join(right: Dataset[_], usingColumn: String): 
> DataFrame" method on Dataset.
> Would appreciate it if a variant that returns typed Datasets were also 
> available.
> If you write a join condition on the common column name, you get an 
> AnalysisException because that's ambiguous, e.g.:
> $"foo" === $"foo"
> So I wrote table1.toDF()("foo") === table2.toDF()("foo"), but that's a little 
> error prone, and coworkers considered it a hack and didn't want to use it, 
> because it "mixes DataFrame and Dataset api".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546284#comment-15546284
 ] 

Sean Owen commented on SPARK-17777:
---

I think you've demonstrated that this doesn't work, and I wouldn't say that's 
supposed to work. The reference you give isn't the same thing in that it 
doesn't lead to this hang / infinite loop. So I think the answer is, you can't 
do that.

> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-17777
> URL: https://issues.apache.org/jira/browse/SPARK-17777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
> Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> I've attached repro.scala which can simply be pasted in spark-shell to 
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? Well, 
> we have an RDD that is composed of several thousands of Parquet files. To 
> compute the partitioning strategy for this RDD, we create an RDD to read all 
> file sizes from S3 in parallel, so that we can quickly determine the proper 
> partitions. We do this to avoid executing this serially from the master node 
> which can result in significant slowness in the execution. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, 
> s3.getObjectSummary)).collect()
> A similar logic is used in DataFrame by Spark itself:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>  
> Thanks,
> -Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ameen Tayyebi updated SPARK-17777:
--
Description: 
We've identified a problem with Spark scheduling. The issue manifests itself 
when an RDD calls SparkContext.parallelize within its getPartitions method. 
This seemingly "recursive" call causes the problem. We have a repro case that 
can easily be run.

Please advise on what the issue might be and how we can work around it in the 
mean time.

I've attached repro.scala which can simply be pasted in spark-shell to 
reproduce the problem.

Why are we calling sc.parallelize in production within getPartitions? Well, we 
have an RDD that is composed of several thousands of Parquet files. To compute 
the partitioning strategy for this RDD, we create an RDD to read all file sizes 
from S3 in parallel, so that we can quickly determine the proper partitions. We 
do this to avoid executing this serially from the master node which can result 
in significant slowness in the execution. Pseudo-code:

val splitInfo = sc.parallelize(filePaths).map(f => (f, 
s3.getObjectSummary)).collect()

A similar logic is used in DataFrame by Spark itself:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
 

Thanks,
-Ameen

  was:
We've identified a problem with Spark scheduling. The issue manifests itself 
when an RDD calls SparkContext.parallelize within its getPartitions method. 
This seemingly "recursive" call causes the problem. We have a repro case that 
can easily be run.

Please advise on what the issue might be and how we can work around it in the 
mean time.

Thanks,
-Ameen


> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-17777
> URL: https://issues.apache.org/jira/browse/SPARK-17777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
> Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> I've attached repro.scala which can simply be pasted in spark-shell to 
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? Well, 
> we have an RDD that is composed of several thousands of Parquet files. To 
> compute the partitioning strategy for this RDD, we create an RDD to read all 
> file sizes from S3 in parallel, so that we can quickly determine the proper 
> partitions. We do this to avoid executing this serially from the master node 
> which can result in significant slowness in the execution. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, 
> s3.getObjectSummary)).collect()
> A similar logic is used in DataFrame by Spark itself:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>  
> Thanks,
> -Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546215#comment-15546215
 ] 

Apache Spark commented on SPARK-17758:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15348

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null. 
> My input data contains 3 rows. 
> The application resulted in 2 stages. 
> The first stage consisted of 3 tasks. 
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition. 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17758:


Assignee: (was: Apache Spark)

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result of a query that uses the LAST function is incorrect: the output
> obtained for the column that corresponds to the LAST function is null.
> My input data contains 3 rows.
> The application resulted in 2 stages.
> The first stage consisted of 3 tasks:
> the first task/partition contains 2 rows,
> the second task/partition contains 1 row, and
> the last task/partition contains 0 rows.
> The result of the query for the LAST column is NULL, which I believe is due
> to the PARTIAL_LAST on the last partition.
> I believe this behavior is incorrect: the PARTIAL_LAST call on an empty
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17758:


Assignee: Apache Spark

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>Assignee: Apache Spark
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result of a query that uses the LAST function is incorrect: the output
> obtained for the column that corresponds to the LAST function is null.
> My input data contains 3 rows.
> The application resulted in 2 stages.
> The first stage consisted of 3 tasks:
> the first task/partition contains 2 rows,
> the second task/partition contains 1 row, and
> the last task/partition contains 0 rows.
> The result of the query for the LAST column is NULL, which I believe is due
> to the PARTIAL_LAST on the last partition.
> I believe this behavior is incorrect: the PARTIAL_LAST call on an empty
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ameen Tayyebi updated SPARK-1:
--
Attachment: repro.scala

> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
> Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> Thanks,
> -Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ameen Tayyebi updated SPARK-1:
--
Comment: was deleted

(was: Here's the repro code:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class testRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil)
  with Serializable {

  override def getPartitions: Array[Partition] = {
    // Launching a nested job from inside getPartitions is what triggers the hang.
    sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()

    // Four dummy partitions (note: they all report index 0).
    val result = new Array[Partition](4)
    for (i <- 0 until 4) {
      result(i) = new Partition {
        override def index: Int = 0
      }
    }
    result
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
    Seq(("a", 3), ("b", 4)).iterator
}

val y = new testRDD(sc)
y.map(r => r).reduceByKey(_ + _).count()

This can simply be pasted into spark-shell.)

> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> Thanks,
> -Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546204#comment-15546204
 ] 

Ameen Tayyebi commented on SPARK-1:
---

Here's the repro code:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class testRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil)
  with Serializable {

  override def getPartitions: Array[Partition] = {
    // Launching a nested job from inside getPartitions is what triggers the hang.
    sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()

    // Four dummy partitions (note: they all report index 0).
    val result = new Array[Partition](4)
    for (i <- 0 until 4) {
      result(i) = new Partition {
        override def index: Int = 0
      }
    }
    result
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
    Seq(("a", 3), ("b", 4)).iterator
}

val y = new testRDD(sc)
y.map(r => r).reduceByKey(_ + _).count()

This can simply be pasted into spark-shell.
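
One possible workaround sketch, for illustration only and not taken from the
ticket or any Spark patch: run the nested job eagerly on the driver, before the
scheduler ever asks for partitions, instead of from inside getPartitions. The
class name eagerTestRDD and its field names are made up.

{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class eagerTestRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil) {

  // The nested job runs once at construction time on the driver, so no Spark
  // job is ever launched from inside getPartitions.
  private val precomputed: Array[(String, Int)] =
    sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](4)(i => new Partition { override def index: Int = i })

  // Each partition simply replays the precomputed result in this sketch.
  override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
    precomputed.iterator
}

val z = new eagerTestRDD(sc)
z.map(r => r).reduceByKey(_ + _).count()
{code}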

> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> mean time.
> Thanks,
> -Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-04 Thread Ameen Tayyebi (JIRA)
Ameen Tayyebi created SPARK-1:
-

 Summary: Spark Scheduler Hangs Indefinitely
 Key: SPARK-1
 URL: https://issues.apache.org/jira/browse/SPARK-1
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
 Environment: AWS EMR 4.3, can also be reproduced locally
Reporter: Ameen Tayyebi


We've identified a problem with Spark scheduling. The issue manifests itself 
when an RDD calls SparkContext.parallelize within its getPartitions method. 
This seemingly "recursive" call causes the problem. We have a repro case that 
can easily be run.

Please advise on what the issue might be and how we can work around it in the 
mean time.

Thanks,
-Ameen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546160#comment-15546160
 ] 

Apache Spark commented on SPARK-16827:
--

User 'dafrista' has created a pull request for this issue:
https://github.com/apache/spark/pull/15347

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into
> it, we found that one of the stages produces an excessive amount of shuffle
> data. Please note that this is a regression from Spark 1.6: stage 2 of the
> job, which used to produce 32KB of shuffle data with 1.6, now produces more
> than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16827:


Assignee: Apache Spark

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into
> it, we found that one of the stages produces an excessive amount of shuffle
> data. Please note that this is a regression from Spark 1.6: stage 2 of the
> job, which used to produce 32KB of shuffle data with 1.6, now produces more
> than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16827:


Assignee: (was: Apache Spark)

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into
> it, we found that one of the stages produces an excessive amount of shuffle
> data. Please note that this is a regression from Spark 1.6: stage 2 of the
> job, which used to produce 32KB of shuffle data with 1.6, now produces more
> than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-10-04 Thread Brian Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546142#comment-15546142
 ] 

Brian Cho commented on SPARK-16827:
---

We found that the "shuffle write" metric was also including writes of spill 
files, inflating the amount of shuffle writes. This even showed up for final 
stages when no shuffle writes should take place. I'll upload screenshots on the 
PR.

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT userid
>  FROM   table1 a
>  JOIN   table2 b
>    ON   a.ds = '2016-07-15'
>   AND   b.ds = '2016-07-15'
>   AND   a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower. Digging into
> it, we found that one of the stages produces an excessive amount of shuffle
> data. Please note that this is a regression from Spark 1.6: stage 2 of the
> job, which used to produce 32KB of shuffle data with 1.6, now produces more
> than 400GB with Spark 2.0. We also tried turning off whole-stage code
> generation, but that did not help.
> PS - Even though the intermediate shuffle data size is huge, the job still
> produces correct output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7313) Allow for configuring max_samples in range partitioner.

2016-10-04 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan closed SPARK-7313.
--
Resolution: Won't Fix

Not sure if this is still relevant; I opened it for an earlier project that is
now defunct.

> Allow for configuring max_samples in range partitioner.
> ---
>
> Key: SPARK-7313
> URL: https://issues.apache.org/jira/browse/SPARK-7313
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Minor
>
> Currently, we assume that 1e6 is a reasonable upper bound on the number of
> keys to sample. This works fine when the keys are 'small', but breaks down
> for anything non-trivial.
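
For illustration only, with none of the code below taken from Spark's source:
this is the shape of the fixed cap the ticket is about, where the number of
sampled keys is clamped at 1e6 regardless of how large each key is, so large
keys can blow up driver memory anyway.

{code}
// Hypothetical numbers; only the min(..., 1e6) clamp mirrors the ticket's point.
val requestedSamples = 5e6                             // what a big job might want to sample
val maxSamples = 1e6                                   // the hard-coded upper bound at issue
val sampleSize = math.min(requestedSamples, maxSamples).toInt

// With ~1 KB keys, even the capped sample keeps roughly 1 GB of key data on the driver.
val approxBytes = sampleSize.toLong * 1024
println(s"sampling $sampleSize keys, roughly ${approxBytes / (1L << 20)} MB of key data")
{code}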



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17741) Grammar to parse top level and nested data fields separately

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17741:


Assignee: (was: Apache Spark)

> Grammar to parse top level and nested data fields separately
> 
>
> Key: SPARK-17741
> URL: https://issues.apache.org/jira/browse/SPARK-17741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> Based on discussion over the dev list:
> {noformat}
> Is there any reason why Spark SQL supports <column name> ":" <data type>
> while specifying columns?
> e.g. sql("CREATE TABLE t1 (column1:INT)") works fine.
> Here is the relevant snippet in the grammar [0]:
> ```
> colType
> : identifier ':'? dataType (COMMENT STRING)?
> ;
> ```
> I do not see MySQL[1], Hive[2], Presto[3] and PostgreSQL [4] supporting ":" 
> while specifying columns.
> They all use space as a delimiter.
> [0] : 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L596
> [1] : http://dev.mysql.com/doc/refman/5.7/en/create-table.html
> [2] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [3] : https://prestodb.io/docs/current/sql/create-table.html
> [4] : https://www.postgresql.org/docs/9.1/static/sql-createtable.html
> {noformat}
> Herman's response:
> {noformat}
> This is because we use the same rule to parse top-level and nested data
> fields. For example:
> create table tbl_x(
>   id bigint,
>   nested struct<nested_col: string>
> )
> This shows both syntaxes. We should split this rule into a top-level rule and
> a nested rule.
> {noformat}
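
To see the two top-level spellings side by side, a quick sketch that can be
pasted into spark-shell; the table names are throwaway, and only the second
statement depends on the optional ':' in the colType rule quoted above:

{code}
// Both statements parse today because colType makes the ':' optional.
sql("CREATE TABLE t_space (column1 INT)")  // space-delimited, as in the other dialects cited above
sql("CREATE TABLE t_colon (column1:INT)")  // ':'-delimited, the Spark-specific form under discussion
{code}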



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17741) Grammar to parse top level and nested data fields separately

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546030#comment-15546030
 ] 

Apache Spark commented on SPARK-17741:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15346

> Grammar to parse top level and nested data fields separately
> 
>
> Key: SPARK-17741
> URL: https://issues.apache.org/jira/browse/SPARK-17741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> Based on discussion over the dev list:
> {noformat}
> Is there any reason why Spark SQL supports <column name> ":" <data type>
> while specifying columns?
> e.g. sql("CREATE TABLE t1 (column1:INT)") works fine.
> Here is the relevant snippet in the grammar [0]:
> ```
> colType
> : identifier ':'? dataType (COMMENT STRING)?
> ;
> ```
> I do not see MySQL[1], Hive[2], Presto[3] and PostgreSQL [4] supporting ":" 
> while specifying columns.
> They all use space as a delimiter.
> [0] : 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L596
> [1] : http://dev.mysql.com/doc/refman/5.7/en/create-table.html
> [2] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [3] : https://prestodb.io/docs/current/sql/create-table.html
> [4] : https://www.postgresql.org/docs/9.1/static/sql-createtable.html
> {noformat}
> Herman's response:
> {noformat}
> This is because we use the same rule to parse top-level and nested data
> fields. For example:
> create table tbl_x(
>   id bigint,
>   nested struct<nested_col: string>
> )
> This shows both syntaxes. We should split this rule into a top-level rule and
> a nested rule.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17741) Grammar to parse top level and nested data fields separately

2016-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17741:


Assignee: Apache Spark

> Grammar to parse top level and nested data fields separately
> 
>
> Key: SPARK-17741
> URL: https://issues.apache.org/jira/browse/SPARK-17741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> Based on discussion over the dev list:
> {noformat}
> Is there any reason why Spark SQL supports <column name> ":" <data type>
> while specifying columns?
> e.g. sql("CREATE TABLE t1 (column1:INT)") works fine.
> Here is the relevant snippet in the grammar [0]:
> ```
> colType
> : identifier ':'? dataType (COMMENT STRING)?
> ;
> ```
> I do not see MySQL[1], Hive[2], Presto[3] and PostgreSQL [4] supporting ":" 
> while specifying columns.
> They all use space as a delimiter.
> [0] : 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L596
> [1] : http://dev.mysql.com/doc/refman/5.7/en/create-table.html
> [2] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [3] : https://prestodb.io/docs/current/sql/create-table.html
> [4] : https://www.postgresql.org/docs/9.1/static/sql-createtable.html
> {noformat}
> Herman's response:
> {noformat}
> This is because we use the same rule to parse top-level and nested data
> fields. For example:
> create table tbl_x(
>   id bigint,
>   nested struct<nested_col: string>
> )
> This shows both syntaxes. We should split this rule into a top-level rule and
> a nested rule.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17773) HiveInspector wrapper for JavaVoidObjectInspector is missing

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546003#comment-15546003
 ] 

Apache Spark commented on SPARK-17773:
--

User 'seyfe' has created a pull request for this issue:
https://github.com/apache/spark/pull/15345

> HiveInspector wrapper for JavaVoidObjectInspector is missing 
> -
>
> Key: SPARK-17773
> URL: https://issues.apache.org/jira/browse/SPARK-17773
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
>Priority: Minor
> Fix For: 2.1.0
>
>
> Executing the following query fails.
> {noformat}
> select SOME_UDAF*(a.arr) 
> from (
>   select Array(null) as arr from dim_one_row
> ) a
> SOME_UDAF = a UDAF similar to FIRST(), but it skips null values when it can
> and adds randomization.
> Error message:
> scala.MatchError: 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaVoidObjectInspector@39055e0d
>  (of class 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaVoidObjectInspector)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.wrapperFor(HiveInspectors.scala:257)
>   at org.apache.spark.sql.hive.HiveUDAFFunction.wrapperFor(hiveUDFs.scala:269)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:719)
> {noformat}
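
For context, a hedged sketch of the missing mapping; this is not the actual
Spark patch, just the shape of a pattern-match arm that handles a void
inspector, shown as a standalone function (wrapperForSketch is a made-up name):

{code}
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaVoidObjectInspector

// A void-typed value can only ever be null, so the wrapper ignores its input.
def wrapperForSketch(oi: ObjectInspector): Any => Any = oi match {
  case _: JavaVoidObjectInspector => (_: Any) => null
  case _                          => identity // placeholder; the real code handles many inspector types
}
{code}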



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases

2016-10-04 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545991#comment-15545991
 ] 

ding commented on SPARK-17559:
--

Yes, it's the correct username. Thank you for reviewing.

> PeriodicGraphCheckpointer did not persist edges as expected in some cases
> -
>
> Key: SPARK-17559
> URL: https://issues.apache.org/jira/browse/SPARK-17559
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: ding
>Assignee: dingding
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When using PeriodicGraphCheckpointer to persist a graph, the edges are
> sometimes not persisted. Currently the graph is persisted only when the
> vertices' storage level is none, but there is a chance that the vertices'
> storage level is not none while the edges' level is. For example, for a graph
> created by an outerJoinVertices operation, the vertices are automatically
> cached while the edges are not, so the edges will not be persisted when
> PeriodicGraphCheckpointer is used to persist the graph.
> See the minimal example below:
> val graphCheckpointer =
>   new PeriodicGraphCheckpointer[Array[String], Int](2, sc)
> val users = sc.textFile("data/graphx/users.txt")
>   .map(line => line.split(","))
>   .map(parts => (parts.head.toLong, parts.tail))
> val followerGraph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
> // outerJoinVertices caches the new vertices but not the edges, so the
> // checkpointer sees a non-none vertex storage level and never persists the
> // still-unpersisted edges.
> val graph = followerGraph.outerJoinVertices(users) {
>   case (uid, deg, Some(attrList)) => attrList
>   case (uid, deg, None) => Array.empty[String]
> }
> graphCheckpointer.update(graph)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


