[jira] [Created] (SPARK-31184) Support getTablesByType API of Hive Client

2020-03-18 Thread Xin Wu (Jira)
Xin Wu created SPARK-31184:
--

 Summary: Support getTablesByType API of Hive Client
 Key: SPARK-31184
 URL: https://issues.apache.org/jira/browse/SPARK-31184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


Hive 2.3+ supports the getTablesByType API, which is a precondition for implementing 
SHOW VIEWS in HiveExternalCatalog. Currently, without this API, we cannot get 
Hive tables of type HiveTableType.VIRTUAL_VIEW directly.
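For illustration only, a rough sketch (not the actual patch) of how a Hive client 
shim could use this API to list views, assuming Hive 2.3+'s 
Hive.getTablesByType(db, pattern, tableType) signature:
{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.TableType
import org.apache.hadoop.hive.ql.metadata.Hive

// Return only view names for the given database and pattern; the VIRTUAL_VIEW filter
// is applied on the metastore side instead of listing all tables and filtering in Spark.
def listViews(hive: Hive, db: String, pattern: String): Seq[String] =
  hive.getTablesByType(db, pattern, TableType.VIRTUAL_VIEW).asScala.toSeq
{code}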






[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"

2020-03-10 Thread Xin Wu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056669#comment-17056669
 ] 

Xin Wu commented on SPARK-31113:


Sure, I'm working on this! Thanks [~smilegator]

> Support DDL "SHOW VIEWS"
> 
>
> Key: SPARK-31113
> URL: https://issues.apache.org/jira/browse/SPARK-31113
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It is nice to have a `SHOW VIEWS` command similar to Hive 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>  
>  






[jira] [Created] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted

2020-03-06 Thread Xin Wu (Jira)
Xin Wu created SPARK-31079:
--

 Summary: Add RuleExecutor metrics in Explain Formatted
 Key: SPARK-31079
 URL: https://issues.apache.org/jira/browse/SPARK-31079
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


RuleExecutor already supports metering for the analyzer/optimizer rules. By exposing 
this information in the EXPLAIN command, users get a better experience when 
debugging a specific query.
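For reference, the metering that already exists can be inspected manually in 
spark-shell today; this sketch only shows what data is available, not how the 
EXPLAIN integration would look:
{code}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetMetrics()   // clear the accumulated per-rule timings
spark.sql("SELECT k, count(*) FROM VALUES (1), (1), (2) AS t(k) GROUP BY k").collect()
// Prints per-rule total/effective time and run counts for the analyzer and optimizer.
println(RuleExecutor.dumpTimeSpent())
{code}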






[jira] [Created] (SPARK-30940) Remove meaningless attributeId when Explain SQL query

2020-02-24 Thread Xin Wu (Jira)
Xin Wu created SPARK-30940:
--

 Summary: Remove meaningless attributeId when Explain SQL query
 Key: SPARK-30940
 URL: https://issues.apache.org/jira/browse/SPARK-30940
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xin Wu


When running EXPLAIN on a SQL query, the generated aliases shouldn't include 
expression/attribute IDs. This improves the readability of the EXPLAIN results. This 
is a follow-up to address [#27368 
(comment)|https://github.com/apache/spark/pull/27368#discussion_r376927143]. 
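An illustrative spark-shell example of the readability issue (the exact output shape 
is approximate):
{code}
// The generated alias for sum(v) can carry an internal expression id, e.g. `sum(v)#10L`,
// into the EXPLAIN output, where a plain `sum(v)` would read better.
spark.sql("""
  EXPLAIN FORMATTED
  SELECT k, sum(v) FROM VALUES (1, 2), (1, 3) AS t(k, v) GROUP BY k
""").show(false)
{code}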






[jira] [Created] (SPARK-30842) Adjust abstraction structure for join operators

2020-02-15 Thread Xin Wu (Jira)
Xin Wu created SPARK-30842:
--

 Summary: Adjust abstraction structure for join operators
 Key: SPARK-30842
 URL: https://issues.apache.org/jira/browse/SPARK-30842
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


Currently the join operators are not well abstracted, even though they share a lot of 
common logic. A trait can be created for easier pattern matching and future extension.
 
This is a follow-up based on the comment at 
[https://github.com/apache/spark/pull/27509#discussion_r379613391]
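A minimal sketch of the kind of trait meant here (the names are illustrative, not the 
merged change):
{code}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.JoinType

// A common parent lets planner rules pattern-match on "any join exec" and share
// key/condition handling, instead of enumerating every concrete join operator.
trait BaseJoinLike {
  def leftKeys: Seq[Expression]
  def rightKeys: Seq[Expression]
  def joinType: JoinType
  def condition: Option[Expression]
}
{code}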






[jira] [Created] (SPARK-30765) Refine base class abstraction code style

2020-02-09 Thread Xin Wu (Jira)
Xin Wu created SPARK-30765:
--

 Summary: Refine base class abstraction code style
 Key: SPARK-30765
 URL: https://issues.apache.org/jira/browse/SPARK-30765
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


While doing the base operator abstraction work, I found that some code snippets are 
still inconsistent with the rest of the abstraction code style.
 
Case 1: the override keyword is missing for some fields in derived classes. The 
compiler will not catch this if we rename those fields in the future.
[https://github.com/apache/spark/pull/27368#discussion_r376694045]
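For illustration, a minimal made-up example of the Case 1 pitfall:
{code}
abstract class BaseExec {
  def nodeLabel: String               // abstract member
}

class SampleExec extends BaseExec {
  def nodeLabel: String = "sample"    // compiles with or without `override`
}

// If `nodeLabel` is later renamed in BaseExec and given a default implementation there,
// SampleExec still compiles, silently keeping an unused `nodeLabel` member instead of
// overriding anything. Writing `override def nodeLabel` lets the compiler catch the
// rename immediately.
{code}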
 
 
Case 2: inconsistent abstract class definitions. The updated style simplifies the 
definition of derived classes.
[https://github.com/apache/spark/pull/27368#discussion_r375061952]
 






[jira] [Created] (SPARK-30764) Improve the readability of EXPLAIN FORMATTED style

2020-02-09 Thread Xin Wu (Jira)
Xin Wu created SPARK-30764:
--

 Summary: Improve the readability of EXPLAIN FORMATTED style
 Key: SPARK-30764
 URL: https://issues.apache.org/jira/browse/SPARK-30764
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


The style of the EXPLAIN FORMATTED output needs to be improved. We’ve already collected 
some observations/ideas in
[https://github.com/apache/spark/pull/27368#discussion_r376694496].
 
TODOs:
1. Using a comma as the separator is not clear, especially since commas are also used 
inside the expressions.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated; these need to be refined.
4. Add an arguments field in common implementations, as EXPLAIN FORMATTED does in 
QueryPlan.
...
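For context, an illustrative query whose formatted plan exercises the points above 
(the separators, `Results [...]` sections, and generated attribute names):
{code}
spark.sql("""
  EXPLAIN FORMATTED
  SELECT k, sum(v) AS total
  FROM VALUES (1, 10), (1, 20), (2, 30) AS t(k, v)
  GROUP BY k
""").show(false)
{code}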






[jira] [Resolved] (SPARK-30652) EXPLAIN EXTENDED does not show detail information for aggregate operators

2020-01-27 Thread Xin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu resolved SPARK-30652.

Resolution: Duplicate

> EXPLAIN EXTENDED does not show detail information for aggregate operators
> -
>
> Key: SPARK-30652
> URL: https://issues.apache.org/jira/browse/SPARK-30652
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xin Wu
>Priority: Major
>
> Currently EXPLAIN FORMATTED only reports the input attributes of 
> HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED 
> provides more information. We need to enhance EXPLAIN FORMATTED to follow the 
> original behavior.






[jira] [Created] (SPARK-30652) EXPLAIN EXTENDED does not show detail information for aggregate operators

2020-01-27 Thread Xin Wu (Jira)
Xin Wu created SPARK-30652:
--

 Summary: EXPLAIN EXTENDED does not show detail information for 
aggregate operators
 Key: SPARK-30652
 URL: https://issues.apache.org/jira/browse/SPARK-30652
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xin Wu


Currently EXPLAIN FORMATTED only reports the input attributes of 
HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED 
provides more information. We need to enhance EXPLAIN FORMATTED to follow the 
original behavior.






[jira] [Created] (SPARK-30651) EXPLAIN EXTENDED does not show detail information for aggregate operators

2020-01-27 Thread Xin Wu (Jira)
Xin Wu created SPARK-30651:
--

 Summary: EXPLAIN EXTENDED does not show detail information for 
aggregate operators
 Key: SPARK-30651
 URL: https://issues.apache.org/jira/browse/SPARK-30651
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xin Wu


Currently EXPLAIN FORMATTED only reports the input attributes of 
HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED 
provides more information. We need to enhance EXPLAIN FORMATTED to follow the 
original behavior.






[jira] [Updated] (SPARK-30326) Raise exception if analyzer exceed max iterations

2019-12-22 Thread Xin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-30326:
---
Description: Currently, both the analyzer and the optimizer just log a warning 
message if rule execution exceeds the max iterations. They should have different 
behavior: the analyzer should raise an exception to indicate that the plan is not 
fixed after the max iterations, while the optimizer just logs a warning and keeps 
the current plan. This is more feasible after SPARK-30138 was introduced.  (was: 
Currently, both the analyzer and the optimizer just log a warning message if rule 
execution exceeds the max iterations. They should have different behavior: the 
analyzer should raise an exception to indicate that logical plan resolution failed, 
while the optimizer just logs a warning and keeps the current plan. This is more 
feasible after SPARK-30138 was introduced.)

> Raise exception if analyzer exceed max iterations
> -
>
> Key: SPARK-30326
> URL: https://issues.apache.org/jira/browse/SPARK-30326
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xin Wu
>Priority: Major
>
> Currently, both the analyzer and the optimizer just log a warning message if rule 
> execution exceeds the max iterations. They should have different behavior: the 
> analyzer should raise an exception to indicate that the plan is not fixed after the 
> max iterations, while the optimizer just logs a warning and keeps the current plan. 
> This is more feasible after SPARK-30138 was introduced.






[jira] [Created] (SPARK-30326) Raise exception if analyzer exceed max iterations

2019-12-21 Thread Xin Wu (Jira)
Xin Wu created SPARK-30326:
--

 Summary: Raise exception if analyzer exceed max iterations
 Key: SPARK-30326
 URL: https://issues.apache.org/jira/browse/SPARK-30326
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xin Wu


Currently, both the analyzer and the optimizer just log a warning message if rule 
execution exceeds the max iterations. They should have different behavior: the analyzer 
should raise an exception to indicate that logical plan resolution failed, while the 
optimizer just logs a warning and keeps the current plan. This is more feasible 
after SPARK-30138 was introduced.
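A minimal sketch of the intended behavior split (illustrative only, not the merged 
patch):
{code}
// When the fixed point is not reached within maxIterations, the analyzer should fail
// hard, while the optimizer should only warn and keep the last plan it produced.
def checkConvergence(executorName: String, converged: Boolean, maxIterations: Int): Unit = {
  if (!converged) {
    val msg = s"$executorName did not converge within $maxIterations iterations"
    if (executorName == "Analyzer") throw new IllegalStateException(msg)
    else println(s"WARN: $msg (keeping the current plan)")
  }
}
{code}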






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987976#comment-15987976
 ] 

Xin Wu commented on SPARK-18727:


Thanks! 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987939#comment-15987939
 ] 

Xin Wu commented on SPARK-18727:


[~ekhliang] I see. I will try to support ALTER TABLE SCHEMA. Also, this is 
similar to, or the same as, ALTER TABLE REPLACE COLUMNS, which is documented as an 
unsupported Hive feature in SqlBase.g4. Do we have a preference for which one to use?

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987874#comment-15987874
 ] 

Xin Wu commented on SPARK-18727:


[~ekhliang] First of all, I am not sure whether it is wise to introduce more 
non-SQL-standard syntax into Spark's DDL. In addition, ALTER TABLE SCHEMA, or ALTER 
TABLE SET/UPDATE/MODIFY SCHEMA, whatever we end up calling it, requires users to spell 
out the whole list of column definitions for what may be a small change to a single 
column. That is inconvenient, especially when the table is relatively wide. What do 
you think, [~smilegator]? 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987830#comment-15987830
 ] 

Xin Wu commented on SPARK-18727:


[~simeons] You are right. My PR does not include the feature that allows you 
to add a new field to a complex type. Such a feature could be supported by 
{code}ALTER TABLE <table> CHANGE COLUMN <column> <column> <newType>{code}, where 
newType contains the newly added fields. 

I am also working on this part. 

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987642#comment-15987642
 ] 

Xin Wu commented on SPARK-18727:


FYI. I have https://github.com/apache/spark/pull/16626 for ALTER TABLE ADD 
COLUMNS merged into 2.2. 
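A minimal example of the ALTER TABLE ADD COLUMNS support referenced above (table and 
column names are made up, and it assumes a table format for which ADD COLUMNS is 
allowed):
{code}
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("ALTER TABLE t ADD COLUMNS (name STRING, score DOUBLE)")
spark.sql("DESCRIBE t").show(false)
{code}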

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964969#comment-15964969
 ] 

Xin Wu commented on SPARK-20256:


Yes. I am working on it. 
My proposal is to revert the SPARK-18050 change, then add a try-catch around 
externalCatalog.createDatabase(...) and log the error about the already-existing 
default database from Hive at DEBUG level. 
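A rough sketch of the proposed handling (illustrative names, not the actual patch):
{code}
// Wrap the default-database creation so a failure caused by missing warehouse
// permissions, or by the database already existing in Hive, is only logged at DEBUG
// instead of aborting SparkSession startup.
def createDefaultDbQuietly(create: () => Unit, logDebug: (String, Throwable) => Unit): Unit = {
  try {
    create()   // e.g. externalCatalog.createDatabase(defaultDbDefinition, ignoreIfExists = true)
  } catch {
    case e: Exception =>
      logDebug("Could not create the 'default' database (it may already exist in Hive, " +
        "or this user lacks warehouse permissions); continuing startup.", e)
  }
}
{code}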

I am trying to create a unit test case that simulates the permission issue, which 
I am having some difficulty with. 



> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks for the existence of the 
> "default" database in the Hive metastore. Yet, since this user may not have 
> READ/WRITE access to the Hive warehouse directory configured by Hive itself, such a 
> permission error will prevent spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> 

[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961193#comment-15961193
 ] 

Xin Wu commented on SPARK-20256:


I am working on a fix and creating simulated test cases for this issue. 

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks for the existence of the 
> "default" database in the Hive metastore. Yet, since this user may not have 
> READ/WRITE access to the Hive warehouse directory configured by Hive itself, such a 
> permission error will prevent spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> 

[jira] [Updated] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-20256:
---
Description: 
In a cluster setup with production Hive running, when the user wants to run 
spark-shell against the production Hive metastore, hive-site.xml is copied to 
SPARK_HOME/conf. So when spark-shell starts, it checks for the existence of the 
"default" database in the Hive metastore. Yet, since this user may not have READ/WRITE 
access to the Hive warehouse directory configured by Hive itself, such a permission 
error will prevent spark-shell, or any Spark application with Hive support enabled, 
from starting at all. 

Example error:
{code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: 
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
);
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 

[jira] [Created] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)
Xin Wu created SPARK-20256:
--

 Summary: Fail to start SparkContext/SparkSession with Hive support 
enabled when user does not have read/write privilege to Hive metastore 
warehouse dir
 Key: SPARK-20256
 URL: https://issues.apache.org/jira/browse/SPARK-20256
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.1.1, 2.2.0
Reporter: Xin Wu
Priority: Critical


In a cluster setup with production Hive running, when the user wants to run 
spark-shell against the production Hive metastore, hive-site.xml is copied to 
SPARK_HOME/conf. So when spark-shell starts, it checks for the existence of the 
"default" database in the Hive metastore. Yet, since this user may not have READ/WRITE 
access to the Hive warehouse directory configured by Hive itself, such a permission 
error will prevent spark-shell, or any Spark application with Hive support enabled, 
from starting at all. 

Example error:
{code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: 
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
);
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 

[jira] [Updated] (SPARK-19539) CREATE TEMPORARY TABLE needs to avoid existing temp view

2017-02-09 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-19539:
---
Summary: CREATE TEMPORARY TABLE needs to avoid existing temp view  (was: 
CREATE TEMPORARY TABLE need to avoid existing temp view)

> CREATE TEMPORARY TABLE needs to avoid existing temp view
> 
>
> Key: SPARK-19539
> URL: https://issues.apache.org/jira/browse/SPARK-19539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to 
> use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" 
> clause.  However, if there is an existing temporary view defined, it is 
> possible to unintentionally replace this existing view by issuing "CREATE 
> TEMPORARY TABLE ... " with the same table/view name. 






[jira] [Created] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view

2017-02-09 Thread Xin Wu (JIRA)
Xin Wu created SPARK-19539:
--

 Summary: CREATE TEMPORARY TABLE need to avoid existing temp view
 Key: SPARK-19539
 URL: https://issues.apache.org/jira/browse/SPARK-19539
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xin Wu


Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to use 
"CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" clause.  
However, if there is an existing temporary view defined, it is possible to 
unintentionally replace this existing view by issuing "CREATE TEMPORARY TABLE 
... " with the same table/view name. 






[jira] [Closed] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-01-25 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu closed SPARK-15463.
--
Resolution: Later

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>
> I currently use Databrick's spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].






[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2016-12-07 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729696#comment-15729696
 ] 

Xin Wu commented on SPARK-18727:


I am currently working on ALTER TABLE ADD COLUMNS for tables with provider = 
hive and will submit a PR soon. Just wondering whether it will solve part of 
this JIRA. Please advise! Thanks!

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]






[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888
 ] 

Xin Wu edited comment on SPARK-18539 at 12/6/16 12:46 AM:
--

I think we will hit the issue if we use a user-specified schema. Here is what I 
tried in spark-shell built from the master branch:
{code}
import org.apache.spark.sql.types._

val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
spark.read.option("mergeSchema", "true")
  .schema(schema)
  .parquet("/Users/xinwu/spark-test/data/spark-18539")
  .filter("b is null")
  .count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
  at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
  at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
{code}

Here I have one parquet file that is missing column b, and I query it with the 
user-specified schema (a, b). 




was (Author: xwu0226):
I think we will hit the issue if we use user-specified schema. Here is what I 
tried in spark-shell built from master branch:
{code}
val df = spark.range(1).coalesce(1)
df.selectExpr("id AS 
a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", 
IntegerType)))
spark.read.option("mergeSchema", 
"true").schema(schema).parquet("/Users/xinwu/spark-test/data/spark-18539").filter("b
 < 0").count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888
 ] 

Xin Wu commented on SPARK-18539:


I think we will hit the issue if we use a user-specified schema. Here is what I 
tried in spark-shell built from the master branch:
{code}
import org.apache.spark.sql.types._

val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
spark.read.option("mergeSchema", "true")
  .schema(schema)
  .parquet("/Users/xinwu/spark-test/data/spark-18539")
  .filter("b < 0")
  .count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
  at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
  at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
{code}

Here I have one parquet file that is missing column b, and I query it with the 
user-specified schema (a, b). 



> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723296#comment-15723296
 ] 

Xin Wu commented on SPARK-18539:


Yes. I have the fix and will submit PR and cc everyone for review.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at 

[jira] [Created] (SPARK-17551) support null ordering for DataFrame API

2016-09-14 Thread Xin Wu (JIRA)
Xin Wu created SPARK-17551:
--

 Summary: support null ordering for DataFrame API
 Key: SPARK-17551
 URL: https://issues.apache.org/jira/browse/SPARK-17551
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xin Wu


SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause for the 
SQL interface. This JIRA is to complete the feature by adding the same support 
to the DataFrame/Dataset APIs. 
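
A rough sketch of what the DataFrame/Dataset-side API could look like, mirroring the SQL 
support from SPARK-10747. The method names below are illustrative of the proposal, not a 
finalized API, and {{df}} is assumed to be an existing DataFrame with a nullable "age" 
column:
{code}
import org.apache.spark.sql.functions.col

// Illustrative only: mirrors "ORDER BY age DESC NULLS LAST" from the SQL side.
val nullsLast  = df.orderBy(col("age").desc_nulls_last)

// ... and "ORDER BY age ASC NULLS FIRST".
val nullsFirst = df.orderBy(col("age").asc_nulls_first)
{code}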



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10747) add support for NULLS FIRST|LAST in ORDER BY clause

2016-08-26 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-10747:
---
Summary: add support for NULLS FIRST|LAST in ORDER BY clause  (was: add 
support for window specification to include how NULLS are ordered)

> add support for NULLS FIRST|LAST in ORDER BY clause
> ---
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState:  null
> Same limitation as Hive reported in Apache JIRA HIVE-9535 )
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440656#comment-15440656
 ] 

Xin Wu commented on SPARK-10747:


This JIRA may be changed to cover support for the NULLS FIRST|LAST feature in 
the ORDER BY clause. 
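
For reference, a sketch against the reporter's tolap table quoted below: the first query 
is the target syntax once the feature is in (it is not accepted at the time of this 
comment), the second is the compensating-expression workaround available today:
{code}
// Target syntax (not parsed yet at the time of this comment):
spark.sql("SELECT rnum, c1, c2, c3 FROM tolap ORDER BY c3 DESC NULLS LAST").show()

// Current workaround: an extra leading sort key pushes the NULL rows to the end.
spark.sql(
  "SELECT rnum, c1, c2, c3 FROM tolap " +
  "ORDER BY (CASE WHEN c3 IS NULL THEN 1 ELSE 0 END), c3 DESC").show()
{code}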

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState:  null
> Same limitation as Hive reported in Apache JIRA HIVE-9535 )
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-10747:
---
Issue Type: New Feature  (was: Improvement)

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState:  null
> Same limitation as Hive reported in Apache JIRA HIVE-9535 )
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457
 ] 

Xin Wu edited comment on SPARK-14927 at 8/26/16 4:46 AM:
-

[~smilegator] Do you think what you are working on will fix this issue by the 
way? This is to allow hive to see the partitions created by SparkSQL from a 
data frame. 


was (Author: xwu0226):
[~smilegator] Do you think what you are working on regarding will fix this 
issue? This is to allow hive to see the partitions created by SparkSQL from a 
data frame. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from `DataFrame. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.
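
A commonly suggested workaround, sketched against the repro above (not a fix for 
saveAsTable itself, and details can vary by Spark version): create the partitioned table 
through Hive DDL first, then append into it with dynamic partitioning enabled, so the 
partitions are registered in the Hive metastore. {{hc}} is the reporter's HiveContext:
{code}
import org.apache.spark.sql.SaveMode

// Dynamic partition settings, as in the commented-out lines of the repro.
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Create the partitioned table through Hive DDL so the metastore knows the
// partitioning scheme.
hc.sql("""CREATE TABLE IF NOT EXISTS tmp.partitiontest1 (`val` STRING)
          PARTITIONED BY (year INT) STORED AS PARQUET""")

// insertInto matches columns by position, with partition columns last,
// so order the DataFrame columns as (val, year) before appending.
import hc.implicits._
Seq(2012 -> "a").toDF("year", "val")
  .select("val", "year")
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.partitiontest1")

hc.sql("show partitions tmp.partitiontest1").show()
{code}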



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457
 ] 

Xin Wu commented on SPARK-14927:


[~smilegator] Do you think what you are working on regarding will fix this 
issue? This is to allow hive to see the partitions created by SparkSQL from a 
data frame. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from `DataFrame. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-17 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425547#comment-15425547
 ] 

Xin Wu commented on SPARK-10747:


[~hvanhovell] Yes. Since we have native parser now, we can do this within 
SparkSQL. I can work on this. Thanks!

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState:  null
> Same limitation as Hive reported in Apache JIRA HIVE-9535 )
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-16924:
---
Issue Type: Improvement  (was: Bug)

> DataStreamReader can not support option("inferSchema", true/false) for csv 
> and json file source
> ---
>
> Key: SPARK-16924
> URL: https://issues.apache.org/jira/browse/SPARK-16924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently DataStreamReader can not support option("inferSchema", true|false) 
> for csv and json file source. It only takes SQLConf setting 
> "spark.sql.streaming.schemaInference", which needs to be set at session 
> level. 
> For example:
> {code}
> scala> val in = spark.readStream.format("json").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/json/t1")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> scala> val in = spark.readStream.format("csv").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/csv")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> {code}
> In the example, even though users specify the option("inferSchema", true), it 
> does not take it. But for batch data, DataFrameReader can take it:
> {code}
> scala> val in = spark.read.format("csv").option("header", 
> true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
> in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Xin Wu (JIRA)
Xin Wu created SPARK-16924:
--

 Summary: DataStreamReader can not support option("inferSchema", 
true/false) for csv and json file source
 Key: SPARK-16924
 URL: https://issues.apache.org/jira/browse/SPARK-16924
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


Currently DataStreamReader does not support option("inferSchema", true|false) 
for the csv and json file sources. It only honors the SQLConf setting 
"spark.sql.streaming.schemaInference", which needs to be set at the session 
level. 

For example:
{code}
scala> val in = spark.readStream.format("json").option("inferSchema", 
true).load("/Users/xinwu/spark-test/data/json/t1")
java.lang.IllegalArgumentException: Schema must be specified when creating a 
streaming source DataFrame. If some files already exist in the directory, then 
depending on the file format you may be able to create a static DataFrame on 
that directory with 'spark.read.load(directory)' and infer schema from it.
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
  ... 48 elided

scala> val in = spark.readStream.format("csv").option("inferSchema", 
true).load("/Users/xinwu/spark-test/data/csv")
java.lang.IllegalArgumentException: Schema must be specified when creating a 
streaming source DataFrame. If some files already exist in the directory, then 
depending on the file format you may be able to create a static DataFrame on 
that directory with 'spark.read.load(directory)' and infer schema from it.
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
  ... 48 elided

{code}
In the example, even though the user specifies option("inferSchema", true), it 
is not honored. For batch data, however, DataFrameReader does honor it:
{code}
scala> val in = spark.read.format("csv").option("header", 
true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
{code}
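
Until the reader-level option is honored for streams, the session-level conf mentioned 
above can serve as a stopgap. A sketch, reusing the path from the repro:
{code}
// Stopgap: enable schema inference for streaming file sources at the session
// level, since the per-reader option is currently ignored.
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// Same call as in the repro above, now without requiring an explicit schema.
val in = spark.readStream
  .format("json")
  .load("/Users/xinwu/spark-test/data/json/t1")
{code}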



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE

2016-08-04 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408405#comment-15408405
 ] 

Xin Wu commented on SPARK-9761:
---

[~drwinters] Spark 2.0 has support for DDL commands, which opens up the 
opportunity to implement ALTER TABLE ADD/CHANGE COLUMNS; that is not supported 
yet in the currently released Spark 2.0. Spark 2.1 will also bring some changes 
to the native DDL infrastructure. I think once that is settled, it will be 
easier to support this. I am looking into this as well. 
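
For reference, a sketch of the kind of native DDL this would enable, against the table 
from the report below. This is target syntax only; it is not accepted by Spark SQL at 
the time of this comment:
{code}
// Target behavior: ALTER TABLE handled natively, so the new column shows up in
// DESCRIBE without going through the Hive CLI.
spark.sql("ALTER TABLE dimension_components ADD COLUMNS (z STRING)")
spark.sql("DESCRIBE dimension_components").show(false)
{code}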

> Inconsistent metadata handling with ALTER TABLE
> ---
>
> Key: SPARK-9761
> URL: https://issues.apache.org/jira/browse/SPARK-9761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. 
> The table in question was created with {{HiveContext.read.json()}}.
> Steps:
> # {{alter table dimension_components add columns (z string);}} succeeds.
> # {{describe dimension_components;}} does not show the new column, even after 
> restarting spark-sql.
> # A second {{alter table dimension_components add columns (z string);}} fails 
> with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Duplicate column name: z
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2016-07-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383082#comment-15383082
 ] 

Xin Wu edited comment on SPARK-16605 at 7/18/16 9:17 PM:
-

The current issue with ORC data inserted by Hive is that the schema stored in 
the ORC files written by Hive uses dummy column names such as "_col1, _col2, 
...". Hive knows how to read the data. However, Spark SQL, for performance, 
tries to convert the ORC table to its native ORC relation for scanning, and in 
doing so it infers the schema from the ORC files directly instead of taking the 
table schema from the Hive metastore. There is then a mismatch. 

Try the workaround that turns off this conversion (it exists only as a 
performance optimization): 
{code}set spark.sql.hive.convertMetastoreOrc=false{code}

Then, see if it works. 
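
For a Spark 2.0 session, the setting can be applied either through SQL or through the 
runtime conf. A sketch against the reporter's tborc table:
{code}
// Either form sets the same session conf.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
// or:
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

// With the conversion off, Spark reads the table through the Hive SerDe and
// uses the metastore schema instead of the dummy _colN names in the files.
spark.sql("SELECT * FROM tborc").show()
{code}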


was (Author: xwu0226):
The current issue for dealing with ORC data inserted by Hive is that the schema 
stored in orc file inserted by hive is using dummy column name such as "_col1, 
_col2, ...". Hive knows how to read the data. However, in Spark SQL, for 
performance gain, it tries to convert ORC table to its native ORC relation for 
scanning, in that it infers schema from orc file directly but getting the table 
schema from hive megastore. There are then mismatch here. 

Try the workaround that turns off this conversion for performance: 
{code}set spark.sql.hive.convertMetastoreOrc=false{code}

Then, see if it works. 

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use hive to create a table "tbtxt" stored as txt and load data into it.
> 2. Use hive to create a table "tborc" stored as orc and insert the data from 
> table "tbtxt" . Example, "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use spark2.0 to "select * from tborc;".-->error 
> occurs,java.lang.IllegalArgumentException: Field "nid" does not exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports

2016-07-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383082#comment-15383082
 ] 

Xin Wu commented on SPARK-16605:


The current issue for dealing with ORC data inserted by Hive is that the schema 
stored in orc file inserted by hive is using dummy column name such as "_col1, 
_col2, ...". Hive knows how to read the data. However, in Spark SQL, for 
performance gain, it tries to convert ORC table to its native ORC relation for 
scanning, in that it infers schema from orc file directly but getting the table 
schema from hive megastore. There are then mismatch here. 

Try the workaround that turns off this conversion for performance: 
{code}set spark.sql.hive.convertMetastoreOrc=false{code}

Then, see if it works. 

> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> ---
>
> Key: SPARK-16605
> URL: https://issues.apache.org/jira/browse/SPARK-16605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: screenshot-1.png
>
>
> Spark2.0 cannot "select" data from a table stored as an orc file which has 
> been created by hive while hive or spark1.6 supports
> Steps:
> 1. Use hive to create a table "tbtxt" stored as txt and load data into it.
> 2. Use hive to create a table "tborc" stored as orc and insert the data from 
> table "tbtxt" . Example, "create table tborc stored as orc as select * from 
> tbtxt"
> 3. Use spark2.0 to "select * from tborc;".-->error 
> occurs,java.lang.IllegalArgumentException: Field "nid" does not exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode

2016-06-15 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15970:
--

 Summary: WARNing message related to persisting table to Hive 
Metastore while Spark SQL is running in-memory catalog mode
 Key: SPARK-15970
 URL: https://issues.apache.org/jira/browse/SPARK-15970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu
Priority: Minor


When we run spark-shell in in-memory catalog mode, creating a data source table 
that is not compatible with Hive shows a warning message saying it cannot 
persist the table in a Hive-compatible way. However, in-memory catalog mode 
should not involve trying to persist the table to the Hive metastore at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables

2016-06-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313257#comment-15313257
 ] 

Xin Wu edited comment on SPARK-15705 at 6/2/16 11:15 PM:
-

I can recreate it now and will look into it. This is a different issue from 
SPARK-14959.


was (Author: xwu0226):
I can recreate it now. and will look into it. 

> Spark won't read ORC schema from metastore for partitioned tables
> -
>
> Key: SPARK-15705
> URL: https://issues.apache.org/jira/browse/SPARK-15705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>Reporter: Nic Eggert
>
> Spark does not seem to read the schema from the Hive metastore for 
> partitioned tables stored as ORC files. It appears to read the schema from 
> the files themselves, which, if they were created with Hive, does not match 
> the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To 
> reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by 
> (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, 
> "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in 
> Hive.
> Actual Result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables

2016-06-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313257#comment-15313257
 ] 

Xin Wu commented on SPARK-15705:


I can recreate it now. and will look into it. 

> Spark won't read ORC schema from metastore for partitioned tables
> -
>
> Key: SPARK-15705
> URL: https://issues.apache.org/jira/browse/SPARK-15705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>Reporter: Nic Eggert
>
> Spark does not seem to read the schema from the Hive metastore for 
> partitioned tables stored as ORC files. It appears to read the schema from 
> the files themselves, which, if they were created with Hive, does not match 
> the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To 
> reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by 
> (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, 
> "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in 
> Hive.
> Actual Result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database

2016-06-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313193#comment-15313193
 ] 

Xin Wu commented on SPARK-15710:


Hmm, after another rebase of the master branch, it seems that the problem is 
gone, even for pyspark:

{code}
>>> spark.sql("CREATE DATABASE IF NOT EXISTS test2")
16/06/02 15:16:10 WARN ObjectStore: Failed to get database test2, returning 
NoSuchObjectException
DataFrame[]
>>> spark.sql("USE test2")
DataFrame[]
>>> df = spark.createDataFrame([
... (0, "a", 10),
... (1, "b", 11),
... (2, "c", 12),
... (3, "a", 14),
... (4, "a", 17),
... (5, "c", 18)
... ], ["id", "category", "age"])
>>> df.write.saveAsTable('test6', mode='overwrite')
Jun 2, 2016 3:14:01 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total 
allocation exceeds 95.00% (906,992,000 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
Jun 2, 2016 3:16:43 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parq16/06/02 15:16:43 WARN HiveMetaStore: Location: 
file:/Users/xinwu/spark/spark-warehouse/test2.db/test6 specified for 
non-external table:test6
>>> spark.sql("SELECT * FROM test6 WHERE id = 2").take(1)
[Row(id=2, category=u'c', age=12)]
>>> spark.sql("SELECT * FROM test6 WHERE id = 2").show()
+---+--------+---+
| id|category|age|
+---+--------+---+
|  2|       c| 12|
+---+--------+---+
{code}

> Exception with WHERE clause in SQL for non-default Hive database
> 
>
> Key: SPARK-15710
> URL: https://issues.apache.org/jira/browse/SPARK-15710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: databricks community edition 2.0 preview
>Reporter: Igor Fridman
>
> The following code throws an exception only with non-default database. If I 
> use 'default' database it works.
> {code}
> spark.sql("CREATE DATABASE IF NOT EXISTS test")
> spark.sql("USE test")
> df = spark.createDataFrame([
> (0, "a", 10),
> (1, "b", 11),
> (2, "c", 12),
> (3, "a", 14),
> (4, "a", 17),
> (5, "c", 18)
> ], ["id", "category", "age"])
> df.write.saveAsTable('test', mode='overwrite')
> spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> {code}
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  13 df.write.saveAsTable('test', mode='overwrite')
>  14 
> ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 333 with SCCallSiteSync(self._sc) as css:
> 334 port = 
> self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
> --> 335 self._jdf, num)
> 336 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 337 
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 931 answer = self.gateway_client.send_command(command)
> 932 return_value = get_return_value(
> --> 933 answer, self.gateway_client, self.target_id, self.name)
> 934 
> 935 for temp_arg in temp_args:
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 310 raise Py4JJavaError(
> 311 "An error occurred while calling {0}{1}{2}.\n".
> --> 312 format(target_id, ".", name), value)
> 313 else:
> 314 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
> : java.lang.ClassNotFoundException: 
> org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:264)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321)
>   at 
> 

[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database

2016-06-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312851#comment-15312851
 ] 

Xin Wu commented on SPARK-15710:


I see. pyspark does not work.

> Exception with WHERE clause in SQL for non-default Hive database
> 
>
> Key: SPARK-15710
> URL: https://issues.apache.org/jira/browse/SPARK-15710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: databricks community edition 2.0 preview
>Reporter: Igor Fridman
>
> The following code throws an exception only with non-default database. If I 
> use 'default' database it works.
> {code}
> spark.sql("CREATE DATABASE IF NOT EXISTS test")
> spark.sql("USE test")
> df = spark.createDataFrame([
> (0, "a", 10),
> (1, "b", 11),
> (2, "c", 12),
> (3, "a", 14),
> (4, "a", 17),
> (5, "c", 18)
> ], ["id", "category", "age"])
> df.write.saveAsTable('test', mode='overwrite')
> spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> {code}
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  13 df.write.saveAsTable('test', mode='overwrite')
>  14 
> ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 333 with SCCallSiteSync(self._sc) as css:
> 334 port = 
> self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
> --> 335 self._jdf, num)
> 336 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 337 
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 931 answer = self.gateway_client.send_command(command)
> 932 return_value = get_return_value(
> --> 933 answer, self.gateway_client, self.target_id, self.name)
> 934 
> 935 for temp_arg in temp_args:
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 310 raise Py4JJavaError(
> 311 "An error occurred while calling {0}{1}{2}.\n".
> --> 312 format(target_id, ".", name), value)
> 313 else:
> 314 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
> : java.lang.ClassNotFoundException: 
> org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:264)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap(ParquetFilters.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.createFilter(ParquetFilters.scala:231)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$7.apply(ParquetFileFormat.scala:309)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$7.apply(ParquetFileFormat.scala:309)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReader(ParquetFileFormat.scala:309)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:268)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:112)

[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database

2016-06-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312836#comment-15312836
 ] 

Xin Wu commented on SPARK-15710:


Hmm, I cannot recreate it on the latest master branch. Here are my steps:
{code}
scala> spark.sql("create database if not exists test")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("use test")
res1: org.apache.spark.sql.DataFrame = []

scala> case class AgeData(id: Int, category: String,  age: Int)
defined class AgeData

scala> val ds = spark.createDataFrame( Seq(AgeData(0, "a", 10), AgeData(1, "b", 
11), AgeData(2, "c", 12))) 
ds: org.apache.spark.sql.DataFrame = [id: int, category: string ... 1 more 
field]

scala> ds.show
+---+--------+---+
| id|category|age|
+---+--------+---+
|  0|       a| 10|
|  1|       b| 11|
|  2|       c| 12|
+---+--------+---+

scala> 
ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable("test2")

scala> spark.sql("select * from test2 where id = 2").show
+---+--------+---+
| id|category|age|
+---+--------+---+
|  2|       c| 12|
+---+--------+---+


scala> spark.sql("select * from test2 where id = 2").take(1)
res9: Array[org.apache.spark.sql.Row] = Array([2,c,12])
{code}

> Exception with WHERE clause in SQL for non-default Hive database
> 
>
> Key: SPARK-15710
> URL: https://issues.apache.org/jira/browse/SPARK-15710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: databricks community edition 2.0 preview
>Reporter: Igor Fridman
>
> The following code throws an exception only with non-default database. If I 
> use 'default' database it works.
> {code}
> spark.sql("CREATE DATABASE IF NOT EXISTS test")
> spark.sql("USE test")
> df = spark.createDataFrame([
> (0, "a", 10),
> (1, "b", 11),
> (2, "c", 12),
> (3, "a", 14),
> (4, "a", 17),
> (5, "c", 18)
> ], ["id", "category", "age"])
> df.write.saveAsTable('test', mode='overwrite')
> spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> {code}
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  13 df.write.saveAsTable('test', mode='overwrite')
>  14 
> ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1)
> /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num)
> 333 with SCCallSiteSync(self._sc) as css:
> 334 port = 
> self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe(
> --> 335 self._jdf, num)
> 336 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 337 
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 931 answer = self.gateway_client.send_command(command)
> 932 return_value = get_return_value(
> --> 933 answer, self.gateway_client, self.target_id, self.name)
> 934 
> 935 for temp_arg in temp_args:
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 310 raise Py4JJavaError(
> 311 "An error occurred while calling {0}{1}{2}.\n".
> --> 312 format(target_id, ".", name), value)
> 313 else:
> 314 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
> : java.lang.ClassNotFoundException: 
> org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:264)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap(ParquetFilters.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.createFilter(ParquetFilters.scala:231)
>   at 
> 

[jira] [Commented] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-06-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311409#comment-15311409
 ] 

Xin Wu commented on SPARK-14959:


I can recreate the problem with an HDFS location, and I have a patch for it 
now. I will submit a PR soon. 

The actual results are now as follows, as expected:
{code}
scala> 
spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show
+-----+---+
| text| id|
+-----+---+
|hello|  0|
|world|  0|
|hello|  1|
|there|  1|
+-----+---+

   
spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show
+-----+---+
| text| id|
+-----+---+
|hello|  0|
|world|  0|
|hello|  1|
|there|  1|
+-----+---+
{code}

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Blocker
>
> Hello,
> I have noticed that in the pasts days there is an issue when trying to read 
> partitioned files from HDFS.
> I am running on Spark master branch #c544356
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> 

[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel

2016-05-31 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-15681:
---
Description: 
Currently SparkContext API setLogLevel(level: String) can not handle lower case 
or mixed case input string. But org.apache.log4j.Level.toLevel can take 
lowercase or mixed case. 


  was:
Currently SparkContext API setLogLevel(level: String) can not handle lower case 
or mixed case input string. But org.apache.log4j.Level.toLevel can take 
lowercase or mixed case. 

Also resetLogLevel to original configuration could be helpful for users to 
switch  log level for different diagnostic purposes.




> Allow case-insensitiveness in sc.setLogLevel
> 
>
> Key: SPARK-15681
> URL: https://issues.apache.org/jira/browse/SPARK-15681
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Priority: Minor
>
> Currently SparkContext API setLogLevel(level: String) can not handle lower 
> case or mixed case input string. But org.apache.log4j.Level.toLevel can take 
> lowercase or mixed case. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel

2016-05-31 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-15681:
---
Summary: Allow case-insensitiveness in sc.setLogLevel  (was: Allow 
case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel)

> Allow case-insensitiveness in sc.setLogLevel
> 
>
> Key: SPARK-15681
> URL: https://issues.apache.org/jira/browse/SPARK-15681
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Priority: Minor
>
> Currently SparkContext API setLogLevel(level: String) can not handle lower 
> case or mixed case input string. But org.apache.log4j.Level.toLevel can take 
> lowercase or mixed case. 
> Also resetLogLevel to original configuration could be helpful for users to 
> switch  log level for different diagnostic purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel

2016-05-31 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15681:
--

 Summary: Allow case-insensitiveness in sc.setLogLevel and support 
sc.resetLogLevel
 Key: SPARK-15681
 URL: https://issues.apache.org/jira/browse/SPARK-15681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Xin Wu


Currently the SparkContext API setLogLevel(level: String) cannot handle a 
lowercase or mixed-case input string, even though 
org.apache.log4j.Level.toLevel accepts lowercase and mixed case. 

Also, a resetLogLevel back to the original configuration could be helpful for 
users who switch log levels for different diagnostic purposes.
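
A quick sketch of the gap, assuming a running spark-shell with its SparkContext bound to 
{{sc}}:
{code}
import org.apache.log4j.Level

// log4j itself is case-insensitive and resolves "warn" to WARN:
Level.toLevel("warn")      // Level.WARN

// but the SparkContext API currently only accepts the upper-case names:
sc.setLogLevel("WARN")     // ok
sc.setLogLevel("warn")     // rejected today; should be accepted after this fix
{code}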





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-05-27 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14361:
---
Issue Type: New Feature  (was: Improvement)

> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the exclusion clause in Window 
> function framing, which is part of ANSI SQL2003's Window syntax. For example, 
> IBM Netezza fully supports it (see 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).
>  We propose to implement it in this JIRA. 
> The ANSI SQL2003's Window Syntax:
> {code}
> FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
> window_specification ::= [window_name] [partitioning] [ordering] [framing]
> partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
> ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
> rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
> framing ::= {ROWS | RANGE} {start | between} [exclusion]
> start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
> between ::= BETWEEN bound AND bound
> bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
> exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
> NO OTHERS}
> {code}
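
A sketch of how the proposed exclusion would read in a query. The table sales(id, ts, 
amount) is hypothetical, and the EXCLUDE CURRENT ROW clause is the ANSI target syntax 
that Spark SQL does not accept yet:
{code}
// Running sum over a +/- 2 row window that ignores the current row itself.
// EXCLUDE CURRENT ROW is the part this JIRA proposes; it is not parsed today.
spark.sql("""
  SELECT id, ts, amount,
         SUM(amount) OVER (
           PARTITION BY id
           ORDER BY ts
           ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE CURRENT ROW
         ) AS neighbor_sum
  FROM sales
""").show()
{code}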



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in RDD[String]

2016-05-24 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299167#comment-15299167
 ] 

Xin Wu commented on SPARK-15463:


I am looking into this. 

> Support for creating a dataframe from CSV in RDD[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>
> I currently use Databrick's spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].
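
One possible shape for this, sketched by analogy with the existing json(rdd) path. The 
csv(Dataset[String]) overload used below is hypothetical at the time of writing:
{code}
import spark.implicits._

// Pre-processed CSV lines already sitting in an RDD[String].
val lines = spark.sparkContext.parallelize(Seq("1,hello", "2,world"))

// Hypothetical entry point mirroring spark.read.json(rdd): hand the lines to
// the CSV reader as a Dataset[String] and let it parse / infer from there.
val df = spark.read
  .option("inferSchema", "true")
  .csv(lines.toDS())

df.show()
{code}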



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively

2016-05-19 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15431:
--

 Summary: Support LIST FILE(s)|JAR(s) command natively
 Key: SPARK-15431
 URL: https://issues.apache.org/jira/browse/SPARK-15431
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


Currently the command "ADD FILE|JAR " is supported natively in SparkSQL. 
However, when this command is run, the file/jar is added to resources that 
cannot be looked up with a "LIST FILE(s)|JAR(s)" command, because the LIST 
command is either passed to the Hive command processor (in spark-sql) or simply 
not supported (in spark-shell). There is no way for users to find out which 
files/jars have been added to the Spark context. 
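
A sketch of the intended usage. The jar path is just a placeholder, and the LIST 
commands are the part this JIRA proposes to support natively:
{code}
// ADD already works natively:
spark.sql("ADD JAR /tmp/my-udfs.jar")

// Proposed: make LIST work natively too, so users can see what has been added
// to the Spark context instead of the command being forwarded to Hive or
// rejected outright.
spark.sql("LIST JARS").show(false)
spark.sql("LIST FILES").show(false)
{code}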



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15236) No way to disable Hive support in REPL

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282297#comment-15282297
 ] 

Xin Wu commented on SPARK-15236:


i am looking into this

> No way to disable Hive support in REPL
> --
>
> Key: SPARK-15236
> URL: https://issues.apache.org/jira/browse/SPARK-15236
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> If you built Spark with Hive classes, there's no switch to flip to start a 
> new `spark-shell` using the InMemoryCatalog. The only thing you can do now is 
> to rebuild Spark again. That is quite inconvenient.
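
One way such a switch is typically exposed is a configuration flag chosen at launch
time. The sketch below assumes a spark.sql.catalogImplementation setting with values
hive / in-memory; the exact key and values are an assumption here, pending the fix for
this issue:

{code}
# start a REPL that uses the InMemoryCatalog even in a Hive-enabled build
./bin/spark-shell --conf spark.sql.catalogImplementation=in-memory
{code}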



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15269) Creating external table leaves empty directory under warehouse directory

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282215#comment-15282215
 ] 

Xin Wu edited comment on SPARK-15269 at 5/12/16 11:04 PM:
--

FYI: 
The reason the default database paths obtained in different ways differ, as 
mentioned above, is that I had an older metastore_db in my SPARK_HOME, whose 
metastore database kept the old hive.metastore.warehouse.dir value 
(/user/hive/warehouse). After I removed this metastore_db, the database paths 
are consistent now. 

Testing the fix for #2 now. Will submit a PR soon. 


was (Author: xwu0226):
FYI: 
The reason the default database paths obtained in different ways differ, as 
mentioned above, is that I had an older metastore_db in my SPARK_HOME, whose 
metastore database kept the old hive.metastore.warehouse.dir value 
(/user/hive/warehouse). After I removed this metastore_db, the database paths 
are consistent now. 

> Creating external table leaves empty directory under warehouse directory
> 
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)

[jira] [Commented] (SPARK-15269) Creating external table leaves empty directory under warehouse directory

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282215#comment-15282215
 ] 

Xin Wu commented on SPARK-15269:


FYI: 
The reason the default database paths obtained in different ways differ, as 
mentioned above, is that I had an older metastore_db in my SPARK_HOME, whose 
metastore database kept the old hive.metastore.warehouse.dir value 
(/user/hive/warehouse). After I removed this metastore_db, the database paths 
are consistent now. 

> Creating external table leaves empty directory under warehouse directory
> 
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
> [info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> 

[jira] [Commented] (SPARK-15269) Creating external table leaves empty directory under warehouse directory

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281748#comment-15281748
 ] 

Xin Wu commented on SPARK-15269:


Yes, I can. Thanks!

> Creating external table leaves empty directory under warehouse directory
> 
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
> [info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> 

[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281621#comment-15281621
 ] 

Xin Wu edited comment on SPARK-15269 at 5/12/16 3:37 PM:
-

For the case where we cannot reproduce this issue, it is because the default 
database path we got at
{code}
if (!new CaseInsensitiveMap(options).contains("path")) {
  isExternal = false
  options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
} else {
  options
}
{code}
is different from the Hive metastore's default warehouse dir. They are 
"/user/hive/warehouse" and "/spark-warehouse", respectively. 

When creating the first table, the Hive metastore's default warehouse dir is 
"/spark-warehouse", while when creating the second table without the 
PATH option, sessionState.catalog.defaultTablePath returns 
"/user/hive/warehouse". Therefore, the second table creation does not hit the 
issue, but the first table still leaves an empty table directory behind after 
being dropped. 

Two questions:
1. Should we keep these two default database paths consistent?
2. If they are consistent, we will hit the issue reported in this JIRA. Then, 
can we also assign the provided path to CatalogTable.storage.locationURI, 
even though newSparkSQLSpecificMetastoreTable is called in 
createDataSourceTables for a non-Hive-compatible metastore table? This would 
avoid leaving the Hive metastore to pick the default path for the table. 



was (Author: xwu0226):
For the case where we cannot reproduce this issue, it is because the default 
database path we got at
{code}
if (!new CaseInsensitiveMap(options).contains("path")) {
  isExternal = false
  options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
} else {
  options
}
{code}
is different from the Hive metastore's default warehouse dir. They are 
"/user/hive/warehouse" and "/spark-warehouse", respectively. 

When creating the first table, the Hive metastore's default warehouse dir is 
"/spark-warehouse", while when creating the second table without the 
PATH option, sessionState.catalog.defaultTablePath returns 
"/user/hive/warehouse". Therefore, the second table creation does not hit the 
issue, but the first table still leaves an empty table directory behind after 
being dropped. 

Two questions:
1. Should we keep these two default database paths consistent?
2. If they are consistent, we will hit the issue reported in this JIRA. Then, 
can we also assign the provided path to CatalogTable.storage.locationURI, 
even though newSparkSQLSpecificMetastoreTable is called in 
createDataSourceTables for a non-Hive-compatible metastore table? 


> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> 

[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-12 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281621#comment-15281621
 ] 

Xin Wu commented on SPARK-15269:


For the case where we cannot reproduce this issue, it is because the default 
database path we got at
{code}
if (!new CaseInsensitiveMap(options).contains("path")) {
  isExternal = false
  options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
} else {
  options
}
{code}
is different from the Hive metastore's default warehouse dir. They are 
"/user/hive/warehouse" and "/spark-warehouse", respectively. 

When creating the first table, the Hive metastore's default warehouse dir is 
"/spark-warehouse", while when creating the second table without the 
PATH option, sessionState.catalog.defaultTablePath returns 
"/user/hive/warehouse". Therefore, the second table creation does not hit the 
issue, but the first table still leaves an empty table directory behind after 
being dropped. 

Two questions:
1. Should we keep these two default database paths consistent?
2. If they are consistent, we will hit the issue reported in this JIRA. Then, 
can we also assign the provided path to CatalogTable.storage.locationURI, 
even though newSparkSQLSpecificMetastoreTable is called in 
createDataSourceTables for a non-Hive-compatible metastore table? 


> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> 

[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877
 ] 

Xin Wu edited comment on SPARK-15269 at 5/11/16 10:13 PM:
--

The root cause may be the following:

When the first table is created as an external table with the data source path, 
but as json, createDataSourceTables considers it a non-Hive-compatible table 
because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is 
invoked to create the CatalogTable before asking HiveClient to create the 
metastore table. In this call, locationURI is not set. So when we convert the 
CatalogTable to a HiveTable before passing it to the Hive metastore, the Hive 
table's data location is not set. The Hive metastore then implicitly creates a 
data location as /tableName, which in this JIRA is 
{code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code}
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 

Then, even after dropping the table, Hive will not delete this stealth directory 
because the table is external. 

When we create the second table with SELECT and without a path, the table is 
created as a managed table, provided a default path in the options:
{code}
val optionsWithPath =
  if (!new CaseInsensitiveMap(options).contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
{code}
This default path happens to be Hive's warehouse directory plus the table 
name, which is the same as the one the Hive metastore implicitly created earlier 
for the first table. So when trying to write the provided data to this data 
source table via
{code}
val plan =
  InsertIntoHadoopFsRelation(
    outputPath,
    partitionColumns.map(UnresolvedAttribute.quoted),
    bucketSpec,
    format,
    () => Unit, // No existing table needs to be refreshed.
    options,
    data.logicalPlan,
    mode)
{code}
InsertIntoHadoopFsRelation complains that the path already exists, since the 
SaveMode is SaveMode.ErrorIfExists.
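
The failure mode described above can be seen in isolation: the default save mode for a
DataFrame write is ErrorIfExists, so any pre-existing directory at the target path, even
an empty one left behind by a dropped external table, aborts the write. A minimal sketch
in spark-shell (the path is a placeholder):

{code}
import org.apache.spark.sql.SaveMode

val path = "/tmp/spark-15269-demo"
spark.range(1).write.json(path)   // creates the directory

// a second write with the default ErrorIfExists mode fails with
// "path ... already exists", just like the CREATE TABLE ... AS SELECT above
spark.range(1).write.mode(SaveMode.ErrorIfExists).json(path)
{code}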


was (Author: xwu0226):
The root cause may be the following:

When the first table is created as an external table with the data source path, 
but as json, createDataSourceTables considers it a non-Hive-compatible table 
because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is 
invoked to create the CatalogTable before asking HiveClient to create the 
metastore table. In this call, locationURI is not set. So when we convert the 
CatalogTable to a HiveTable before passing it to the Hive metastore, the Hive 
table's data location is not set. The Hive metastore then implicitly creates a 
data location as /tableName, which in this JIRA is 
{code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code}
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 

Then, even after dropping the table, Hive will not delete this stealth directory 
because the table is external. 

> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> 

[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280888#comment-15280888
 ] 

Xin Wu commented on SPARK-15269:


In spark-shell, I can reproduce it as follows:
{code}
scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH 
'/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding 
Hive SerDe for data source provider json. Persisting data source relation 
`spark_15269` into Hive metastore in Spark SQL specific format, which is NOT 
compatible with Hive.
going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path 
file:/user/hive/warehouse/spark_15269 already exists.;
  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:418)
  at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:229)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
  ... 48 elided
{code}

> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is 

[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877
 ] 

Xin Wu edited comment on SPARK-15269 at 5/11/16 9:47 PM:
-

The root cause may be the following:

When the first table is created as an external table with the data source path, 
but as json, createDataSourceTables considers it a non-Hive-compatible table 
because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is 
invoked to create the CatalogTable before asking HiveClient to create the 
metastore table. In this call, locationURI is not set. So when we convert the 
CatalogTable to a HiveTable before passing it to the Hive metastore, the Hive 
table's data location is not set. The Hive metastore then implicitly creates a 
data location as /tableName, which in this JIRA is 
{code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code}
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 

Then, even after dropping the table, Hive will not delete this stealth directory 
because the table is external. 


was (Author: xwu0226):
The root cause may be the following:

When the first table is created as an external table with the data source path, 
but as `json`, `createDataSourceTables` considers it a non-Hive-compatible 
table because `json` is not a Hive SerDe. Then, 
`newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` 
before asking `HiveClient` to create the metastore table. In this call, 
`locationURI` is not set. So when we convert the CatalogTable to a HiveTable 
before passing it to the Hive metastore, the Hive table's data location is not 
set. The Hive metastore then implicitly creates a data location as `/tableName`, 
which in this JIRA is 
`/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1`. 
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 

Then, even after dropping the table, Hive will not delete this stealth directory 
because the table is external. 

> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> 

[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory

2016-05-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877
 ] 

Xin Wu commented on SPARK-15269:


The root cause may be the following:

When the first table is created as an external table with the data source path, 
but as `json`, `createDataSourceTables` considers it a non-Hive-compatible 
table because `json` is not a Hive SerDe. Then, 
`newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` 
before asking `HiveClient` to create the metastore table. In this call, 
`locationURI` is not set. So when we convert the CatalogTable to a HiveTable 
before passing it to the Hive metastore, the Hive table's data location is not 
set. The Hive metastore then implicitly creates a data location as `/tableName`, 
which in this JIRA is 
`/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1`. 
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 

Then, even after dropping the table, Hive will not delete this stealth directory 
because the table is external. 

> Creating external table in test code leaves empty directory under warehouse 
> directory
> -
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> 

[jira] [Updated] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause

2016-05-08 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-15206:
---
Issue Type: Test  (was: Bug)

> Add testcases for Distinct Aggregation in Having clause
> ---
>
> Key: SPARK-15206
> URL: https://issues.apache.org/jira/browse/SPARK-15206
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This is the followup jira for https://github.com/apache/spark/pull/12974. We 
> will add test cases for including distinct aggregate function in having 
> clause. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause

2016-05-07 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15206:
--

 Summary: Add testcases for Distinct Aggregation in Having clause
 Key: SPARK-15206
 URL: https://issues.apache.org/jira/browse/SPARK-15206
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


This is the followup jira for https://github.com/apache/spark/pull/12974. We 
will add test cases for including distinct aggregate function in having clause. 
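
For illustration, the kind of query such a test case would exercise (taken from the
related SPARK-14495 report) looks like this in spark-shell:

{code}
spark.sql(
  """select date, count(distinct id)
    |from (select '2010-01-01' as date, 1 as id) tmp
    |group by date
    |having count(distinct id) > 0""".stripMargin).show()
{code}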



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271956#comment-15271956
 ] 

Xin Wu edited comment on SPARK-14495 at 5/5/16 6:25 AM:


[~smilegator] I got the fix and am running the regtests now. Will submit the PR 
once it is done. 


was (Author: xwu0226):
[~smilegator] I got the fix and running regtest now. Will submit the PR one it 
is done. 

> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271956#comment-15271956
 ] 

Xin Wu commented on SPARK-14495:


[~smilegator] I got the fix and am running the regtests now. Will submit the PR 
once it is done. 

> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268071#comment-15268071
 ] 

Xin Wu commented on SPARK-15044:


Sorry. What I meant was that after I removed the path manually and then ran the 
ALTER TABLE ... DROP PARTITION command in Spark SQL, I can then do the SELECT.
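
In other words, the sequence that works is roughly the following (table and partition
names are taken from the issue description):

{code}
// the partition path under the warehouse directory was already removed
// manually with hadoop fs -rmr before running these commands
spark.sql("ALTER TABLE test DROP IF EXISTS PARTITION (p='1')")
spark.sql("SELECT n FROM test WHERE p='1'").show()   // no longer fails
{code}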

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark sql: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (or as it did in 
> earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-05-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267576#comment-15267576
 ] 

Xin Wu edited comment on SPARK-14927 at 5/2/16 10:01 PM:
-

Right now, when a datasource table is created with partitions, it is not a Hive-compatible table. 

So you may need to create the table explicitly, like {code}create table tmp.tmp1 (val string) partitioned by (year int) stored as parquet location '' {code}
Then insert into the table from a temp table derived from the dataframe. Something I tried is below: 
{code}
scala> df.show
++---+
|year|val|
++---+
|2012|  a|
|2013|  b|
|2014|  c|
++---+

scala> val df1 = spark.sql("select * from t000 where year = 2012")
df1: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df1.registerTempTable("df1")

scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from 
df1")

scala> val df2 = spark.sql("select * from t000 where year = 2013")
df2: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df2.registerTempTable("df2")

scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from 
df2")
16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3
16/05/02 14:47:34 WARN log: Updated size to 327
res54: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show partitions tmp.ptest3").show
+-+
|   result|
+-+
|year=2012|
|year=2013|
+-+

{code}

This is a bit hacky though; there should be a better solution for your problem. 
Also, this is on Spark 2.0, so check whether 1.6 accepts it as well. 


was (Author: xwu0226):
right now, when a datasource table is created with partition, it is not a hive 
compatiable table. 

So maybe need to create the table like {code}create table tmp.tmp1 (val string) 
partitioned by (year int) stored as parquet location '' {code}
Then insert into the table with a temp table that is derived from the 
dataframe. Something I tried below 
{code}
scala> df.show
++---+
|year|val|
++---+
|2012|  a|
|2013|  b|
|2014|  c|
++---+

scala> val df1 = spark.sql("select * from t000 where year = 2012")
df1: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df1.registerTempTable("df1")

scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from 
df1")

scala> val df2 = spark.sql("select * from t000 where year = 2013")
df2: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df2.registerTempTable("df2")

scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from 
df2")
16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3
16/05/02 14:47:34 WARN log: Updated size to 327
res54: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show partitions tmp.ptest3").show
+-+
|   result|
+-+
|year=2012|
|year=2013|
+-+

{code}

This is a bit hacky though. hope someone has a better solution for your 
problem. And this is on spark 2.0.  Try if 1.6 can take this. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.

[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-05-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267576#comment-15267576
 ] 

Xin Wu commented on SPARK-14927:


Right now, when a datasource table is created with partitions, it is not a Hive-compatible table. 

So you may need to create the table explicitly, like {code}create table tmp.tmp1 (val string) partitioned by (year int) stored as parquet location '' {code}
Then insert into the table from a temp table derived from the dataframe. Something I tried is below: 
{code}
scala> df.show
++---+
|year|val|
++---+
|2012|  a|
|2013|  b|
|2014|  c|
++---+

scala> val df1 = spark.sql("select * from t000 where year = 2012")
df1: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df1.registerTempTable("df1")

scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from 
df1")

scala> val df2 = spark.sql("select * from t000 where year = 2013")
df2: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df2.registerTempTable("df2")

scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from 
df2")
16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3
16/05/02 14:47:34 WARN log: Updated size to 327
res54: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show partitions tmp.ptest3").show
+-+
|   result|
+-+
|year=2012|
|year=2013|
+-+

{code}

This is a bit hacky though; hope someone has a better solution for your 
problem. Also, this is on Spark 2.0, so check whether 1.6 accepts it as well. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266116#comment-15266116
 ] 

Xin Wu commented on SPARK-15044:


I tried {code}alter table test drop partition (p=1){code} , then the select 
will return 0 rows without exception. 

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table but whose path has been removed manually. The 
> situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does or as Spark did in earlier 
> versions, rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069
 ] 

Xin Wu edited comment on SPARK-14495 at 5/2/16 2:21 AM:


I can recreate it on branch-1.6. Another workaround is using an alias for the 
aggregate expression:
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+--+---+
|  date|cnt|
+--+---+
|2010-01-01|  1|
+--+---+
{code}




was (Author: xwu0226):
I can recreated it on branch-1.6. and another workaround is using alias for the 
aggregate expression
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+--+---+
|  date|cnt|
+--+---+
|2010-01-01|  1|
+--+---+
{code}



> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause

2016-05-01 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069
 ] 

Xin Wu commented on SPARK-14495:


I can recreate it on branch-1.6. Another workaround is using an alias for the 
aggregate expression:
{code}
scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select 
'2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show
+--+---+
|  date|cnt|
+--+---+
|2010-01-01|  1|
+--+---+
{code}



> Distinct aggregation cannot be used in the having clause
> 
>
> Key: SPARK-14495
> URL: https://issues.apache.org/jira/browse/SPARK-14495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if 
> ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], 
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-04-30 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265591#comment-15265591
 ] 

Xin Wu commented on SPARK-14927:


Since Spark 2.0.0 has moved a lot of code around, including splitting 
HiveMetaStoreCatalog into two files for resolving and creating tables, 
respectively, I would try this on Spark 2.0.0. 

{code}scala> spark.sql("create database if not exists tmp")
16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning 
NoSuchObjectException
res23: org.apache.spark.sql.DataFrame = []

scala> 
df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1")
16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data 
source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, 
which is NOT compatible with Hive. Input path(s): 
file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1

scala> spark.sql("select * from tmp.tmp1").show
+---++
|val|year|
+---++
|  a|2012|
+---++
{code}

For a datasource table created as above, SparkSQL creates the table as a 
Hive-internal table that is not Hive compatible. SparkSQL puts the partition 
column information (along with other things like the column schema and 
bucket/sort columns) into serdeInfo.parameters. When querying the table, 
SparkSQL resolves the table and parses the information back out of 
serdeInfo.parameters. 

Spark 2.0.0 no longer passes this command to Hive (most DDL 
commands are run natively in SparkSQL now), so "SHOW PARTITIONS ..." 
does not support showing partitions for a datasource table. 

{code}
scala> spark.sql("show partitions tmp.tmp1").show
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a 
datasource table: tmp.tmp1;
  at 
org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at org.apache.spark.sql.Dataset.(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
  ... 48 elided
{code}
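If all you need are the partition values of such a data source table, a simple workaround (just a sketch; this scans the data instead of reading Hive partition metadata) is to query the partition column directly:
{code}
// returns the partition values by scanning the table (just 2012 for the example above)
scala> spark.sql("select distinct year from tmp.tmp1").show()
{code}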

Hope this helps. 

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from `DataFrame. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> 

[jira] [Commented] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties

2016-04-29 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265127#comment-15265127
 ] 

Xin Wu commented on SPARK-15025:


I am testing a fix for this and will submit a PR soon. 

> creating datasource table with option (PATH) results in duplicate path key in 
> serdeProperties
> -
>
> Key: SPARK-15025
> URL: https://issues.apache.org/jira/browse/SPARK-15025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Repro:
> {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as 
> a, 2 as b{code}
> This will create a Hive external table whose dataLocation is 
> "/someDefaultPath", which is not the same as the provided one. Yet 
> serdeInfo.parameters contains the following key-value pairs: 
> PATH, "/tmp/t1"
> path, "/someDefaultPath"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties

2016-04-29 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15025:
--

 Summary: creating datasource table with option (PATH) results in 
duplicate path key in serdeProperties
 Key: SPARK-15025
 URL: https://issues.apache.org/jira/browse/SPARK-15025
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


Repro:

{code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as a, 
2 as b{code}

This will create a Hive external table whose dataLocation is 
"/someDefaultPath", which is not the same as the provided one. Yet 
serdeInfo.parameters contains the following key-value pairs: 
PATH, "/tmp/t1"
path, "/someDefaultPath"






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14933) Failed to create view out of a parquet or orc table

2016-04-26 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259084#comment-15259084
 ] 

Xin Wu commented on SPARK-14933:


I have a fix for this and will submit a PR soon.

> Failed to create view out of a parquet or orc table 
> 
>
> Key: SPARK-14933
> URL: https://issues.apache.org/jira/browse/SPARK-14933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Priority: Critical
> Fix For: 2.0.0
>
>
> When I create a table as parquet or orc with the following DDL:
> {code}
> create table t1 (c1 int, c2 string) stored as parquet;
> create table t2 (c1 int, c2 string) stored as orc;
> {code}
> Then, do:
> {code}create view v1 as select * from t1;{code}
> The view creation fails because of following error:
> {code}
> Caused by: java.lang.UnsupportedOperationException: unsupported plan 
> Relation[c1#66,c2#67] HadoopFiles
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:191)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:81)
>   at 
> org.apache.spark.sql.catalyst.LogicalPlanToSQLSuite.org$apache$spark$sql$catalyst$LogicalPlanToSQLSuite$$checkHiveQl(LogicalPlanToSQLSuite.scala:82)
>   ... 57 more
> {code}
> The error actually happens in the code path that converts the LogicalPlan to SQL for 
> the LogicalRelation of the HadoopFsRelation (t1).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14933) Failed to create view out of a parquet or orc table

2016-04-26 Thread Xin Wu (JIRA)
Xin Wu created SPARK-14933:
--

 Summary: Failed to create view out of a parquet or orc table 
 Key: SPARK-14933
 URL: https://issues.apache.org/jira/browse/SPARK-14933
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu
Priority: Critical
 Fix For: 2.0.0


When I create a table as parquet or orc with the following DDL:
{code}
create table t1 (c1 int, c2 string) stored as parquet;
create table t2 (c1 int, c2 string) stored as orc;
{code}

Then, do:
{code}create view v1 as select * from t1;{code}

The view creation fails because of following error:
{code}
Caused by: java.lang.UnsupportedOperationException: unsupported plan 
Relation[c1#66,c2#67] HadoopFiles

at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:191)
at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149)
at 
org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111)
at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149)
at 
org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208)
at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111)
at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:81)
at 
org.apache.spark.sql.catalyst.LogicalPlanToSQLSuite.org$apache$spark$sql$catalyst$LogicalPlanToSQLSuite$$checkHiveQl(LogicalPlanToSQLSuite.scala:82)
... 57 more
{code}
The error actually happens in the code path that converts the LogicalPlan to SQL for the 
LogicalRelation of the HadoopFsRelation (t1).
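
A minimal spark-shell reproduction, assuming a Hive-enabled SparkSession (same DDL as above):
{code}
scala> spark.sql("create table t1 (c1 int, c2 string) stored as parquet")

// fails with the UnsupportedOperationException shown above while converting the plan back to SQL
scala> spark.sql("create view v1 as select * from t1")
{code}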



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-04 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14361:
---
Description: 
The current Spark SQL does not support the exclusion clause in Window function 
framing, which is part of ANSI SQL2003's Window syntax. For example, IBM 
Netezza fully supports it, as shown in 
https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
 We propose to implement it in this JIRA. 

The ANSI SQL2003's Window Syntax:
{code}
FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
window_specification ::= [window_name] [partitioning] [ordering] [framing]
partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
framing ::= {ROWS | RANGE} {start | between} [exclusion]
start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
between ::= BETWEEN bound AND bound
bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO 
OTHERS}
{code}

  was:
The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003's `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link](https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). This PR is to fill the gap. 



> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the exclusion clause in Window 
> function framing, which is part of ANSI SQL2003's Window syntax. For example, 
> IBM Netezza fully supports it, as shown in 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
>  We propose to implement it in this JIRA. 
> The ANSI SQL2003's Window Syntax:
> {code}
> FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
> window_specification ::= [window_name] [partitioning] [ordering] [framing]
> partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
> ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
> rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
> framing ::= {ROWS | RANGE} {start | between} [exclusion]
> start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
> between ::= BETWEEN bound AND bound
> bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
> exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
> NO OTHERS}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-04 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223702#comment-15223702
 ] 

Xin Wu edited comment on SPARK-14361 at 4/4/16 6:06 AM:


[~hvanhovell] Since you coded the whole window function, I would like to have 
you take a look at the PR proposal. I will submit a PR soon. 


was (Author: xwu0226):
[~smilegator][~dkbiswal]

> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the *exclude* clause in the Window 
> function framing clause, which is part of ANSI SQL2003's Window syntax. For 
> example, IBM Netezza fully supports it, as shown in 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
>  We propose to implement it in this JIRA.
> The ANSI SQL2003's Window syntax:
> {code}
> FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
> window_specification ::= [window_name] [partitioning] [ordering] [framing]
> partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
> ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
> rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
> framing ::= {ROWS | RANGE} {start | between} [exclusion]
> start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
> between ::= BETWEEN bound AND bound
> bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
> exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
> NO OTHERS}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-04 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223702#comment-15223702
 ] 

Xin Wu commented on SPARK-14361:


[~smilegator][~dkbiswal]

> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the *exclude* clause in the Window 
> function framing clause, which is part of ANSI SQL2003's Window syntax. For 
> example, IBM Netezza fully supports it, as shown in 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
>  We propose to implement it in this JIRA.
> The ANSI SQL2003's Window syntax:
> {code}
> FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
> window_specification ::= [window_name] [partitioning] [ordering] [framing]
> partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
> ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
> rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
> framing ::= {ROWS | RANGE} {start | between} [exclusion]
> start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
> between ::= BETWEEN bound AND bound
> bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
> exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
> NO OTHERS}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-04 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14361:
---
Description: 
The current Spark SQL does not support the *exclude* clause in the Window function 
framing clause, which is part of ANSI SQL2003's Window syntax. For example, IBM 
Netezza fully supports it, as shown in 
https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
 We propose to implement it in this JIRA.

The ANSI SQL2003's Window syntax:
{code}
FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
window_specification ::= [window_name] [partitioning] [ordering] [framing]
partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
framing ::= {ROWS | RANGE} {start | between} [exclusion]
start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
between ::= BETWEEN bound AND bound
bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO 
OTHERS}
{code}


  was:The current Spark SQL does not support the {code}exclude{code} clause in 
Window function framing clause, which is part of ANSI SQL2003's 


> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the *exclude* clause in the Window 
> function framing clause, which is part of ANSI SQL2003's Window syntax. For 
> example, IBM Netezza fully supports it, as shown in 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html.
>  We propose to implement it in this JIRA.
> The ANSI SQL2003's Window syntax:
> {code}
> FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
> window_specification ::= [window_name] [partitioning] [ordering] [framing]
> partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
> ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
> rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
> framing ::= {ROWS | RANGE} {start | between} [exclusion]
> start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
> between ::= BETWEEN bound AND bound
> bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
> exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE 
> NO OTHERS}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-03 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14361:
---
Description: The current Spark SQL does not support the {code}exclude{code} clause in Window function framing clause, which is part of ANSI SQL2003's
(was: The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003's `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link](https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). We propose to support it in this JIRA. 

# Introduction

Below is the ANSI SQL2003's `Window` syntax:
```
FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
window_specification ::= [window_name] [partitioning] [ordering] [framing]
partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
framing ::= {ROWS | RANGE} {start | between} [exclusion]
start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
between ::= BETWEEN bound AND bound
bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS}
```
The exclusion clause can be used to exclude certain rows from the window frame when calculating a window aggregation function (e.g. AVG, SUM, MAX, MIN, COUNT, etc.) relative to the current row. Types of window functions for which the clause is not supported are listed below:

1. Offset functions, such as lead(), lag()
2. Ranking functions, such as rank(), dense_rank(), percent_rank(), cume_dist(), ntile()
3. Row number function, such as row_number()

# Definition
Syntax | Description
------ | -----------
EXCLUDE CURRENT ROW | Specifies excluding the current row.
EXCLUDE GROUP | Specifies excluding the current row and all rows that are tied with it. Ties occur when there is a match on the order column or columns.
EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the default if you specify no exclusion.
EXCLUDE TIES | Specifies excluding all rows that are tied with the current row (peer rows), but retaining the current row.

# Use-case Examples:

- Let's say you want to find out, for every employee, where his/her salary stands compared to the average salary of those within the same department whose ages are within 5 years younger or older. The query could be:

```SQL
SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR
OVER(PARTITION BY DEPT_ID 
     ORDER BY AGE 
     RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING 
     EXCLUDE CURRENT ROW) 
FROM EMPLOYEE
```

- Let's say you want to compare every customer's yearly purchase with the average yearly purchase of other customers who are in a different age group from the current customer. The query could be:

```SQL
SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE) 
OVER(PARTITION BY PROD_CATEGORY 
     ORDER BY AGE 
     RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
     EXCLUDE GROUP) 
FROM CUSTOMER_PURCHASE_SUM
```)

> Support EXCLUDE clause in Window function framing
> -
>
> Key: SPARK-14361
> URL: https://issues.apache.org/jira/browse/SPARK-14361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> The current Spark SQL does not support the {code}exclude{code} clause in 
> Window function framing clause, which is part of ANSI SQL2003's 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14361) Support EXCLUDE clause in Window function framing

2016-04-03 Thread Xin Wu (JIRA)
Xin Wu created SPARK-14361:
--

 Summary: Support EXCLUDE clause in Window function framing
 Key: SPARK-14361
 URL: https://issues.apache.org/jira/browse/SPARK-14361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003's `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link](https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). We propose to support it in this JIRA. 

# Introduction

Below is the ANSI SQL2003's `Window` syntax:
```
FUNCTION_NAME(expr) OVER {window_name | (window_specification)}
window_specification ::= [window_name] [partitioning] [ordering] [framing]
partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name]
ordering ::= ORDER [SIBLINGS] BY rule[, rule...]
rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}]
framing ::= {ROWS | RANGE} {start | between} [exclusion]
start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW}
between ::= BETWEEN bound AND bound
bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING}
exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS}
```
The exclusion clause can be used to exclude certain rows from the window frame when calculating a window aggregation function (e.g. AVG, SUM, MAX, MIN, COUNT, etc.) relative to the current row. Types of window functions for which the clause is not supported are listed below:

1. Offset functions, such as lead(), lag()
2. Ranking functions, such as rank(), dense_rank(), percent_rank(), cume_dist(), ntile()
3. Row number function, such as row_number()

# Definition
Syntax | Description
------ | -----------
EXCLUDE CURRENT ROW | Specifies excluding the current row.
EXCLUDE GROUP | Specifies excluding the current row and all rows that are tied with it. Ties occur when there is a match on the order column or columns.
EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the default if you specify no exclusion.
EXCLUDE TIES | Specifies excluding all rows that are tied with the current row (peer rows), but retaining the current row.

# Use-case Examples:

- Let's say you want to find out, for every employee, where his/her salary stands compared to the average salary of those within the same department whose ages are within 5 years younger or older. The query could be:

```SQL
SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR
OVER(PARTITION BY DEPT_ID 
     ORDER BY AGE 
     RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING 
     EXCLUDE CURRENT ROW) 
FROM EMPLOYEE
```

- Let's say you want to compare every customer's yearly purchase with the average yearly purchase of other customers who are in a different age group from the current customer. The query could be:

```SQL
SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE) 
OVER(PARTITION BY PROD_CATEGORY 
     ORDER BY AGE 
     RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
     EXCLUDE GROUP) 
FROM CUSTOMER_PURCHASE_SUM
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14346:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14118

> SHOW CREATE TABLE command (Native)
> --
>
> Key: SPARK-14346
> URL: https://issues.apache.org/jira/browse/SPARK-14346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This command will return a CREATE TABLE command in SQL. Right now, we just 
> throw an exception (I was not sure how often people would use it). Since it is 
> pretty standalone work (generating a CREATE TABLE command based on the 
> metadata of a table) and people may find it quite useful, I am thinking of 
> getting it into 2.0. Hive's implementation can be found at 
> https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
>  The main difference for Spark is that if we have a data source table, we 
> should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
> Hive's syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Xin Wu (JIRA)
Xin Wu created SPARK-14346:
--

 Summary: SHOW CREATE TABLE command (Native)
 Key: SPARK-14346
 URL: https://issues.apache.org/jira/browse/SPARK-14346
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


This command will return a CREATE TABLE command in SQL. Right now, we just 
throw an exception (I was not sure how often people would use it). Since it is 
pretty standalone work (generating a CREATE TABLE command based on the metadata 
of a table) and people may find it quite useful, I am thinking of getting it into 
2.0. Hive's implementation can be found at 
https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
 The main difference for Spark is that if we have a data source table, we 
should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
Hive's syntax.
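
For illustration only (this is not output from any existing implementation, and the table and column names are made up), the generated statement for a data source table would presumably take the Spark form rather than the Hive form:
{code}
CREATE TABLE t1 (a INT, b STRING)
USING parquet
OPTIONS (
  path '/tmp/t1'
)
{code}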



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-24 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210978#comment-15210978
 ] 

Xin Wu commented on SPARK-14096:


After I comment out the kryo serialization setting in SparkSQLEnv.init, which is 
used by the spark-sql console, the query returns without the NPE.  
When kryo serialization is used, the query fails when ORDER BY and LIMIT are 
combined. After removing either the ORDER BY or the LIMIT clause, the query also runs. 
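
As a quick cross-check that the serializer, not the query, is the trigger (a sketch; spark-shell keeps the default Java serializer unless it is configured otherwise):
{code}
// with the default (Java) serializer the same simplified query completes; with Kryo it hits the NPE above
scala> spark.sql("select * from item order by i_item_id limit 100").show()
{code}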

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPC-DS query 06 in the spark-sql shell, I received the following error 
> in the middle of a stage; running another query (38) succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as 

[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209748#comment-15209748
 ] 

Xin Wu commented on SPARK-13832:


[~jfc...@us.ibm.com]  For the above execution issue, I think it is a duplicate 
of SPARK-14096. You can close this JIRA and refer to SPARK-14096 for the Kryo 
exception issue.  Thanks!

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209737#comment-15209737
 ] 

Xin Wu commented on SPARK-14096:


I simplified the query to:
{code}select * from item order by i_item_id limit 100;{code}
And it fails with the following exception:{code}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at 
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
{code}
Removing either the ORDER BY or the LIMIT clause makes the query pass.
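
For reference, a minimal way to see the pattern (a sketch, assuming the TPC-DS item table is loaded in the current database) is to run the three variants side by side in spark-sql; only the combination of ORDER BY and LIMIT appears to hit the path that Kryo-serializes a BoundedPriorityQueue, the object named in the serialization trace above:
{code}
-- sketch: three variants against the TPC-DS item table
select * from item order by i_item_id limit 100;  -- fails with the Kryo NPE (ORDER BY + LIMIT)
select * from item order by i_item_id;            -- passes (ORDER BY only)
select * from item limit 100;                     -- passes (LIMIT only)
{code}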

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all 

[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209688#comment-15209688
 ] 

Xin Wu commented on SPARK-14096:


[~jfc...@us.ibm.com] Can you try the query without ORDER BY? I noticed another 
query that failed with the ORDER BY but succeeded without it.

> SPARK-SQL CLI returns NPE
> -
>
> Key: SPARK-14096
> URL: https://issues.apache.org/jira/browse/SPARK-14096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> Trying to run TPCDS query 06 in spark-sql shell received the following error 
> in the middle of a stage; but running another query 38 succeeded:
> NPE:
> {noformat}
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 
> 10.0 (TID 622) in 171 ms on localhost (30/200)
> 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting 
> task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
>   ... 15 more
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 
> 10.0 (TID 623) in 171 ms on localhost (31/200)
> 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, 
> whose tasks have all completed, from pool 
> {noformat}
> query 06 (caused the above NPE):
> {noformat}
>  select  a.ca_state state, count(*) cnt
>  from customer_address a
>  join customer c on a.ca_address_sk = c.c_current_addr_sk
>  join store_sales s on c.c_customer_sk = s.ss_customer_sk
>  join date_dim d on s.ss_sold_date_sk = d.d_date_sk
>  join item i on s.ss_item_sk = i.i_item_sk
>  join (select distinct d_month_seq
> from date_dim
>where d_year = 2001
>   and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq
>  join
>   (select j.i_category, avg(j.i_current_price) as avg_i_current_price
>from item j group by j.i_category) tmp2 on tmp2.i_category = 
> i.i_category
>  where  
>   i.i_current_price > 1.2 * 

[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209687#comment-15209687
 ] 

Xin Wu commented on SPARK-13832:


The analysis issue reported in this JIRA is resolved in Spark 2.0. As for the 
Kryo exception during execution, the query returns once the ORDER BY is removed, 
so I am also looking into why the ORDER BY clause triggers this Kryo exception.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-23 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209220#comment-15209220
 ] 

Xin Wu commented on SPARK-13832:


[~jfc...@us.ibm.com] I think when using grouping_id(), you need to pass in all 
the columns that are in the GROUP BY clause. In this case, it should be 
grouping_id(i_category, i_class). The result is like concatenating the results 
of grouping() into a bit vector (a string of ones and zeros), i.e. 
grouping(i_category) followed by grouping(i_class).

So {code}grouping_id(i_category)+grouping_id(i_class){code} is not correct. 
After I changed it to {code}grouping_id(i_category, i_class){code}, the query 
returns results for the text data files. 
I am trying the parquet files now.
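
For reference, a minimal sketch of the two forms discussed above (assuming the TPC-DS item table; column names as in query 36):
{code}
-- not correct: grouping_id called per column and summed (kept as a comment only)
-- select i_category, i_class, grouping_id(i_category) + grouping_id(i_class) as lochierarchy ...

-- correct: one grouping_id call listing every column of the GROUP BY ... WITH ROLLUP
select i_category,
       i_class,
       grouping_id(i_category, i_class) as lochierarchy   -- bit vector over the grouped columns
  from item
 group by i_category, i_class with rollup;
{code}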



> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-20 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200053#comment-15200053
 ] 

Xin Wu commented on SPARK-13863:


Jesse, after I modified the DDL to use "decimal(7,2)" for the "double" columns, 
as documented in the TPC-DS specs, the query returns the following results from 
both Hive and Spark SQL:

Spark SQL:
{code}
NULLNULLFairviewWilliamson County   TN  United States   
DHL,BARIAN  20019597806.95  11121820.57 8670867.91  
8994786.04  10887248.09 14187671.36 9732598.41  19798897.07 
21007842.34 21495513.67 34795669.17 33122997.94 NULLNULL
NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL
21913594.59 32518476.51 24885662.72 25698343.86 33735910.61 
35527031.58 25465193.48 53623238.66 51409986.76 54159173.9  
92227043.25 83435390.84
Bad cards must make.621234  FairviewWilliamson County   TN  
United States   DHL,BARIAN  20019506753.46  8008140.33  
6116769.63  11973045.15 7756254.92  5352978.49  13733996.1  
16418794.37 17212743.32 17042707.41 34304935.61 35324164.21 
15.303015385507 12.89069871 9.846160432301  19.273003650798 12.485238927683 
8.616686288902  22.107605346777 26.429323523825 27.707342676029 27.433635972918 
55.220634430827 56.861286101534 30534943.77 24481685.94 22178710.81 
25695798.18 29954903.78 18084140.05 30805576.13 47156887.22 
51158588.86 55759942.8  86253544.16 83451555.63
Conventional childr 977787  FairviewWilliamson County   TN  
United States   DHL,BARIAN  20018860645.55  14415813.74 
6761497.23  11820654.76 8246260.69  6636877.49  11434492.25 
25673812.14 23074206.96 21834581.94 26894900.53 33575091.74 
9.061938387399  14.743306814265 6.915102399603  12.089191981484 8.433596161537  
6.787651594877  11.694256775759 26.257060218637 23.598398178745 22.330611820366 
27.50538776 34.337838138572 23836085.83 32073313.37 25037904.18 
22659895.86 21757401.03 24451608.1  21933001.85 55996703.43 
57371880.44 62087214.51 82849910.15 88970319.31
Doors canno 294242  FairviewWilliamson County   TN  United 
States   DHL,BARIAN  20016355232.31  10198920.36 10246200.97
 12209716.5  8566998.28  8806316.81  9789405.6   16466584.88
 26443785.61 27016047.8  33660589.67 27462468.62 
21.598657941422 34.66167426812  34.822360404021 41.495491806065 29.115484125312 
29.928823247531 33.269912520986 55.962727550791 89.870873668613 91.815742823934 
114.39763755684193.332932144289 22645143.09 24487254.6  
24925759.42 30503655.27 26558160.29 20976233.52 29895796.09 
56002198.38 53488158.53 76287235.46 82483747.59 88088266.69
Important issues liv138504  FairviewWilliamson County   TN  
United States   DHL,BARIAN  200111748784.55 14351305.77 
9896470.93  7990874.78  8879247.9   7362383.09  10011144.75 
17741201.32 21346976.05 18074978.16 29675125.64 32545325.29 
84.826319456478 103.61654370992971.452600141512 57.694180529082 
64.108241639231 53.156465445041 72.280546049212 128.091616993011
154.12533970138 130.501488476867214.254647086005
234.97751176861427204167.15 25980378.13 19943398.93 
25710421.13 19484481.03 26346611.48 25075158.43 54094778.13 
41066732.11 54547058.28 72465962.92 92770328.27
{code}

Hive:
{code}
NULLNULLFairviewWilliamson County   TN  United States   
DHL,BARIAN  20019597806.95  11121820.57 8670867.91  
8994786.04  10887248.09 14187671.36 9732598.41  19798897.07 
21007842.34 21495513.67 34795669.17 33122997.94 NULLNULL
NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL
21913594.59 32518476.51 24885662.72 25698343.86 33735910.61 
35527031.58 25465193.48 53623238.66 51409986.76 54159173.9  
92227043.25 83435390.84
Bad cards must make.621234  FairviewWilliamson County   TN  
United States   DHL,BARIAN  20019506753.46  8008140.33  
6116769.63  11973045.15 7756254.92  5352978.49  13733996.1  
16418794.37 17212743.32 17042707.41 34304935.61 35324164.21 
15.303015385507 12.89069871 9.846160432301  19.273003650798 12.485238927683 
8.616686288902  22.107605346777 26.429323523825 27.707342676029 27.433635972918 
55.220634430827 56.861286101534 30534943.77 

[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-19 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200160#comment-15200160
 ] 

Xin Wu commented on SPARK-13863:


In terms of the ordering, the only difference is that the row with a NULL value 
for the ORDER BY column (w_warehouse_name) is placed at the top by both Hive and 
Spark SQL, while the expected result has it at the bottom. The other rows are OK, 
so it seems the expected results have the NULL row in the wrong place.
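
If the benchmark's expected ordering really requires the NULL row at the bottom, one way to pin the placement explicitly (a sketch, assuming a dialect that accepts the NULLS LAST modifier on ORDER BY; w_warehouse_sq_ft assumed from the TPC-DS warehouse table) is:
{code}
-- sketch: make NULL placement explicit instead of relying on the engine default
select w_warehouse_name, w_warehouse_sq_ft
  from warehouse
 order by w_warehouse_name nulls last;
{code}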

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Aggregations slightly off -- eg. JAN_SALES column of "Doors canno"  row - 
> SparkSQL returns 6355232.185385704, expected 6355232.31
> Actual results:
> {noformat}
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> 

[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202412#comment-15202412
 ] 

Xin Wu edited comment on SPARK-13832 at 3/19/16 12:29 AM:
--

Jesse, you are right. With the "grouping" function, the query hits the 
{code}com.esotericsoftware.kryo.KryoException{code}, even with the text data 
files. So in this case, we passed the analyzer.

With grouping_id on column i_category, the query hits the analyzer issue. 
{code}Error in query: Columns of grouping_id...{code}

I will continue digging in. 


was (Author: xwu0226):
Jesse, you are right.. With "grouping" function, the query hits the 
{code}com.esotericsoftware.kryo.KryoException{code}, even thought with text 
data file. So this case, we passed the analyzer.

With grouping_id on column i_category, the query hits the analyzer issue. 
{code}Error in query: Columns of grouping_id...{code}

I will continue digging in. 

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202412#comment-15202412
 ] 

Xin Wu commented on SPARK-13832:


Jesse, you are right. With the "grouping" function, the query hits the 
{code}com.esotericsoftware.kryo.KryoException{code}, even with the text data 
files. So in this case, we passed the analyzer.

With grouping_id on column i_category, the query hits the analyzer issue. 
{code}Error in query: Columns of grouping_id...{code}

I will continue digging in. 

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202264#comment-15202264
 ] 

Xin Wu commented on SPARK-13832:


What I meant is that in Spark 2.0, "grouping__id" seems to be deprecated and 
grouping_id() is used instead, so I needed to change this to proceed. But after 
the query is parsed, the AnalysisException you reported in this JIRA, 
{code}"org.apache.spark.sql.AnalysisException: expression 
'i_category'..."{code}, is not reproducible. As for the later execution error, 
I am still validating whether it is related to the data or to a Spark SQL 
execution issue; but it is not a parser or analyzer error.

In 1.6, the AnalysisException is reproducible; it is no longer an issue in 2.0.
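
A sketch of the substitution described above, applied to the query text quoted below (the only change is replacing the Hive pseudo-column grouping__id with grouping_id(i_category, i_class); everything else is unchanged):
{code}
select
    sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
   ,i_category
   ,i_class
   ,grouping_id(i_category, i_class) as lochierarchy
   ,rank() over (
        partition by grouping_id(i_category, i_class),
        case when grouping_id(i_category, i_class) = 0 then i_category end
        order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as rank_within_parent
 from
    store_sales
   ,date_dim d1
   ,item
   ,store
 where
    d1.d_year = 2001
    and d1.d_date_sk = ss_sold_date_sk
    and i_item_sk = ss_item_sk
    and s_store_sk = ss_store_sk
    and s_state in ('TN','TN','TN','TN','TN','TN','TN','TN')
 group by i_category, i_class with rollup
 order by
    lochierarchy desc
   ,case when lochierarchy = 0 then i_category end
   ,rank_within_parent
 limit 100;
{code}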


> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/16/16 12:39 AM:
--

Trying this query in Spark 2.0, I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
Spark 2.0, except that I saw an execution error related to 
com.esotericsoftware.kryo.KryoException.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
spark 2.0.. Except that I saw execution error maybe related to spark-13862.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/15/16 10:43 PM:
--

Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
spark 2.0.. Except that I saw execution error related to spark-13862.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is gone.. Except that I 
saw execution error related to kryo.serializers.. that should be a different 
issue and maybe related to my setup. 

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error

2016-03-15 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377
 ] 

Xin Wu edited comment on SPARK-13832 at 3/15/16 10:44 PM:
--

Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
spark 2.0.. Except that I saw execution error maybe related to spark-13862.


was (Author: xwu0226):
Trying this query in Spark 2.0 and I needed to change grouping__id to 
grouping_id() to pass the parser. The reported error is not reproducible in 
spark 2.0.. Except that I saw execution error related to spark-13862.

> TPC-DS Query 36 fails with Parser error
> ---
>
> Key: SPARK-13832
> URL: https://issues.apache.org/jira/browse/SPARK-13832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS query 36 fails with the following error
> Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'i_category' is neither present in the group by, nor is it an aggregate 
> function. Add to group by or wrap in first() (or first_value) if you don't 
> care which value you get.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> Query Text pasted here for quick reference.
>   select
> sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
>,i_category
>,i_class
>,grouping__id as lochierarchy
>,rank() over (
> partition by grouping__id,
> case when grouping__id = 0 then i_category end
> order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as 
> rank_within_parent
>  from
> store_sales
>,date_dim   d1
>,item
>,store
>  where
> d1.d_year = 2001
>  and d1.d_date_sk = ss_sold_date_sk
>  and i_item_sk  = ss_item_sk
>  and s_store_sk  = ss_store_sk
>  and s_state in ('TN','TN','TN','TN',
>  'TN','TN','TN','TN')
>  group by i_category,i_class WITH ROLLUP
>  order by
>lochierarchy desc
>   ,case when lochierarchy = 0 then i_category end
>   ,rank_within_parent
> limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


