[jira] [Created] (SPARK-31184) Support getTablesByType API of Hive Client
Xin Wu created SPARK-31184: -- Summary: Support getTablesByType API of Hive Client Key: SPARK-31184 URL: https://issues.apache.org/jira/browse/SPARK-31184 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu Hive 2.3+ supports the getTablesByType API, which is a precondition for implementing SHOW VIEWS in HiveExternalCatalog. Currently, without this API, we cannot get Hive tables with type HiveTableType.VIRTUAL_VIEW directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
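A minimal, hedged sketch of calling the underlying Hive 2.3 API from Scala, to show what this ticket wants to surface through HiveClient; the exact getTablesByType signature and the pattern argument are assumptions here, not verified against the shim work the ticket proposes:
{code}
// Hedged sketch only: assumes Hive 2.3+ jars on the classpath and a configured metastore;
// the getTablesByType signature shown is an assumption.
import org.apache.hadoop.hive.metastore.TableType
import org.apache.hadoop.hive.ql.metadata.Hive

val hiveClient = Hive.get()
// Ask the metastore directly for views instead of listing all tables and filtering client-side.
val viewNames = hiveClient.getTablesByType("default", "*", TableType.VIRTUAL_VIEW)
{code}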
[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"
[ https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056669#comment-17056669 ] Xin Wu commented on SPARK-31113: Sure, I'm working on this! Thanks [~smilegator] > Support DDL "SHOW VIEWS" > > > Key: SPARK-31113 > URL: https://issues.apache.org/jira/browse/SPARK-31113 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > It is nice to have a `SHOW VIEWS` command similar to Hive > (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
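The intended end-user syntax follows Hive's SHOW VIEWS; a hedged spark-shell sketch of what usage could look like once the DDL lands (the LIKE pattern form is an assumption based on the linked Hive documentation, since the command did not exist yet at the time of this comment):
{code}
// Hedged usage sketch for the proposed command.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW sample_view AS SELECT 1 AS id")
spark.sql("SHOW VIEWS").show()                 // list views visible in the current database
spark.sql("SHOW VIEWS LIKE 'sample*'").show()  // optional pattern filter, as in Hive
{code}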
[jira] [Created] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted
Xin Wu created SPARK-31079: -- Summary: Add RuleExecutor metrics in Explain Formatted Key: SPARK-31079 URL: https://issues.apache.org/jira/browse/SPARK-31079 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu RuleExecutor already supports metering for the analyzer/optimizer. Providing such information in the EXPLAIN command gives users a better experience when debugging a specific query. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
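For context, the rule metrics in question can already be dumped programmatically; a small spark-shell sketch (the query itself is made up for illustration), with the ticket proposing to surface the same data inside EXPLAIN FORMATTED output:
{code}
// Dump the analyzer/optimizer rule timings accumulated by RuleExecutor.
import org.apache.spark.sql.catalyst.rules.RuleExecutor

spark.range(100).selectExpr("id % 10 AS bucket", "id AS value")
  .groupBy("bucket").count()
  .explain("formatted")
println(RuleExecutor.dumpTimeSpent())
{code}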
[jira] [Created] (SPARK-30940) Remove meaningless attributeId when Explain SQL query
Xin Wu created SPARK-30940: -- Summary: Remove meaningless attributeId when Explain SQL query Key: SPARK-30940 URL: https://issues.apache.org/jira/browse/SPARK-30940 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xin Wu When running EXPLAIN on a SQL query, the generated aliases shouldn't include expression/attribute IDs. This will provide better readability of the EXPLAIN results. This is a follow-up to address [#27368 (comment)|https://github.com/apache/spark/pull/27368#discussion_r376927143]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30842) Adjust abstraction structure for join operators
Xin Wu created SPARK-30842: -- Summary: Adjust abstraction structure for join operators Key: SPARK-30842 URL: https://issues.apache.org/jira/browse/SPARK-30842 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu Currently the join operators are not well abstracted, even though there is a lot of common logic. A trait can be created for easier pattern matching and other future convenience. This is a follow-up based on comment [https://github.com/apache/spark/pull/27509#discussion_r379613391] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
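An illustrative sketch only (the trait name and exact shape are assumptions, not the merged design) of how a common trait lets callers pattern-match on any join operator uniformly:
{code}
// Hypothetical shared trait for the physical join operators.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.JoinType
import org.apache.spark.sql.execution.SparkPlan

trait BaseJoinLike { self: SparkPlan =>
  def leftKeys: Seq[Expression]     // join keys from the left child, if any
  def rightKeys: Seq[Expression]    // join keys from the right child, if any
  def joinType: JoinType
  def condition: Option[Expression] // extra non-equi join condition
}
{code}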
[jira] [Created] (SPARK-30765) Refine base class abstraction code style
Xin Wu created SPARK-30765: -- Summary: Refine base class abstraction code style Key: SPARK-30765 URL: https://issues.apache.org/jira/browse/SPARK-30765 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu While doing the base operator abstraction work, I found some code snippets are still inconsistent with the rest of the abstraction code style. Case 1: the override keyword is missing for some fields in derived classes. The compiler will not catch it if we rename those fields in the future. [https://github.com/apache/spark/pull/27368#discussion_r376694045] Case 2: inconsistent abstract class definitions. The updated style will simplify derived class definitions. [https://github.com/apache/spark/pull/27368#discussion_r375061952] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
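A minimal illustration of Case 1, using hypothetical class names (not the actual Spark operators): without override, renaming the base member silently leaves an unrelated member in the subclass instead of producing a compile error.
{code}
// Hypothetical example of why the override keyword matters for inherited members.
abstract class BaseOperator {
  def outputName: String = "base"
}

class ConcreteOperator extends BaseOperator {
  // With `override`, renaming BaseOperator.outputName becomes a compile error here,
  // instead of ConcreteOperator quietly defining a brand-new, unrelated member.
  override def outputName: String = "concrete"
}
{code}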
[jira] [Created] (SPARK-30764) Improve the readability of EXPLAIN FORMATTED style
Xin Wu created SPARK-30764: -- Summary: Improve the readability of EXPLAIN FORMATTED style Key: SPARK-30764 URL: https://issues.apache.org/jira/browse/SPARK-30764 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu The style of EXPLAIN FORMATTED output needs to be improved. We’ve already got some observations/ideas in [https://github.com/apache/spark/pull/27368#discussion_r376694496]. TODOs: 1. Using a comma as the separator is not clear, especially since commas are also used inside the expressions. 2. Show the column counts first? For example, `Results [4]: …` 3. Currently the attribute names are automatically generated; this needs to be refined. 4. Add an arguments field in common implementations as EXPLAIN FORMATTED did in QueryPlan ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30652) EXPLAIN EXTENDED does not show detail information for aggregate operators
[ https://issues.apache.org/jira/browse/SPARK-30652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu resolved SPARK-30652. Resolution: Duplicate > EXPLAIN EXTENDED does not show detail information for aggregate operators > - > > Key: SPARK-30652 > URL: https://issues.apache.org/jira/browse/SPARK-30652 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xin Wu >Priority: Major > > Currently EXPLAIN FORMATTED only report input attributes of > HashAggregate/ObjectHashAggregate/SortAggregate. While EXPLAIN EXTENDED > provides more information. We need to enhance EXPLAIN FORMATTED to follow the > original behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30652) EXPLAIN EXTENDED does not show detail information for aggregate operators
Xin Wu created SPARK-30652: -- Summary: EXPLAIN EXTENDED does not show detail information for aggregate operators Key: SPARK-30652 URL: https://issues.apache.org/jira/browse/SPARK-30652 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xin Wu Currently EXPLAIN FORMATTED only reports the input attributes of HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED provides more information. We need to enhance EXPLAIN FORMATTED to match the original behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30651) EXPLAIN EXTENDED does not show detail information for aggregate operators
Xin Wu created SPARK-30651: -- Summary: EXPLAIN EXTENDED does not show detail information for aggregate operators Key: SPARK-30651 URL: https://issues.apache.org/jira/browse/SPARK-30651 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xin Wu Currently EXPLAIN FORMATTED only reports the input attributes of HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED provides more information. We need to enhance EXPLAIN FORMATTED to match the original behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30326) Raise exception if analyzer exceed max iterations
[ https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-30326: --- Description: Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds the max iterations. They should have different behavior: the analyzer should raise an exception to indicate that the plan is not fixed after the max iterations, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced. (was: Currently, both analyzer and optimizer just log warning message if rule execution exceed max iterations. They should have different behavior. Analyzer should raise exception to indicates logical plan resolve failed, while optimizer just log warning to keep the current plan. This is more feasible after SPARK-30138 was introduced.) > Raise exception if analyzer exceed max iterations > - > > Key: SPARK-30326 > URL: https://issues.apache.org/jira/browse/SPARK-30326 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xin Wu >Priority: Major > > Currently, both the analyzer and the optimizer just log a warning message if rule > execution exceeds the max iterations. They should have different behavior: the > analyzer should raise an exception to indicate that the plan is not fixed after the max > iterations, while the optimizer should just log a warning and keep the current plan. This > is more feasible after SPARK-30138 was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30326) Raise exception if analyzer exceed max iterations
Xin Wu created SPARK-30326: -- Summary: Raise exception if analyzer exceed max iterations Key: SPARK-30326 URL: https://issues.apache.org/jira/browse/SPARK-30326 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xin Wu Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds the max iterations. They should have different behavior: the analyzer should raise an exception to indicate that the logical plan failed to resolve, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
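A rough, hedged sketch of the proposed split in behavior (illustrative only, not the merged patch); the helper below stands in for the point inside RuleExecutor where the max-iterations check happens:
{code}
// Illustrative sketch: analyzer fails fast on an unresolved plan, optimizer keeps going.
def onMaxIterationsExceeded(batchName: String, maxIterations: Int, isAnalyzer: Boolean): Unit = {
  val message = s"Max iterations ($maxIterations) reached for batch $batchName"
  if (isAnalyzer) {
    // The plan did not reach a fixed point, so analysis cannot be trusted; fail fast.
    throw new RuntimeException(message)
  } else {
    // The optimizer can safely stop and keep the current (still correct) plan.
    println(s"WARN: $message")
  }
}
{code}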
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987976#comment-15987976 ] Xin Wu commented on SPARK-18727: Thanks! > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987939#comment-15987939 ] Xin Wu commented on SPARK-18727: [~ekhliang] I see. I will try to support ALTER TABLE SCHEMA. Also this is similar to or the same as ALTER TABLE REPLACE COLUMNS, which is documented as an unsupported Hive feature in SqlBase.g4. Do we have a preference for which one to use? > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987874#comment-15987874 ] Xin Wu commented on SPARK-18727: [~ekhliang] First of all, I am not sure whether it is wise to introduce more non-SQL-standard syntax into Spark's DDL. In addition, with ALTER TABLE SCHEMA, or ALTER TABLE SET/UPDATE/MODIFY SCHEMA, however we call it, it requires users to put in the whole list of column definitions for perhaps a small change to a single column. It is inconvenient, especially when the table is relatively wide. What do you think [~smilegator]? > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987830#comment-15987830 ] Xin Wu commented on SPARK-18727: [~simeons] You are right. My PR does not include the feature that allows you to add a new field to a complex type. Such a feature could be supported by {code}ALTER TABLE CHANGE COLUMN {code}, where newType has the newly added fields. I am also working on this part. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
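A hypothetical illustration of the idea (table, column, and struct types are made up, and whether Spark's ALTER TABLE CHANGE COLUMN accepts such a type change is an assumption, not something the PR above implements):
{code}
// Hypothetical sketch only: widen a struct column by re-declaring it with one extra field.
spark.sql("""
  ALTER TABLE events CHANGE COLUMN payload payload
  STRUCT<id: BIGINT, name: STRING, new_tag: STRING>
""")
{code}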
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987642#comment-15987642 ] Xin Wu commented on SPARK-18727: FYI. I have https://github.com/apache/spark/pull/16626 for ALTER TABLE ADD COLUMNS merged into 2.2. > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
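For reference, the ALTER TABLE ADD COLUMNS shape added by that PR, as a small spark-shell sketch (table and column names are made up for illustration):
{code}
// Add columns to an existing table without dropping and recreating it.
spark.sql("CREATE TABLE people (name STRING, age INT) USING hive")
spark.sql("ALTER TABLE people ADD COLUMNS (city STRING, zip INT)")
spark.sql("DESCRIBE TABLE people").show()
{code}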
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964969#comment-15964969 ] Xin Wu commented on SPARK-20256: Yes. I am working on it. My proposal is to revert the SPARK-18050 change, then add a try-catch over externalCatalog.createDatabase(...) and log the error of existing default database from Hive into DEBUG log. I am trying to create a unit-test case to simulate the permission issue, which I have some difficulty. > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
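A hedged sketch of the try-catch proposal described above (illustrative only, not the merged fix): tolerate failures when ensuring the "default" database exists, instead of letting a warehouse-permission error abort SparkSession creation.
{code}
// Sketch under assumptions: the real change lives inside Spark's shared state, and the
// real logging would use logDebug; println stands in for it here.
import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, ExternalCatalog}

def ensureDefaultDatabase(catalog: ExternalCatalog, defaultDb: CatalogDatabase): Unit = {
  try {
    catalog.createDatabase(defaultDb, ignoreIfExists = true)
  } catch {
    case e: Exception =>
      // "default" already exists in the Hive metastore, or we lack access to the
      // warehouse dir; record it quietly and keep starting the session.
      println(s"DEBUG: could not create the default database: ${e.getMessage}")
  }
}
{code}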
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at >
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961193#comment-15961193 ] Xin Wu commented on SPARK-20256: I am working on a fix and creating simulated test cases for this issue. > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) > ); > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at >
[jira] [Updated] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-20256: --- Description: In a cluster setup with production Hive running, when the user wants to run spark-shell using the production Hive metastore, hive-site.xml is copied to SPARK_HOME/conf. So when spark-shell is being started, it tries to check database existence of "default" database from Hive metastore. Yet, since this user may not have READ/WRITE access to the configured Hive warehouse directory done by Hive itself, such permission error will prevent spark-shell or any spark application with Hive support enabled from starting at all. Example error: {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) ... 
47 elided Caused by: java.lang.reflect.InvocationTargetException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) ); at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978) ... 58 more Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at
[jira] [Created] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
Xin Wu created SPARK-20256: -- Summary: Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir Key: SPARK-20256 URL: https://issues.apache.org/jira/browse/SPARK-20256 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.1.1, 2.2.0 Reporter: Xin Wu Priority: Critical In a cluster setup with production Hive running, when the user wants to run spark-shell using the production Hive metastore, hive-site.xml is copied to SPARK_HOME/conf. So when spark-shell is being started, it tries to check database existence of "default" database from Hive metastore. Yet, since this user may not have READ/WRITE access to the configured Hive warehouse directory done by Hive itself, such permission error will prevent spark-shell or any spark application with Hive support enabled from starting at all. Example error: {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState': at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) ... 
47 elided Caused by: java.lang.reflect.InvocationTargetException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) ); at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978) ... 58 more Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException:
[jira] [Updated] (SPARK-19539) CREATE TEMPORARY TABLE needs to avoid existing temp view
[ https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-19539: --- Summary: CREATE TEMPORARY TABLE needs to avoid existing temp view (was: CREATE TEMPORARY TABLE need to avoid existing temp view) > CREATE TEMPORARY TABLE needs to avoid existing temp view > > > Key: SPARK-19539 > URL: https://issues.apache.org/jira/browse/SPARK-19539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xin Wu > > Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to > use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" > clause. However, if there is an existing temporary view defined, it is > possible to unintentionally replace this existing view by issuing "CREATE > TEMPORARY TABLE ... " with the same table/view name. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view
Xin Wu created SPARK-19539: -- Summary: CREATE TEMPORARY TABLE need to avoid existing temp view Key: SPARK-19539 URL: https://issues.apache.org/jira/browse/SPARK-19539 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xin Wu Currently, "CREATE TEMPORARY TABLE ... " is deprecated and users are recommended to use "CREATE TEMPORARY VIEW ..." instead, and it does not support an "IF NOT EXISTS" clause. However, if there is an existing temporary view defined, it is possible to unintentionally replace that existing view by issuing "CREATE TEMPORARY TABLE ... " with the same table/view name. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
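A small spark-shell illustration of the accidental replacement described above (names and the data source path are made up; the behavior is as reported in this issue):
{code}
// The second statement silently replaces the earlier temp view of the same name.
spark.sql("CREATE TEMPORARY VIEW people AS SELECT 1 AS id, 'alice' AS name")
spark.sql("CREATE TEMPORARY TABLE people USING parquet OPTIONS (path '/tmp/other_people')")
spark.sql("SELECT * FROM people").show()  // now reads the parquet path, not the original view
{code}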
[jira] [Closed] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu closed SPARK-15463. -- Resolution: Later > Support for creating a dataframe from CSV in Dataset[String] > > > Key: SPARK-15463 > URL: https://issues.apache.org/jira/browse/SPARK-15463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: PJ Fanning > > I currently use Databrick's spark-csv lib but some features don't work with > Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV > support into spark-sql directly, that spark-csv won't be modified. > I currently read some CSV data that has been pre-processed and is in > RDD[String] format. > There is sqlContext.read.json(rdd: RDD[String]) but other formats don't > appear to support the creation of DataFrames based on loading from > RDD[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729696#comment-15729696 ] Xin Wu commented on SPARK-18727: I am currently working on ALTER TABLE ADD COLUMNS, to tables with provider = hive and will submit a PR soon. Just wondering whether it will solve part of this JIRA. Please advise! Thanks! > Support schema evolution as new files are inserted into table > - > > Key: SPARK-18727 > URL: https://issues.apache.org/jira/browse/SPARK-18727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > Now that we have pushed partition management of all tables to the catalog, > one issue for scalable partition handling remains: handling schema updates. > Currently, a schema update requires dropping and recreating the entire table, > which does not scale well with the size of the table. > We should support updating the schema of the table, either via ALTER TABLE, > or automatically as new files with compatible schemas are appended into the > table. > cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888 ] Xin Wu edited comment on SPARK-18539 at 12/6/16 12:46 AM: -- I think we will hit the issue if we use user-specified schema. Here is what I tried in spark-shell built from master branch: {code} val df = spark.range(1).coalesce(1) df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539") val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))) spark.read.option("mergeSchema", "true").schema(schema).parquet("/Users/xinwu/spark-test/data/spark-18539").filter("b is null").count() {code} The exception is {code} Caused by: java.lang.IllegalArgumentException: Column [b] was not found in schema! at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) {code} Here I have one parquet file missing column b and query with user-specified schema (a, b). was (Author: xwu0226): I think we will hit the issue if we use user-specified schema. Here is what I tried in spark-shell built from master branch: {code} val df = spark.range(1).coalesce(1) df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539") val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))) spark.read.option("mergeSchema", "true").schema(schema).parquet("/Users/xinwu/spark-test/data/spark-18539").filter("b < 0").count() {code} The exception is {code} Caused by: java.lang.IllegalArgumentException: Column [b] was not found in schema! 
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) at
[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888 ] Xin Wu commented on SPARK-18539: I think we will hit the issue if we use user-specified schema. Here is what I tried in spark-shell built from master branch: {code} val df = spark.range(1).coalesce(1) df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539") val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))) spark.read.option("mergeSchema", "true").schema(schema).parquet("/Users/xinwu/spark-test/data/spark-18539").filter("b < 0").count() {code} The exception is {code} Caused by: java.lang.IllegalArgumentException: Column [b] was not found in schema! at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) {code} Here I have one parquet file missing column b and query with user-specified schema (a, b). 
> Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! > at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at >
[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723296#comment-15723296 ] Xin Wu commented on SPARK-18539: Yes. I have the fix and will submit PR and cc everyone for review. > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! > at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at
[jira] [Created] (SPARK-17551) support null ordering for DataFrame API
Xin Wu created SPARK-17551: -- Summary: support null ordering for DataFrame API Key: SPARK-17551 URL: https://issues.apache.org/jira/browse/SPARK-17551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xin Wu SPARK-10747 has added support for NULLS FIRST | LAST in the ORDER BY clause for the SQL interface. This JIRA is to complete this feature by adding the same support to the DataFrame/Dataset APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
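A sketch of the kind of DataFrame/Dataset API being requested, using null-ordering Column methods (spark-shell, so the implicits needed by toDF are in scope; the data is made up for illustration):
{code}
// Express NULLS FIRST / NULLS LAST directly on a Column instead of via SQL text.
import org.apache.spark.sql.functions.col

val df = Seq(Some(1), None, Some(3)).toDF("value")
df.orderBy(col("value").desc_nulls_last).show()   // descending sort with NULLs last
df.orderBy(col("value").asc_nulls_first).show()   // ascending sort with NULLs first
{code}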
[jira] [Updated] (SPARK-10747) add support for NULLS FIRST|LAST in ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-10747: --- Summary: add support for NULLS FIRST|LAST in ORDER BY clause (was: add support for window specification to include how NULLS are ordered) > add support for NULLS FIRST|LAST in ORDER BY clause > --- > > Key: SPARK-10747 > URL: https://issues.apache.org/jira/browse/SPARK-10747 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > You cannot express how NULLS are to be sorted in the window order > specification and have to use a compensating expression to simulate. > Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' > near 'nulls' > line 1:82 missing EOF at 'last' near 'nulls'; > SQLState: null > Same limitation as Hive reported in Apache JIRA HIVE-9535 ) > This fails > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc > nulls last) from tolap > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when > c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered
[ https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440656#comment-15440656 ] Xin Wu commented on SPARK-10747: This JIRA may be changed to support NULLS FIRST|LAST feature in ORDER BY clause. > add support for window specification to include how NULLS are ordered > - > > Key: SPARK-10747 > URL: https://issues.apache.org/jira/browse/SPARK-10747 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > You cannot express how NULLS are to be sorted in the window order > specification and have to use a compensating expression to simulate. > Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' > near 'nulls' > line 1:82 missing EOF at 'last' near 'nulls'; > SQLState: null > Same limitation as Hive reported in Apache JIRA HIVE-9535 ) > This fails > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc > nulls last) from tolap > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when > c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10747) add support for window specification to include how NULLS are ordered
[ https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-10747: --- Issue Type: New Feature (was: Improvement) > add support for window specification to include how NULLS are ordered > - > > Key: SPARK-10747 > URL: https://issues.apache.org/jira/browse/SPARK-10747 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > You cannot express how NULLS are to be sorted in the window order > specification and have to use a compensating expression to simulate. > Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' > near 'nulls' > line 1:82 missing EOF at 'last' near 'nulls'; > SQLState: null > Same limitation as Hive reported in Apache JIRA HIVE-9535 ) > This fails > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc > nulls last) from tolap > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when > c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457 ] Xin Wu edited comment on SPARK-14927 at 8/26/16 4:46 AM: - [~smilegator] Do you think what you are working on will fix this issue by the way? This is to allow hive to see the partitions created by SparkSQL from a data frame. was (Author: xwu0226): [~smilegator] Do you think what you are working on regarding will fix this issue? This is to allow hive to see the partitions created by SparkSQL from a data frame. > DataFrame. saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that > `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` > always creates table with empty partitions. > Any help to move this forward is appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457 ] Xin Wu commented on SPARK-14927: [~smilegator] Do you think what you are working on regarding will fix this issue? This is to allow hive to see the partitions created by SparkSQL from a data frame. > DataFrame. saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that > `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` > always creates table with empty partitions. > Any help to move this forward is appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered
[ https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425547#comment-15425547 ] Xin Wu commented on SPARK-10747: [~hvanhovell] Yes. Since we have native parser now, we can do this within SparkSQL. I can work on this. Thanks! > add support for window specification to include how NULLS are ordered > - > > Key: SPARK-10747 > URL: https://issues.apache.org/jira/browse/SPARK-10747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > You cannot express how NULLS are to be sorted in the window order > specification and have to use a compensating expression to simulate. > Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' > near 'nulls' > line 1:82 missing EOF at 'last' near 'nulls'; > SQLState: null > Same limitation as Hive reported in Apache JIRA HIVE-9535 ) > This fails > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc > nulls last) from tolap > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when > c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
[ https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-16924: --- Issue Type: Improvement (was: Bug) > DataStreamReader can not support option("inferSchema", true/false) for csv > and json file source > --- > > Key: SPARK-16924 > URL: https://issues.apache.org/jira/browse/SPARK-16924 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > Currently DataStreamReader can not support option("inferSchema", true|false) > for csv and json file source. It only takes SQLConf setting > "spark.sql.streaming.schemaInference", which needs to be set at session > level. > For example: > {code} > scala> val in = spark.readStream.format("json").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/json/t1") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > scala> val in = spark.readStream.format("csv").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/csv") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > {code} > In the example, even though users specify the option("inferSchema", true), it > does not take it. But for batch data, DataFrameReader can take it: > {code} > scala> val in = spark.read.format("csv").option("header", > true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") > in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
Xin Wu created SPARK-16924: -- Summary: DataStreamReader can not support option("inferSchema", true/false) for csv and json file source Key: SPARK-16924 URL: https://issues.apache.org/jira/browse/SPARK-16924 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Currently DataStreamReader can not support option("inferSchema", true|false) for csv and json file source. It only takes SQLConf setting "spark.sql.streaming.schemaInference", which needs to be set at session level. For example: {code} scala> val in = spark.readStream.format("json").option("inferSchema", true).load("/Users/xinwu/spark-test/data/json/t1") java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it. at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) ... 48 elided scala> val in = spark.readStream.format("csv").option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv") java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it. at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) ... 48 elided {code} In the example, even though users specify the option("inferSchema", true), it does not take it. But for batch data, DataFrameReader can take it: {code} scala> val in = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
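For reference, the session-level workaround mentioned in the description, applied before creating the streaming reader (the path is the reporter's example):
{code}
// Session-level workaround from the description: enable schema inference for streaming
// sources via SQLConf, since the per-reader option("inferSchema", true) is ignored.
spark.conf.set("spark.sql.streaming.schemaInference", "true")
val in = spark.readStream.format("json").load("/Users/xinwu/spark-test/data/json/t1")
{code}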
[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE
[ https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408405#comment-15408405 ] Xin Wu commented on SPARK-9761: --- [~drwinters] Spark 2.0 has added native support for DDL commands, which opens the door to implementing ALTER TABLE ADD/CHANGE COLUMNS; that is not yet supported in the released Spark 2.0. Spark 2.1 will also bring some changes to the native DDL infrastructure. I think once that is settled, it will be easier to support this. I am looking into this as well. > Inconsistent metadata handling with ALTER TABLE > --- > > Key: SPARK-9761 > URL: https://issues.apache.org/jira/browse/SPARK-9761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: hive, sql > > Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. > The table in question was created with {{HiveContext.read.json()}}. > Steps: > # {{alter table dimension_components add columns (z string);}} succeeds. > # {{describe dimension_components;}} does not show the new column, even after > restarting spark-sql. > # A second {{alter table dimension_components add columns (z string);}} fails > with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: > Duplicate column name: z > Full spark-sql output > [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383082#comment-15383082 ] Xin Wu edited comment on SPARK-16605 at 7/18/16 9:17 PM: - The current issue for dealing with ORC data inserted by Hive is that the schema stored in orc file inserted by hive is using dummy column name such as "_col1, _col2, ...". Hive knows how to read the data. However, in Spark SQL, for performance gain, it tries to convert ORC table to its native ORC relation for scanning, in that it infers schema from orc file directly but getting the table schema from hive metastore. There are then mismatch here. Try the workaround that turns off this conversion for performance: {code}set spark.sql.hive.convertMetastoreOrc=false{code} Then, see if it works. was (Author: xwu0226): The current issue for dealing with ORC data inserted by Hive is that the schema stored in orc file inserted by hive is using dummy column name such as "_col1, _col2, ...". Hive knows how to read the data. However, in Spark SQL, for performance gain, it tries to convert ORC table to its native ORC relation for scanning, in that it infers schema from orc file directly but getting the table schema from hive megastore. There are then mismatch here. Try the workaround that turns off this conversion for performance: {code}set spark.sql.hive.convertMetastoreOrc=false{code} Then, see if it works. > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16605) Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports
[ https://issues.apache.org/jira/browse/SPARK-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383082#comment-15383082 ] Xin Wu commented on SPARK-16605: The current issue with ORC data inserted by Hive is that the schema stored in the ORC files uses dummy column names such as "_col1, _col2, ...". Hive knows how to read the data. However, for better scan performance, Spark SQL tries to convert the ORC table to its native ORC relation, and in doing so it infers the schema from the ORC files directly instead of taking the table schema from the Hive metastore, so the two do not match. Try the workaround of turning off this performance conversion: {code}set spark.sql.hive.convertMetastoreOrc=false{code} Then see if it works. > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > --- > > Key: SPARK-16605 > URL: https://issues.apache.org/jira/browse/SPARK-16605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: marymwu > Attachments: screenshot-1.png > > > Spark2.0 cannot "select" data from a table stored as an orc file which has > been created by hive while hive or spark1.6 supports > Steps: > 1. Use hive to create a table "tbtxt" stored as txt and load data into it. > 2. Use hive to create a table "tborc" stored as orc and insert the data from > table "tbtxt" . Example, "create table tborc stored as orc as select * from > tbtxt" > 3. Use spark2.0 to "select * from tborc;".-->error > occurs,java.lang.IllegalArgumentException: Field "nid" does not exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
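The same workaround, issued programmatically from a spark-shell session (a sketch; the table name comes from the report above):
{code}
// Turn off the native ORC conversion so Spark reads the table through Hive's SerDe and
// uses the metastore schema, avoiding the "_col1, _col2, ..." mismatch described above.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("select * from tborc").show()
{code}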
[jira] [Created] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode
Xin Wu created SPARK-15970: -- Summary: WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode Key: SPARK-15970 URL: https://issues.apache.org/jira/browse/SPARK-15970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Priority: Minor When we run spark-shell in In-Memory catalog mode, creating a datasource table that is not compatible with Hive shows a warning message saying the table cannot be persisted in a Hive-compatible way. However, In-Memory catalog mode should not try to persist the table to the Hive metastore at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313257#comment-15313257 ] Xin Wu edited comment on SPARK-15705 at 6/2/16 11:15 PM: - I can recreate it now. and will look into it. This is different issue than SPARK-14959 was (Author: xwu0226): I can recreate it now. and will look into it. > Spark won't read ORC schema from metastore for partitioned tables > - > > Key: SPARK-15705 > URL: https://issues.apache.org/jira/browse/SPARK-15705 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1) >Reporter: Nic Eggert > > Spark does not seem to read the schema from the Hive metastore for > partitioned tables stored as ORC files. It appears to read the schema from > the files themselves, which, if they were created with Hive, does not match > the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To > reproduce: > In Hive: > {code} > hive> create table default.test (id BIGINT, name STRING) partitioned by > (state STRING) stored as orc; > hive> insert into table default.test partition (state="CA") values (1, > "mike"), (2, "steve"), (3, "bill"); > {code} > In Spark > {code} > scala> spark.table("default.test").printSchema > {code} > Expected result: Spark should preserve the column names that were defined in > Hive. > Actual Result: > {code} > root > |-- _col0: long (nullable = true) > |-- _col1: string (nullable = true) > |-- state: string (nullable = true) > {code} > Possibly related to SPARK-14959? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313257#comment-15313257 ] Xin Wu commented on SPARK-15705: I can recreate it now. and will look into it. > Spark won't read ORC schema from metastore for partitioned tables > - > > Key: SPARK-15705 > URL: https://issues.apache.org/jira/browse/SPARK-15705 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1) >Reporter: Nic Eggert > > Spark does not seem to read the schema from the Hive metastore for > partitioned tables stored as ORC files. It appears to read the schema from > the files themselves, which, if they were created with Hive, does not match > the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To > reproduce: > In Hive: > {code} > hive> create table default.test (id BIGINT, name STRING) partitioned by > (state STRING) stored as orc; > hive> insert into table default.test partition (state="CA") values (1, > "mike"), (2, "steve"), (3, "bill"); > {code} > In Spark > {code} > scala> spark.table("default.test").printSchema > {code} > Expected result: Spark should preserve the column names that were defined in > Hive. > Actual Result: > {code} > root > |-- _col0: long (nullable = true) > |-- _col1: string (nullable = true) > |-- state: string (nullable = true) > {code} > Possibly related to SPARK-14959? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database
[ https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313193#comment-15313193 ] Xin Wu commented on SPARK-15710: hmm.. after another rebase of the master. it seems that the problem is gone, even for pyspark: {code} >>> spark.sql("CREATE DATABASE IF NOT EXISTS test2") 16/06/02 15:16:10 WARN ObjectStore: Failed to get database test2, returning NoSuchObjectException DataFrame[] >>> spark.sql("USE test2") DataFrame[] >>> df = spark.createDataFrame([ ... (0, "a", 10), ... (1, "b", 11), ... (2, "c", 12), ... (3, "a", 14), ... (4, "a", 17), ... (5, "c", 18) ... ], ["id", "category", "age"]) >>> df.write.saveAsTable('test6', mode='overwrite') Jun 2, 2016 3:14:01 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory Scaling row group sizes to 96.54% for 7 writers Jun 2, 2016 3:16:43 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parq16/06/02 15:16:43 WARN HiveMetaStore: Location: file:/Users/xinwu/spark/spark-warehouse/test2.db/test6 specified for non-external table:test6 >>> spark.sql("SELECT * FROM test6 WHERE id = 2").take(1) [Row(id=2, category=u'c', age=12)] >>> spark.sql("SELECT * FROM test6 WHERE id = 2").show() +---++---+ | id|category|age| +---++---+ | 2| c| 12| +---++---+ {code} > Exception with WHERE clause in SQL for non-default Hive database > > > Key: SPARK-15710 > URL: https://issues.apache.org/jira/browse/SPARK-15710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: databricks community edition 2.0 preview >Reporter: Igor Fridman > > The following code throws an exception only with non-default database. If I > use 'default' database it works. > {code} > spark.sql("CREATE DATABASE IF NOT EXISTS test") > spark.sql("USE test") > df = spark.createDataFrame([ > (0, "a", 10), > (1, "b", 11), > (2, "c", 12), > (3, "a", 14), > (4, "a", 17), > (5, "c", 18) > ], ["id", "category", "age"]) > df.write.saveAsTable('test', mode='overwrite') > spark.sql("SELECT * FROM test WHERE id = 2").take(1) > {code} > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 13 df.write.saveAsTable('test', mode='overwrite') > 14 > ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1) > /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num) > 333 with SCCallSiteSync(self._sc) as css: > 334 port = > self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe( > --> 335 self._jdf, num) > 336 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 337 > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 931 answer = self.gateway_client.send_command(command) > 932 return_value = get_return_value( > --> 933 answer, self.gateway_client, self.target_id, self.name) > 934 > 935 for temp_arg in temp_args: > /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 310 raise Py4JJavaError( > 311 "An error occurred while calling {0}{1}{2}.\n". 
> --> 312 format(target_id, ".", name), value) > 313 else: > 314 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe. > : java.lang.ClassNotFoundException: > org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:264) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321) > at >
[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database
[ https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312851#comment-15312851 ] Xin Wu commented on SPARK-15710: I see. pyspark does not work. > Exception with WHERE clause in SQL for non-default Hive database > > > Key: SPARK-15710 > URL: https://issues.apache.org/jira/browse/SPARK-15710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: databricks community edition 2.0 preview >Reporter: Igor Fridman > > The following code throws an exception only with non-default database. If I > use 'default' database it works. > {code} > spark.sql("CREATE DATABASE IF NOT EXISTS test") > spark.sql("USE test") > df = spark.createDataFrame([ > (0, "a", 10), > (1, "b", 11), > (2, "c", 12), > (3, "a", 14), > (4, "a", 17), > (5, "c", 18) > ], ["id", "category", "age"]) > df.write.saveAsTable('test', mode='overwrite') > spark.sql("SELECT * FROM test WHERE id = 2").take(1) > {code} > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 13 df.write.saveAsTable('test', mode='overwrite') > 14 > ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1) > /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num) > 333 with SCCallSiteSync(self._sc) as css: > 334 port = > self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe( > --> 335 self._jdf, num) > 336 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 337 > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 931 answer = self.gateway_client.send_command(command) > 932 return_value = get_return_value( > --> 933 answer, self.gateway_client, self.target_id, self.name) > 934 > 935 for temp_arg in temp_args: > /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 310 raise Py4JJavaError( > 311 "An error occurred while calling {0}{1}{2}.\n". > --> 312 format(target_id, ".", name), value) > 313 else: > 314 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe. 
> : java.lang.ClassNotFoundException: > org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:264) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap(ParquetFilters.scala:319) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.createFilter(ParquetFilters.scala:231) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$7.apply(ParquetFileFormat.scala:309) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$7.apply(ParquetFileFormat.scala:309) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReader(ParquetFileFormat.scala:309) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:268) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:112)
[jira] [Commented] (SPARK-15710) Exception with WHERE clause in SQL for non-default Hive database
[ https://issues.apache.org/jira/browse/SPARK-15710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312836#comment-15312836 ] Xin Wu commented on SPARK-15710: Hmm. I can not recreate it on latest master branch. Here is my steps: {code} scala> spark.sql("create database if not exists test") res0: org.apache.spark.sql.DataFrame = [] scala> spark.sql("use test") res1: org.apache.spark.sql.DataFrame = [] scala> case class AgeData(id: Int, category: String, age: Int) defined class AgeData scala> val ds = spark.createDataFrame( Seq(AgeData(0, "a", 10), AgeData(1, "b", 11), AgeData(2, "c", 12))) ds: org.apache.spark.sql.DataFrame = [id: int, category: string ... 1 more field] scala> ds.show +---++---+ | id|category|age| +---++---+ | 0| a| 10| | 1| b| 11| | 2| c| 12| +---++---+ scala> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable("test2") scala> spark.sql("select * from test2 where id = 2").show +---++---+ | id|category|age| +---++---+ | 2| c| 12| +---++---+ scala> spark.sql("select * from test2 where id = 2").take(1) res9: Array[org.apache.spark.sql.Row] = Array([2,c,12]) {code} > Exception with WHERE clause in SQL for non-default Hive database > > > Key: SPARK-15710 > URL: https://issues.apache.org/jira/browse/SPARK-15710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: databricks community edition 2.0 preview >Reporter: Igor Fridman > > The following code throws an exception only with non-default database. If I > use 'default' database it works. > {code} > spark.sql("CREATE DATABASE IF NOT EXISTS test") > spark.sql("USE test") > df = spark.createDataFrame([ > (0, "a", 10), > (1, "b", 11), > (2, "c", 12), > (3, "a", 14), > (4, "a", 17), > (5, "c", 18) > ], ["id", "category", "age"]) > df.write.saveAsTable('test', mode='overwrite') > spark.sql("SELECT * FROM test WHERE id = 2").take(1) > {code} > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 13 df.write.saveAsTable('test', mode='overwrite') > 14 > ---> 15 spark.sql("SELECT * FROM test WHERE id = 2").take(1) > /databricks/spark/python/pyspark/sql/dataframe.py in take(self, num) > 333 with SCCallSiteSync(self._sc) as css: > 334 port = > self._sc._jvm.org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe( > --> 335 self._jdf, num) > 336 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 337 > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 931 answer = self.gateway_client.send_command(command) > 932 return_value = get_return_value( > --> 933 answer, self.gateway_client, self.target_id, self.name) > 934 > 935 for temp_arg in temp_args: > /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 310 raise Py4JJavaError( > 311 "An error occurred while calling {0}{1}{2}.\n". > --> 312 format(target_id, ".", name), value) > 313 else: > 314 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe. 
> : java.lang.ClassNotFoundException: > org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:264) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap$lzycompute(ParquetFilters.scala:321) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.relaxParquetValidTypeMap(ParquetFilters.scala:319) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$.createFilter(ParquetFilters.scala:231) > at >
[jira] [Commented] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311409#comment-15311409 ] Xin Wu commented on SPARK-14959: I can recreate the problem with hdfs location. and I have a patch for it now. I will submit a PR soon. The actual results now is following, as expected: {code} scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ {code} > Problem Reading partitioned ORC or Parquet files > - > > Key: SPARK-14959 > URL: https://issues.apache.org/jira/browse/SPARK-14959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4) >Reporter: Sebastian YEPES FERNANDEZ >Priority: Blocker > > Hello, > I have noticed that in the pasts days there is an issue when trying to read > partitioned files from HDFS. > I am running on Spark master branch #c544356 > The write actually works but the read fails. > {code:title=Issue Reproduction} > case class Data(id: Int, text: String) > val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, > "world"), Data(1, "there")) ) > scala> > ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> java.io.FileNotFoundException: Path is not a file: > /user/spark/test.parquet/id=0 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228) > at >
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Description: Currently SparkContext API setLogLevel(level: String) can not handle lower case or mixed case input string. But org.apache.log4j.Level.toLevel can take lowercase or mixed case. was: Currently SparkContext API setLogLevel(level: String) can not handle lower case or mixed case input string. But org.apache.log4j.Level.toLevel can take lowercase or mixed case. Also resetLogLevel to original configuration could be helpful for users to switch log level for different diagnostic purposes. > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently SparkContext API setLogLevel(level: String) can not handle lower > case or mixed case input string. But org.apache.log4j.Level.toLevel can take > lowercase or mixed case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
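To make the requested behavior concrete, a small illustration (the lowercase and mixed-case calls only work once this change is in):
{code}
// sc is the SparkContext from spark-shell. Today only the exact uppercase level names
// are accepted; the change requested here makes the lookup case-insensitive, matching
// what org.apache.log4j.Level.toLevel already handles.
sc.setLogLevel("WARN")   // works today
sc.setLogLevel("warn")   // should work after this change
sc.setLogLevel("Info")   // mixed case should map to INFO as well
{code}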
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Summary: Allow case-insensitiveness in sc.setLogLevel (was: Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel) > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently SparkContext API setLogLevel(level: String) can not handle lower > case or mixed case input string. But org.apache.log4j.Level.toLevel can take > lowercase or mixed case. > Also resetLogLevel to original configuration could be helpful for users to > switch log level for different diagnostic purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel
Xin Wu created SPARK-15681: -- Summary: Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel Key: SPARK-15681 URL: https://issues.apache.org/jira/browse/SPARK-15681 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Xin Wu Currently the SparkContext API setLogLevel(level: String) can not handle lowercase or mixed-case input strings, although org.apache.log4j.Level.toLevel can. Also, supporting resetLogLevel to restore the original configuration could be helpful for users who switch log levels for different diagnostic purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-14361: --- Issue Type: New Feature (was: Improvement) > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the exclusion clause in Window > function framing, which is part of ANSI SQL2003’s Window syntax. For example, > IBM Netezza fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). > We propose to implement it in this JIRA.. > The ANSI SQL2003's Window Syntax: > {code} > FUNCTION_NAME(expr) OVER {window_name | (window_specification)} > window_specification ::= [window_name] [partitioning] [ordering] [framing] > partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] > ordering ::= ORDER [SIBLINGS] BY rule[, rule...] > rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] > framing ::= {ROWS | RANGE} {start | between} [exclusion] > start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} > between ::= BETWEEN bound AND bound > bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} > exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE > NO OTHERS} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
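A concrete query using the proposed exclusion clause, for illustration only; the EXCLUDE syntax is not accepted by Spark SQL at the time of this JIRA, and the table and column names are hypothetical:
{code}
// Average age of nearby rows in the same category, excluding the current row itself,
// written with the ANSI SQL:2003 framing exclusion from the grammar above.
spark.sql(
  """SELECT id, category, age,
    |       avg(age) OVER (PARTITION BY category ORDER BY age
    |                      ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    |                      EXCLUDE CURRENT ROW) AS avg_neighbors
    |FROM people""".stripMargin)
{code}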
[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in RDD[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299167#comment-15299167 ] Xin Wu commented on SPARK-15463: I am looking into this. > Support for creating a dataframe from CSV in RDD[String] > > > Key: SPARK-15463 > URL: https://issues.apache.org/jira/browse/SPARK-15463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: PJ Fanning > > I currently use Databrick's spark-csv lib but some features don't work with > Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV > support into spark-sql directly, that spark-csv won't be modified. > I currently read some CSV data that has been pre-processed and is in > RDD[String] format. > There is sqlContext.read.json(rdd: RDD[String]) but other formats don't > appear to support the creation of DataFrames based on loading from > RDD[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
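To illustrate the gap, a sketch contrasting the JSON entry point the reporter mentions with the CSV analog being requested (the CSV call is hypothetical and does not exist at this point):
{code}
// Existing: a DataFrame can be built from JSON records held in an RDD[String].
val jsonLines = sc.parallelize(Seq("""{"signal":"a","flash":1}"""))
val fromJson = spark.read.json(jsonLines)

// Requested (hypothetical at the time of this JIRA): the equivalent for pre-processed CSV.
val csvLines = sc.parallelize(Seq("a,1", "b,2"))
// val fromCsv = spark.read.csv(csvLines)
{code}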
[jira] [Created] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively
Xin Wu created SPARK-15431: -- Summary: Support LIST FILE(s)|JAR(s) command natively Key: SPARK-15431 URL: https://issues.apache.org/jira/browse/SPARK-15431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Currently the "ADD FILE|JAR" command is supported natively in Spark SQL. However, once such a command runs, the added file/jar cannot be looked up with a "LIST FILE(s)|JAR(s)" command, because LIST is either passed through to the Hive command processor in spark-sql or simply not supported in spark-shell. There is no way for users to find out which files/jars have been added to the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
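For illustration, the commands in question: the ADD FILE form already works natively, the LIST forms show the proposed syntax, and the file path is hypothetical:
{code}
// ADD FILE is handled natively by Spark SQL today, but there is no native command to
// list what has been added. The LIST commands below are the proposal in this JIRA.
spark.sql("ADD FILE /tmp/lookup.txt")   // supported natively
spark.sql("LIST FILES")                 // proposed: show files added to the context
spark.sql("LIST JARS")                  // proposed: show jars added to the context
{code}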
[jira] [Commented] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282297#comment-15282297 ] Xin Wu commented on SPARK-15236: I am looking into this. > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15269) Creating external table leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282215#comment-15282215 ] Xin Wu edited comment on SPARK-15269 at 5/12/16 11:04 PM: -- FYI.. The reason why the default database paths obtained by different ways are different as mentioned above, is that I have an older metastore_db in my SPARK_HOME, where the metastore database keeps the old hive.metastore.warehouse.dir value (/user/hive/warehouse). After I removed this metastore_db, I get the database path consistent now. Testing the fix for #2 now. Will submit PR soon. was (Author: xwu0226): FYI.. The reason why the default database paths obtained by different ways are different as mentioned above, is that I have an older metastore_db in my SPARK_HOME, where the metastore database keeps the old hive.metastore.warehouse.dir value (/user/hive/warehouse). After I removed this metastore_db, I get the database path consistent now. > Creating external table leaves empty directory under warehouse directory > > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
[jira] [Commented] (SPARK-15269) Creating external table leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282215#comment-15282215 ] Xin Wu commented on SPARK-15269: FYI.. The reason why the default database paths obtained by different ways are different as mentioned above, is that I have an older metastore_db in my SPARK_HOME, where the metastore database keeps the old hive.metastore.warehouse.dir value (/user/hive/warehouse). After I removed this metastore_db, I get the database path consistent now. > Creating external table leaves empty directory under warehouse directory > > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. > When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:186) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:167) > [info] at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) > [info] at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) >
[jira] [Commented] (SPARK-15269) Creating external table leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281748#comment-15281748 ] Xin Wu commented on SPARK-15269: Yes, I can . Thanks! > Creating external table leaves empty directory under warehouse directory > > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. > When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:186) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:167) > [info] at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) > [info] at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) > [info] at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59) > [info] at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59) > [info] at >
[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281621#comment-15281621 ] Xin Wu edited comment on SPARK-15269 at 5/12/16 3:37 PM: - For the case where we cannot recreate this issue, it is because the default database path we got at {code}if (!new CaseInsensitiveMap(options).contains("path")) { isExternal = false options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent)) } else { options }{code} is different from hive metastore's default warehouse dir. They are "/user/hive/warehouse" and "/spark-warehouse", respectively. When creating the first table, hive metastore's default warehouse dir is "/spark-warehouse", while when creating the second table without the PATH option, the sessionState.catalog.defaultTablePath returns "/user/hive/warehouse". Therefore, the 2nd table creation will not hit the issue. But the first table still leaves the empty table directory behind after being dropped. Two questions: 1. Should we keep these two default database paths consistent? 2. If they are consistent, we will hit the issue reported in this JIRA. Then, can we also assign the provided path to the CatalogTable.storage.locationURI, even though newSparkSQLSpecificMetastoreTable is called in createDataSourceTables for a non-hive compatible metastore table? This will avoid leaving hive metastore to pick the default path for the table. was (Author: xwu0226): For the case where we cannot recreate this issue, it is because the default database path we got at {code}if (!new CaseInsensitiveMap(options).contains("path")) { isExternal = false options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent)) } else { options }{code} is different from hive metastore's default warehouse dir. They are "/user/hive/warehouse" and "/spark-warehouse", respectively. When creating the first table, hive metastore's default warehouse dir is "/spark-warehouse", while when creating the second table without the PATH option, the sessionState.catalog.defaultTablePath returns "/user/hive/warehouse". Therefore, the 2nd table creation will not hit the issue. But the first table still leaves the empty table directory behind after being dropped. Two questions: 1. Should we keep these two default database paths consistent? 2. If they are consistent, we will hit the issue reported in this JIRA. Then, can we also assign the provided path to the CatalogTable.storage.locationURI, even though newSparkSQLSpecificMetastoreTable is called in createDataSourceTables for a non-hive compatible metastore table? > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) >
[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281621#comment-15281621 ] Xin Wu commented on SPARK-15269: For the case where we cannot recreate this issue, it is because the default database path we got at {code}if (!new CaseInsensitiveMap(options).contains("path")) { isExternal = false options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent)) } else { options }{code} is different from hive metastore's default warehouse dir. They are "/user/hive/warehouse" and "/spark-warehouse", respectively. When creating the first table, hive metastore's default warehouse dir is "/spark-warehouse", while when creating the second table without the PATH option, the sessionState.catalog.defaultTablePath returns "/user/hive/warehouse". Therefore, the 2nd table creation will not hit the issue. But the first table still leaves the empty table directory behind after being dropped. Two questions: 1. Should we keep these two default database paths consistent? 2. If they are consistent, we will hit the issue reported in this JIRA. Then, can we also assign the provided path to the CatalogTable.storage.locationURI, even though newSparkSQLSpecificMetastoreTable is called in createDataSourceTables for a non-hive compatible metastore table? > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at >
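Editor's note: the decision quoted in the comment above, restated in isolation as a small sketch; the names are illustrative and this is not the actual Spark source.
{code}
// If no "path" option is supplied, the table is treated as managed and receives
// the catalog's default table path; otherwise the user-supplied path wins and
// the table is external. The mismatch discussed above appears when the default
// table path is derived from a different warehouse directory than the one the
// Hive metastore itself uses.
def resolveOptions(options: Map[String, String], defaultTablePath: String): (Map[String, String], Boolean) = {
  val hasPath = options.keys.exists(_.equalsIgnoreCase("path"))
  if (!hasPath) (options + ("path" -> defaultTablePath), false) // managed
  else (options, true)                                          // external
}
{code}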
[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877 ] Xin Wu edited comment on SPARK-15269 at 5/11/16 10:13 PM: -- The root cause may be the following: When the first table is created as an external table with the data source path, but as json, createDataSourceTables considers it as a non-hive compatible table because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is invoked to create the CatalogTable before asking HiveClient to create the metastore table. In this call, locationURI is not set. So when we convert CatalogTable to HiveTable before passing to Hive Metastore, hive table's data location is not set. Then, Hive metastore implicitly creates a data location as /tableName, which is {code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code} in this JIRA. I also verified that creating an external table directly in Hive shell without a path will result in a default table directory created by hive. Then, even after dropping the table, hive will not delete this stealth directory because the table is external. When we create the 2nd table with select and without a path, the table is created as a managed table, with a default path provided in the options: {code}val optionsWithPath = if (!new CaseInsensitiveMap(options).contains("path")) { isExternal = false options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent)) } else { options }{code} This default path happens to be hive's warehouse directory + the table name, which is the same as the one hive metastore implicitly created earlier for the 1st table. So when trying to write the provided data to this data source table by {code} val plan = InsertIntoHadoopFsRelation( outputPath, partitionColumns.map(UnresolvedAttribute.quoted), bucketSpec, format, () => Unit, // No existing table needs to be refreshed. options, data.logicalPlan, mode){code}, InsertIntoHadoopFsRelation complains about the path existence since the SaveMode is SaveMode.ErrorIfExists. was (Author: xwu0226): The root cause may be the following: When the first table is created as an external table with the data source path, but as json, createDataSourceTables considers it as a non-hive compatible table because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is invoked to create the CatalogTable before asking HiveClient to create the metastore table. In this call, locationURI is not set. So when we convert CatalogTable to HiveTable before passing to Hive Metastore, hive table's data location is not set. Then, Hive metastore implicitly creates a data location as /tableName, which is {code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code} in this JIRA. I also verified that creating an external table directly in Hive shell without a path will result in a default table directory created by hive. Then, even after dropping the table, hive will not delete this stealth directory because the table is external. > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. 
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. > When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path >
[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280888#comment-15280888 ] Xin Wu commented on SPARK-15269: In spark-shell, I can recreate it as following: {code} scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269") Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269 scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')") 16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. going through newSparkSQLSpecificMetastoreTable() res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("drop table spark_15269") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("create table spark_15269 using json as select 1 as a") org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:418) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:229) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) ... 48 elided {code} > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is
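Editor's note: a possible manual cleanup between the DROP TABLE and the second CREATE TABLE in the transcript above, sketched with the Hadoop FileSystem API. The warehouse path is the one from the error message; this is a workaround sketch, not a fix for the underlying issue.
{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the empty directory the metastore left behind for the dropped table,
// so the following CTAS no longer fails with "path ... already exists".
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val leftover = new Path("/user/hive/warehouse/spark_15269")
if (fs.exists(leftover)) fs.delete(leftover, true)

spark.sql("create table spark_15269 using json as select 1 as a")
{code}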
[jira] [Comment Edited] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877 ] Xin Wu edited comment on SPARK-15269 at 5/11/16 9:47 PM: - The root cause may be the following: When the first table is created as an external table with the data source path, but as json, createDataSourceTables considers it as a non-hive compatible table because json is not a Hive SerDe. Then, newSparkSQLSpecificMetastoreTable is invoked to create the CatalogTable before asking HiveClient to create the metastore table. In this call, locationURI is not set. So when we convert CatalogTable to HiveTable before passing to Hive Metastore, hive table's data location is not set. Then, Hive metastore implicitly creates a data location as /tableName, which is {code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code} in this JIRA. I also verified that creating an external table directly in Hive shell without a path will result in a default table directory created by hive. Then, even after dropping the table, hive will not delete this stealth directory because the table is external. was (Author: xwu0226): The root cause may be the following: When the first table is created as an external table with the data source path, but as `json`, `createDataSourceTables` considers it as a non-hive compatible table because `json` is not a Hive SerDe. Then, `newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` before asking `HiveClient` to create the metastore table. In this call, `locationURI` is not set. So when we convert CatalogTable to HiveTable before passing to Hive Metastore, hive table's data location is not set. Then, Hive metastore implicitly creates a data location as `/tableName`, which is `/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1` in this JIRA. I also verified that creating an external table directly in Hive shell without a path will result in a default table directory created by hive. Then, even after dropping the table, hive will not delete this stealth directory because the table is external. > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at >
[jira] [Commented] (SPARK-15269) Creating external table in test code leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280877#comment-15280877 ] Xin Wu commented on SPARK-15269: The root cause may be the following: When the first table is created as an external table with the data source path, but as `json`, `createDataSourceTables` considers it as a non-hive compatible table because `json` is not a Hive SerDe. Then, `newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` before asking `HiveClient` to create the metastore table. In this call, `locationURI` is not set. So when we convert CatalogTable to HiveTable before passing to Hive Metastore, hive table's data location is not set. Then, Hive metastore implicitly creates a data location as `/tableName`, which is `/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1` in this JIRA. I also verified that creating an external table directly in Hive shell without a path will result in a default table directory created by hive. Then, even after dropping the table, hive will not delete this stealth directory because the table is external. > Creating external table in test code leaves empty directory under warehouse > directory > - > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > It seems that this issue doesn't affect production code. I couldn't reproduce > it using Spark shell. > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at >
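Editor's note: a user-level workaround sketch in the spirit of the suggestion above, giving the table an explicit path so neither Spark nor the Hive metastore falls back to a default location under the warehouse directory. The path and table name are hypothetical, and it assumes the "path" option is honored by saveAsTable in the build under test.
{code}
// Hypothetical path/table name; an explicit location avoids colliding with the
// leftover default directory discussed above.
spark.range(1).selectExpr("1 AS a")
  .write
  .format("json")
  .option("path", "/tmp/spark-15269/ddl_test1")
  .saveAsTable("ddl_test1")
{code}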
[jira] [Updated] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause
[ https://issues.apache.org/jira/browse/SPARK-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15206: --- Issue Type: Test (was: Bug) > Add testcases for Distinct Aggregation in Having clause > --- > > Key: SPARK-15206 > URL: https://issues.apache.org/jira/browse/SPARK-15206 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > This is the followup jira for https://github.com/apache/spark/pull/12974. We > will add test cases for including distinct aggregate function in having > clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause
Xin Wu created SPARK-15206: -- Summary: Add testcases for Distinct Aggregation in Having clause Key: SPARK-15206 URL: https://issues.apache.org/jira/browse/SPARK-15206 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu This is the followup jira for https://github.com/apache/spark/pull/12974. We will add test cases for including distinct aggregate function in having clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
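Editor's note: a sketch of the kind of test case this JIRA describes, reusing the query from SPARK-14495 and assuming the sql/checkAnswer helpers available in Spark's QueryTest-based suites.
{code}
import org.apache.spark.sql.Row

test("distinct aggregate in HAVING clause") {
  checkAnswer(
    sql(
      """select date, count(distinct id)
        |from (select '2010-01-01' as date, 1 as id) tmp
        |group by date
        |having count(distinct id) > 0""".stripMargin),
    Row("2010-01-01", 1L) :: Nil)
}
{code}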
[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271956#comment-15271956 ] Xin Wu edited comment on SPARK-14495 at 5/5/16 6:25 AM: [~smilegator] I got the fix and running regtest now. Will submit the PR once it is done. was (Author: xwu0226): [~smilegator] I got the fix and running regtest now. Will submit the PR one it is done. > Distinct aggregation cannot be used in the having clause > > > Key: SPARK-14495 > URL: https://issues.apache.org/jira/browse/SPARK-14495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yin Huai > > {code} > select date, count(distinct id) > from (select '2010-01-01' as date, 1 as id) tmp > group by date > having count(distinct id) > 0; > org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 > missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if > ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], > [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271956#comment-15271956 ] Xin Wu commented on SPARK-14495: [~smilegator] I got the fix and running regtest now. Will submit the PR one it is done. > Distinct aggregation cannot be used in the having clause > > > Key: SPARK-14495 > URL: https://issues.apache.org/jira/browse/SPARK-14495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yin Huai > > {code} > select date, count(distinct id) > from (select '2010-01-01' as date, 1 as id) tmp > group by date > having count(distinct id) > 0; > org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 > missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if > ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], > [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268071#comment-15268071 ] Xin Wu commented on SPARK-15044: Sorry. What I meant was that after I removed the path manually, then did the alter table drop partition command in spark sql, then, I can do select. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in spark 1.6.1, if I use spark 1.4.0, It is OK > I think spark-sql should ignore the path, just like hive or it dose in early > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
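Editor's note: a sketch of the workaround discussed in this thread; on 1.6 the statements would go through a HiveContext (here called sqlContext), and the table and partition names follow the reproduction steps above.
{code}
// Drop the partition whose directory was removed by hand; the metastore entry
// is what makes the planner list the missing path. The query then returns
// 0 rows instead of failing with "Input path does not exist".
sqlContext.sql("ALTER TABLE test DROP PARTITION (p='1')")
sqlContext.sql("SELECT n FROM test WHERE p='1'").show()
{code}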
[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267576#comment-15267576 ] Xin Wu edited comment on SPARK-14927 at 5/2/16 10:01 PM: - right now, when a datasource table is created with partition, it is not a hive compatiable table. So maybe need to create the table like {code}create table tmp.tmp1 (val string) partitioned by (year int) stored as parquet location '' {code} Then insert into the table with a temp table that is derived from the dataframe. Something I tried below {code} scala> df.show ++---+ |year|val| ++---+ |2012| a| |2013| b| |2014| c| ++---+ scala> val df1 = spark.sql("select * from t000 where year = 2012") df1: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df1.registerTempTable("df1") scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from df1") scala> val df2 = spark.sql("select * from t000 where year = 2013") df2: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df2.registerTempTable("df2") scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from df2") 16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3 16/05/02 14:47:34 WARN log: Updated size to 327 res54: org.apache.spark.sql.DataFrame = [] scala> spark.sql("show partitions tmp.ptest3").show +-+ | result| +-+ |year=2012| |year=2013| +-+ {code} This is a bit hacky though. There should be a better solution for your problem. And this is on spark 2.0. Try if 1.6 can take this. was (Author: xwu0226): right now, when a datasource table is created with partition, it is not a hive compatiable table. So maybe need to create the table like {code}create table tmp.tmp1 (val string) partitioned by (year int) stored as parquet location '' {code} Then insert into the table with a temp table that is derived from the dataframe. Something I tried below {code} scala> df.show ++---+ |year|val| ++---+ |2012| a| |2013| b| |2014| c| ++---+ scala> val df1 = spark.sql("select * from t000 where year = 2012") df1: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df1.registerTempTable("df1") scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from df1") scala> val df2 = spark.sql("select * from t000 where year = 2013") df2: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df2.registerTempTable("df2") scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from df2") 16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3 16/05/02 14:47:34 WARN log: Updated size to 327 res54: org.apache.spark.sql.DataFrame = [] scala> spark.sql("show partitions tmp.ptest3").show +-+ | result| +-+ |year=2012| |year=2013| +-+ {code} This is a bit hacky though. hope someone has a better solution for your problem. And this is on spark 2.0. Try if 1.6 can take this. > DataFrame. saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. 
Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that >
[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267576#comment-15267576 ] Xin Wu commented on SPARK-14927: Right now, when a datasource table is created with a partition, it is not a hive compatible table. So you may need to create the table like {code}create table tmp.tmp1 (val string) partitioned by (year int) stored as parquet location '' {code} Then insert into the table with a temp table that is derived from the dataframe. Something I tried is below: {code} scala> df.show ++---+ |year|val| ++---+ |2012| a| |2013| b| |2014| c| ++---+ scala> val df1 = spark.sql("select * from t000 where year = 2012") df1: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df1.registerTempTable("df1") scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from df1") scala> val df2 = spark.sql("select * from t000 where year = 2013") df2: org.apache.spark.sql.DataFrame = [year: int, val: string] scala> df2.registerTempTable("df2") scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from df2") 16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3 16/05/02 14:47:34 WARN log: Updated size to 327 res54: org.apache.spark.sql.DataFrame = [] scala> spark.sql("show partitions tmp.ptest3").show +-+ | result| +-+ |year=2012| |year=2013| +-+ {code} This is a bit hacky though. I hope someone has a better solution for your problem. And this is on spark 2.0. Try whether 1.6 can take this. > DataFrame. saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that > `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` > always creates table with empty partitions. > Any help to move this forward is appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
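Editor's note: a sketch of an alternative to the per-partition inserts above. Once the partitioned Hive table exists, DataFrameWriter.insertInto with Hive dynamic partitioning enabled can load all partitions in one statement; hc and tmp.ptest3 are the HiveContext and table from this thread, and the position-based column order (partition column last) is the usual requirement for insertInto.
{code}
import org.apache.spark.sql.SaveMode

hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.select("val", "year")        // data columns first, partition column (year) last
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.ptest3")
{code}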
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266116#comment-15266116 ] Xin Wu commented on SPARK-15044: I tried {code}alter table test drop partition (p=1){code} , then the select will return 0 rows without exception. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in spark 1.6.1, if I use spark 1.4.0, It is OK > I think spark-sql should ignore the path, just like hive or it dose in early > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
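A minimal sketch of the workaround mentioned in the comment above, using the reporter's table and partition names. It is written in spark-shell style; on 1.6 the equivalent calls go through a HiveContext.
{code}
// Sketch only, assuming the reporter's table "test" partitioned by p.
// After the partition directory was removed by hand, drop the stale metastore entry
// so the scan no longer tries to list the missing path.
spark.sql("alter table test drop partition (p='1')")

// The query that previously failed with InvalidInputException now simply returns 0 rows.
spark.sql("select n from test where p = '1'").show()
{code}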
[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069 ] Xin Wu edited comment on SPARK-14495 at 5/2/16 2:21 AM: I can recreate it on branch-1.6. and another workaround is using alias for the aggregate expression {code} scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select '2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show +--+---+ | date|cnt| +--+---+ |2010-01-01| 1| +--+---+ {code} was (Author: xwu0226): I can recreated it on branch-1.6. and another workaround is using alias for the aggregate expression {code} scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select '2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show +--+---+ | date|cnt| +--+---+ |2010-01-01| 1| +--+---+ {code} > Distinct aggregation cannot be used in the having clause > > > Key: SPARK-14495 > URL: https://issues.apache.org/jira/browse/SPARK-14495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yin Huai > > {code} > select date, count(distinct id) > from (select '2010-01-01' as date, 1 as id) tmp > group by date > having count(distinct id) > 0; > org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 > missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if > ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], > [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
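Besides the alias workaround shown above, another possibility (offered only as an untested sketch) is to compute the distinct aggregate in a subquery and filter on it in the outer query, so nothing in a HAVING clause has to re-resolve the aggregate.
{code}
// Sketch of an alternative workaround on branch-1.6, using the same toy data as the report.
sqlContext.sql("""
  SELECT date, cnt
  FROM (
    SELECT date, count(distinct id) AS cnt
    FROM (SELECT '2010-01-01' AS date, 1 AS id) tmp
    GROUP BY date
  ) agg
  WHERE cnt > 0
""").show()
{code}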
[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266069#comment-15266069 ] Xin Wu commented on SPARK-14495: I can recreated it on branch-1.6. and another workaround is using alias for the aggregate expression {code} scala> sqlContext.sql("SELECT date, count(distinct id) as cnt from (select '2010-01-01' as date, 1 as id) tmp group by date having cnt > 0").show +--+---+ | date|cnt| +--+---+ |2010-01-01| 1| +--+---+ {code} > Distinct aggregation cannot be used in the having clause > > > Key: SPARK-14495 > URL: https://issues.apache.org/jira/browse/SPARK-14495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yin Huai > > {code} > select date, count(distinct id) > from (select '2010-01-01' as date, 1 as id) tmp > group by date > having count(distinct id) > 0; > org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 > missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if > ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], > [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code} -- This message was sent by Atlassian 
JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265591#comment-15265591 ] Xin Wu commented on SPARK-14927: Since Spark 2.0.0 has moved around a lot of stuff, including splitting the HiveMetaStoreCatalog into 2 files for resolving and creating tables, respectively, I would try this on Spark 2.0.0. {code}scala> spark.sql("create database if not exists tmp") 16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning NoSuchObjectException res23: org.apache.spark.sql.DataFrame = [] scala> df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1") 16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1 scala> spark.sql("select * from tmp.tmp1").show +---++ |val|year| +---++ | a|2012| +---++ {code} For datasource table creation as above, SparkSQL will create the table as a hive internal table but not compatible with hive. SparkSQL puts partition column information (actually including also other things like column schema, bucket/sort columns) into serdeInfo.parameters. When querying the table, SparkSQL resolve the table and parse the information back from serdeInfo.parameters. Spark 2.0.0 does not pass this command to Hive anymore (actually most of DDL commands are run natively in SparkSQL now), so when doing "SHOW PARTITIONS...", the command now does not support showing partitions for datasource table. {code} scala> spark.sql("show partitions tmp.tmp1").show org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a datasource table: tmp.tmp1; at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529) ... 48 elided {code} Hope this helps. > DataFrame. 
saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from >
[jira] [Commented] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
[ https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265127#comment-15265127 ] Xin Wu commented on SPARK-15025: I am testing a fix for this and will submit a PR soon. > creating datasource table with option (PATH) results in duplicate path key in > serdeProperties > - > > Key: SPARK-15025 > URL: https://issues.apache.org/jira/browse/SPARK-15025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > Repro: > {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as > a, 2 as b{code} > This will create a Hive external table whose dataLocation is > "/someDefaultPath", which is not the same as the provided one. Yet, > serdeInfo.parameters contains the following key-value pairs: > PATH, "/tmp/t1" > path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
Xin Wu created SPARK-15025: -- Summary: creating datasource table with option (PATH) results in duplicate path key in serdeProperties Key: SPARK-15025 URL: https://issues.apache.org/jira/browse/SPARK-15025 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Repro: {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as a, 2 as b{code} This will create a Hive external table whose dataLocation is "/someDefaultPath", which is not the same as the provided one. Yet, serdeInfo.parameters contains the following key-value pairs: PATH, "/tmp/t1" path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
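For anyone who wants to see the duplication first-hand, here is a hypothetical spark-shell walk-through of the repro. It assumes a Spark 2.0.0-SNAPSHOT build, the path is illustrative, and DESCRIBE EXTENDED is just one way to eyeball the persisted metadata; the exact output shape is not asserted.
{code}
// Hypothetical repro sketch; not asserting exact output.
spark.sql("""create table t1 using parquet options (PATH "/tmp/t1") as select 1 as a, 2 as b""")

// The table is persisted with the user-supplied PATH plus Spark's own lowercase path entry;
// the extended description should surface both serde parameters.
spark.sql("describe extended t1").show(100, false)
{code}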
[jira] [Commented] (SPARK-14933) Failed to create view out of a parquet or orc table
[ https://issues.apache.org/jira/browse/SPARK-14933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259084#comment-15259084 ] Xin Wu commented on SPARK-14933: I have a fix for this and will submit a PR soon. > Failed to create view out of a parquet or orc table > > > Key: SPARK-14933 > URL: https://issues.apache.org/jira/browse/SPARK-14933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Critical > Fix For: 2.0.0 > > > When I create a table as parquet or orc with following DDL: > {code} > create table t1 (c1 int, c2 string) stored as parquet; > create table t2 (c1 int, c2 string) stored as orc; > {code} > Then, do: > {code}create view v1 as select * from t1;{code} > The view creation fails because of following error: > {code} > Caused by: java.lang.UnsupportedOperationException: unsupported plan > Relation[c1#66,c2#67] HadoopFiles > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:191) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111) > at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:81) > at > org.apache.spark.sql.catalyst.LogicalPlanToSQLSuite.org$apache$spark$sql$catalyst$LogicalPlanToSQLSuite$$checkHiveQl(LogicalPlanToSQLSuite.scala:82) > ... 57 more > {code} > The error actually happens in the path of converting LogicalPlan to SQL for > the LogicalRelation of the HadoopFsRelation (t1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14933) Failed to create view out of a parquet or orc table
Xin Wu created SPARK-14933: -- Summary: Failed to create view out of a parquet or orc table Key: SPARK-14933 URL: https://issues.apache.org/jira/browse/SPARK-14933 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Priority: Critical Fix For: 2.0.0 When I create a table as parquet or orc with following DDL: {code} create table t1 (c1 int, c2 string) stored as parquet; create table t2 (c1 int, c2 string) stored as orc; {code} Then, do: {code}create view v1 as select * from t1;{code} The view creation fails because of following error: {code} Caused by: java.lang.UnsupportedOperationException: unsupported plan Relation[c1#66,c2#67] HadoopFiles at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:191) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149) at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:149) at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:208) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:111) at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:81) at org.apache.spark.sql.catalyst.LogicalPlanToSQLSuite.org$apache$spark$sql$catalyst$LogicalPlanToSQLSuite$$checkHiveQl(LogicalPlanToSQLSuite.scala:82) ... 57 more {code} The error actually happens in the path of converting LogicalPlan to SQL for the LogicalRelation of the HadoopFsRelation (t1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-14361: --- Description: The current Spark SQL does not support the exclusion clause in Window function framing, which is part of ANSI SQL2003’s Window syntax. For example, IBM Netezza fully supports it as shown in the https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). We propose to implement it in this JIRA.. The ANSI SQL2003's Window Syntax: {code} FUNCTION_NAME(expr) OVER {window_name | (window_specification)} window_specification ::= [window_name] [partitioning] [ordering] [framing] partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] ordering ::= ORDER [SIBLINGS] BY rule[, rule...] rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] framing ::= {ROWS | RANGE} {start | between} [exclusion] start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} between ::= BETWEEN bound AND bound bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS} {code} was: The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003’s `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link] (https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_wi ndow_aggregation_family_syntax.html). This PR is to fill the gap. > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the exclusion clause in Window > function framing, which is part of ANSI SQL2003’s Window syntax. For example, > IBM Netezza fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html). > We propose to implement it in this JIRA.. > The ANSI SQL2003's Window Syntax: > {code} > FUNCTION_NAME(expr) OVER {window_name | (window_specification)} > window_specification ::= [window_name] [partitioning] [ordering] [framing] > partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] > ordering ::= ORDER [SIBLINGS] BY rule[, rule...] > rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] > framing ::= {ROWS | RANGE} {start | between} [exclusion] > start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} > between ::= BETWEEN bound AND bound > bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} > exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE > NO OTHERS} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
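To make the framing exclusion concrete, here is one of the use cases from this JIRA written in the proposed syntax (illustrative only; today's parser rejects the EXCLUDE clause): compare each employee's salary with the average of colleagues in the same department whose age is within five years, leaving the current row out of the frame.
{code}
-- Proposed syntax, not yet supported by Spark SQL; mirrors the EMPLOYEE example in this JIRA.
SELECT name, dept_id, salary, age,
       AVG(salary) OVER (PARTITION BY dept_id
                         ORDER BY age
                         RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING
                         EXCLUDE CURRENT ROW) AS avg_within_5_years
FROM employee
{code}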
[jira] [Comment Edited] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223702#comment-15223702 ] Xin Wu edited comment on SPARK-14361 at 4/4/16 6:06 AM: [~hvanhovell] Since you coded the whole window function.. I would like to have you take a look at the PR proposal.. I will submit a PR soon. was (Author: xwu0226): [~smilegator][~dkbiswal] > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the *exclude* clause in Window > function framing clause, which is part of ANSI SQL2003's Window syntax. For > example, IBM Netezza fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).. > We propose to implement it in the JIRA. > The ANSI SQL2003's Window syntax: > {code} > FUNCTION_NAME(expr) OVER {window_name | (window_specification)} > window_specification ::= [window_name] [partitioning] [ordering] [framing] > partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] > ordering ::= ORDER [SIBLINGS] BY rule[, rule...] > rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] > framing ::= {ROWS | RANGE} {start | between} [exclusion] > start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} > between ::= BETWEEN bound AND bound > bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} > exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE > NO OTHERS} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223702#comment-15223702 ] Xin Wu commented on SPARK-14361: [~smilegator][~dkbiswal] > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the *exclude* clause in Window > function framing clause, which is part of ANSI SQL2003's Window syntax. For > example, IBM Netezza fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).. > We propose to implement it in the JIRA. > The ANSI SQL2003's Window syntax: > {code} > FUNCTION_NAME(expr) OVER {window_name | (window_specification)} > window_specification ::= [window_name] [partitioning] [ordering] [framing] > partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] > ordering ::= ORDER [SIBLINGS] BY rule[, rule...] > rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] > framing ::= {ROWS | RANGE} {start | between} [exclusion] > start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} > between ::= BETWEEN bound AND bound > bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} > exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE > NO OTHERS} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-14361: --- Description: The current Spark SQL does not support the *exclude* clause in Window function framing clause, which is part of ANSI SQL2003's Window syntax. For example, IBM Netezza fully supports it as shown in the https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).. We propose to implement it in the JIRA. The ANSI SQL2003's Window syntax: {code} FUNCTION_NAME(expr) OVER {window_name | (window_specification)} window_specification ::= [window_name] [partitioning] [ordering] [framing] partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] ordering ::= ORDER [SIBLINGS] BY rule[, rule...] rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] framing ::= {ROWS | RANGE} {start | between} [exclusion] start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} between ::= BETWEEN bound AND bound bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS} {code} was:The current Spark SQL does not support the {code}exclude{code} clause in Window function framing clause, which is part of ANSI SQL2003's > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the *exclude* clause in Window > function framing clause, which is part of ANSI SQL2003's Window syntax. For > example, IBM Netezza fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_window_aggregation_family_syntax.html).. > We propose to implement it in the JIRA. > The ANSI SQL2003's Window syntax: > {code} > FUNCTION_NAME(expr) OVER {window_name | (window_specification)} > window_specification ::= [window_name] [partitioning] [ordering] [framing] > partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] > ordering ::= ORDER [SIBLINGS] BY rule[, rule...] > rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] > framing ::= {ROWS | RANGE} {start | between} [exclusion] > start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} > between ::= BETWEEN bound AND bound > bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} > exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE > NO OTHERS} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14361) Support EXCLUDE clause in Window function framing
[ https://issues.apache.org/jira/browse/SPARK-14361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-14361: --- Description: The current Spark SQL does not support the {code}exclude{code} clause in Window function framing clause, which is part of ANSI SQL2003's (was: The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003’s `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link] (https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_wi ndow_aggregation_family_syntax.html). We propose to support it in this JIRA. # Introduction Below is the ANSI SQL2003’s `Window` syntax: ``` FUNCTION_NAME(expr) OVER {window_name | (window_specification)} window_specification ::= [window_name] [partitioning] [ordering] [framing] partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] ordering ::= ORDER [SIBLINGS] BY rule[, rule...] rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] framing ::= {ROWS | RANGE} {start | between} [exclusion] start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} between ::= BETWEEN bound AND bound bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS} ``` Exclusion clause can be used to excluded certain rows from the window framing when calculating window aggregation function (e.g. AVG, SUM, MAX, MIN, COUNT, etc) related to current row. Types of window functions that are not supported are listed below: 1. Offset functions, such as lead(), lag() 2. Ranking functions, such as rank(), dense_rank(), percent_rank(), cume_dist, ntile() 3. Row number function, such as row_number() # Definition Syntax | Description | - EXCLUDE CURRENT ROW | Specifies excluding the current row. EXCLUDE GROUP | Specifies excluding the current row and all rows that are tied with it. Ties occur when there is a match on the order column or columns. EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the default if you specify no exclusion. EXCLUDE TIES | Specifies excluding all rows that are tied with the current row (peer rows), but retaining the current row. # Use-case Examples: - Let's say you want to find out for every employee, where is his/her salary at compared to the average salary of those within the same department and whose ages are within 5 years younger and older. The query could be: ```SQL SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR OVER(PARTITION BY DEPT_ID ORDER BY AGE RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING EXCLUDE CURRENT ROW) FROM EMPLOYEE ``` - Let's say you want to compare every customer's yearly purchase with other customers' average yearly purchase who are at different age group from the current customer. 
The query could be: ```SQL SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE) OVER(PARTITION BY PROD_CATEGORY ORDER BY AGE RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING EXCLUDE GROUP) FROM CUSTOMER_PURCHASE_SUM ```) > Support EXCLUDE clause in Window function framing > - > > Key: SPARK-14361 > URL: https://issues.apache.org/jira/browse/SPARK-14361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > The current Spark SQL does not support the {code}exclude{code} clause in > Window function framing clause, which is part of ANSI SQL2003's -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14361) Support EXCLUDE clause in Window function framing
Xin Wu created SPARK-14361: -- Summary: Support EXCLUDE clause in Window function framing Key: SPARK-14361 URL: https://issues.apache.org/jira/browse/SPARK-14361 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu The current Spark SQL does not support the `exclusion` clause, which is part of ANSI SQL2003’s `Window` syntax. For example, IBM Netezza fully supports it as shown in the [document web link] (https://www.ibm.com/support/knowledgecenter/SSULQD_7.1.0/com.ibm.nz.dbu.doc/c_dbuser_wi ndow_aggregation_family_syntax.html). We propose to support it in this JIRA. # Introduction Below is the ANSI SQL2003’s `Window` syntax: ``` FUNCTION_NAME(expr) OVER {window_name | (window_specification)} window_specification ::= [window_name] [partitioning] [ordering] [framing] partitioning ::= PARTITION BY value[, value...] [COLLATE collation_name] ordering ::= ORDER [SIBLINGS] BY rule[, rule...] rule ::= {value | position | alias} [ASC | DESC] [NULLS {FIRST | LAST}] framing ::= {ROWS | RANGE} {start | between} [exclusion] start ::= {UNBOUNDED PRECEDING | unsigned-integer PRECEDING | CURRENT ROW} between ::= BETWEEN bound AND bound bound ::= {start | UNBOUNDED FOLLOWING | unsigned-integer FOLLOWING} exclusion ::= {EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS} ``` Exclusion clause can be used to excluded certain rows from the window framing when calculating window aggregation function (e.g. AVG, SUM, MAX, MIN, COUNT, etc) related to current row. Types of window functions that are not supported are listed below: 1. Offset functions, such as lead(), lag() 2. Ranking functions, such as rank(), dense_rank(), percent_rank(), cume_dist, ntile() 3. Row number function, such as row_number() # Definition Syntax | Description | - EXCLUDE CURRENT ROW | Specifies excluding the current row. EXCLUDE GROUP | Specifies excluding the current row and all rows that are tied with it. Ties occur when there is a match on the order column or columns. EXCLUDE NO OTHERS | Specifies not excluding any rows. This value is the default if you specify no exclusion. EXCLUDE TIES | Specifies excluding all rows that are tied with the current row (peer rows), but retaining the current row. # Use-case Examples: - Let's say you want to find out for every employee, where is his/her salary at compared to the average salary of those within the same department and whose ages are within 5 years younger and older. The query could be: ```SQL SELECT NAME, DEPT_ID, SALARY, AGE, AVG(SALARY) AS AVG_WITHIN_5_YEAR OVER(PARTITION BY DEPT_ID ORDER BY AGE RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING EXCLUDE CURRENT ROW) FROM EMPLOYEE ``` - Let's say you want to compare every customer's yearly purchase with other customers' average yearly purchase who are at different age group from the current customer. The query could be: ```SQL SELECT CUST_NAME, AGE, PROD_CATEGORY, YEARLY_PURCHASE, AVG(YEARLY_PURCHASE) OVER(PARTITION BY PROD_CATEGORY ORDER BY AGE RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUND FOLLOWING EXCLUDE GROUP) FROM CUSTOMER_PURCHASE_SUM ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14346) SHOW CREATE TABLE command (Native)
[ https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-14346: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-14118 > SHOW CREATE TABLE command (Native) > -- > > Key: SPARK-14346 > URL: https://issues.apache.org/jira/browse/SPARK-14346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > This command will return a CREATE TABLE command in SQL. Right now, we just > throw exception (I was not sure how often people will use it). Since it is a > pretty standalone work (generating a CREATE TABLE command based on the > metadata of a table) and people may find it pretty useful, I am thinking to > get it in 2.0. Hive's implementation can be found at > https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126. > The main difference for spark is that if we have a data source table, we > should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of > Hive's syntax. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14346) SHOW CREATE TABLE command (Native)
Xin Wu created SPARK-14346: -- Summary: SHOW CREATE TABLE command (Native) Key: SPARK-14346 URL: https://issues.apache.org/jira/browse/SPARK-14346 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu This command will return a CREATE TABLE command in SQL. Right now, we just throw an exception (I was not sure how often people will use it). Since it is a fairly standalone piece of work (generating a CREATE TABLE command based on the metadata of a table) and people may find it pretty useful, I am thinking to get it in 2.0. Hive's implementation can be found at https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126. The main difference for Spark is that if we have a data source table, we should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of Hive's syntax. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
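A sketch of the intended user-facing behaviour, to make the scope concrete. The command is not implemented yet, so both the invocation and the shape of the result below are assumptions rather than actual output.
{code}
-- Illustrative only: neither the native command nor the exact output exists yet in Spark.
SHOW CREATE TABLE t1;

-- For a data source table the generated DDL should come back in Spark's own syntax, e.g.
--   CREATE TABLE `t1` (`a` INT, `b` INT) USING parquet OPTIONS (`path` '/tmp/t1')
-- whereas a Hive table would be rendered as Hive DDL (CREATE TABLE ... STORED AS ...).
{code}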
[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210978#comment-15210978 ] Xin Wu commented on SPARK-14096: After I comment out the kryo serialization setting in SparkSQLEnv.init, that is used by spark-sql console. the query returns without NPE. When kryo serialization is used, the query fails when ORDER BY and LIMIT is combined. After removing either ORDER BY or LIMIT clause, the query also runs. > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209748#comment-15209748 ] Xin Wu commented on SPARK-13832: [~jfc...@us.ibm.com] For the above execution issue, i think it is a duplicate of SPARK-14096. I think you can close this JIRA and refer to SPARK-14096 for the kyro exception issue. Thanks! > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209737#comment-15209737 ] Xin Wu commented on SPARK-14096: I simplied the query to: {code}select * from item order by i_item_id limit 100;{code} And it fails with exception:{code} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) at java.util.PriorityQueue.offer(PriorityQueue.java:344) at java.util.PriorityQueue.add(PriorityQueue.java:321) at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) {code} And removing either "ORDER BY" or "LIMIT" clause will pass.. > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > 
at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all
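Putting the comments together, the failing shape is specifically ORDER BY combined with LIMIT, which plans a top-K (the BoundedPriorityQueue visible in the trace) whose generated ordering does not survive Kryo deserialization. A quick way to confirm this on a build where the spark-sql CLI enables Kryo is the small repro matrix below; table and column names follow the TPC-DS schema used above.
{code}
-- Repro matrix distilled from the comments above (run in spark-sql with Kryo enabled).
select * from item order by i_item_id limit 100;  -- fails with the Kryo NPE
select * from item order by i_item_id;            -- passes
select * from item limit 100;                     -- passes
{code}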
[jira] [Commented] (SPARK-14096) SPARK-SQL CLI returns NPE
[ https://issues.apache.org/jira/browse/SPARK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209688#comment-15209688 ] Xin Wu commented on SPARK-14096: [~jfc...@us.ibm.com] Can you try the query without ORDER BY? I noticed another query failed with the ORDER BY while succeeded without the ORDER BY. > SPARK-SQL CLI returns NPE > - > > Key: SPARK-14096 > URL: https://issues.apache.org/jira/browse/SPARK-14096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN > > Trying to run TPCDS query 06 in spark-sql shell received the following error > in the middle of a stage; but running another query 38 succeeded: > NPE: > {noformat} > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 65.0 in stage > 10.0 (TID 622) in 171 ms on localhost (30/200) > 16/03/22 15:12:56 ERROR scheduler.TaskResultGetter: Exception while getting > task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1790) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > ... 
15 more > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > 16/03/22 15:12:56 INFO scheduler.TaskSetManager: Finished task 66.0 in stage > 10.0 (TID 623) in 171 ms on localhost (31/200) > 16/03/22 15:12:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, > whose tasks have all completed, from pool > {noformat} > query 06 (caused the above NPE): > {noformat} > select a.ca_state state, count(*) cnt > from customer_address a > join customer c on a.ca_address_sk = c.c_current_addr_sk > join store_sales s on c.c_customer_sk = s.ss_customer_sk > join date_dim d on s.ss_sold_date_sk = d.d_date_sk > join item i on s.ss_item_sk = i.i_item_sk > join (select distinct d_month_seq > from date_dim >where d_year = 2001 > and d_moy = 1 ) tmp1 ON d.d_month_seq = tmp1.d_month_seq > join > (select j.i_category, avg(j.i_current_price) as avg_i_current_price >from item j group by j.i_category) tmp2 on tmp2.i_category = > i.i_category > where > i.i_current_price > 1.2 *
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209687#comment-15209687 ] Xin Wu commented on SPARK-13832: The analysis issue reported in this JIRA is resolved in Spark 2.0. As for the Kryo exception during execution, the query can return without the ORDER BY, so I am also looking into why the ORDER BY clause triggers this Kryo exception. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
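A simple way to isolate what the comment above describes is to run the same aggregation with and without the final ordering. The sketch below is a hypothetical, cut-down query shape (not query 36 or query 06 themselves), assuming a TPC-DS-like store_sales table; the stack trace quoted earlier in the thread fails in the ORDER BY ... LIMIT path, which deserializes a BoundedPriorityQueue of top rows.
{code}
-- Run 1: aggregation only (the thread reports that queries of this shape return normally).
select ss_store_sk, sum(ss_net_profit) as profit
from store_sales
group by ss_store_sk
limit 100;

-- Run 2: identical aggregation plus ORDER BY; if only this variant fails, the
-- ORDER BY ... LIMIT (BoundedPriorityQueue / Kryo deserialization) path is implicated.
select ss_store_sk, sum(ss_net_profit) as profit
from store_sales
group by ss_store_sk
order by profit
limit 100;
{code}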
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209220#comment-15209220 ] Xin Wu commented on SPARK-13832: [~jfc...@us.ibm.com] I think when using grouping_id(), you need to pass in all the columns that are in the group by clause. In this case, it will be grouping_id(i_category, i_class). The result is like concatenating results of grouping() into a bit vector (a string of ones and zeros), such as grouping(i_category)+grouping(i_class) So {code}grouping_id(i_category)+grouping_id(i_class){code} is not correct. After I changed to use {code}grouping_id(i_category, i_class){code}, the query returns for the text data files.. I am trying for the parquet files now. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
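A compact illustration of the rule described above, as a sketch over a hypothetical pre-joined input rather than the full query 36: grouping_id must name every GROUP BY column in one call, and its value is the bit vector built from the per-column grouping() flags.
{code}
-- Sketch: store_sales_joined is a hypothetical pre-joined view standing in for the multi-table join.
select i_category,
       i_class,
       grouping_id(i_category, i_class) as lochierarchy,   -- one call listing all GROUP BY columns
       sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
from store_sales_joined
group by i_category, i_class with rollup;
-- By contrast, grouping_id(i_category) + grouping_id(i_class) is not equivalent:
-- each grouping_id call must match the full grouping column list.
{code}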
[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200053#comment-15200053 ] Xin Wu commented on SPARK-13863: Jesse, after I modified the DDL to use "decimal(7,2)" for the "double" colums as documented in the tpc-ds specs and the query return the following results both from Hive and Spark SQL: Spark SQL: {code} NULLNULLFairviewWilliamson County TN United States DHL,BARIAN 20019597806.95 11121820.57 8670867.91 8994786.04 10887248.09 14187671.36 9732598.41 19798897.07 21007842.34 21495513.67 34795669.17 33122997.94 NULLNULL NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL 21913594.59 32518476.51 24885662.72 25698343.86 33735910.61 35527031.58 25465193.48 53623238.66 51409986.76 54159173.9 92227043.25 83435390.84 Bad cards must make.621234 FairviewWilliamson County TN United States DHL,BARIAN 20019506753.46 8008140.33 6116769.63 11973045.15 7756254.92 5352978.49 13733996.1 16418794.37 17212743.32 17042707.41 34304935.61 35324164.21 15.303015385507 12.89069871 9.846160432301 19.273003650798 12.485238927683 8.616686288902 22.107605346777 26.429323523825 27.707342676029 27.433635972918 55.220634430827 56.861286101534 30534943.77 24481685.94 22178710.81 25695798.18 29954903.78 18084140.05 30805576.13 47156887.22 51158588.86 55759942.8 86253544.16 83451555.63 Conventional childr 977787 FairviewWilliamson County TN United States DHL,BARIAN 20018860645.55 14415813.74 6761497.23 11820654.76 8246260.69 6636877.49 11434492.25 25673812.14 23074206.96 21834581.94 26894900.53 33575091.74 9.061938387399 14.743306814265 6.915102399603 12.089191981484 8.433596161537 6.787651594877 11.694256775759 26.257060218637 23.598398178745 22.330611820366 27.50538776 34.337838138572 23836085.83 32073313.37 25037904.18 22659895.86 21757401.03 24451608.1 21933001.85 55996703.43 57371880.44 62087214.51 82849910.15 88970319.31 Doors canno 294242 FairviewWilliamson County TN United States DHL,BARIAN 20016355232.31 10198920.36 10246200.97 12209716.5 8566998.28 8806316.81 9789405.6 16466584.88 26443785.61 27016047.8 33660589.67 27462468.62 21.598657941422 34.66167426812 34.822360404021 41.495491806065 29.115484125312 29.928823247531 33.269912520986 55.962727550791 89.870873668613 91.815742823934 114.39763755684193.332932144289 22645143.09 24487254.6 24925759.42 30503655.27 26558160.29 20976233.52 29895796.09 56002198.38 53488158.53 76287235.46 82483747.59 88088266.69 Important issues liv138504 FairviewWilliamson County TN United States DHL,BARIAN 200111748784.55 14351305.77 9896470.93 7990874.78 8879247.9 7362383.09 10011144.75 17741201.32 21346976.05 18074978.16 29675125.64 32545325.29 84.826319456478 103.61654370992971.452600141512 57.694180529082 64.108241639231 53.156465445041 72.280546049212 128.091616993011 154.12533970138 130.501488476867214.254647086005 234.97751176861427204167.15 25980378.13 19943398.93 25710421.13 19484481.03 26346611.48 25075158.43 54094778.13 41066732.11 54547058.28 72465962.92 92770328.27 {code} Hive: {code} NULLNULLFairviewWilliamson County TN United States DHL,BARIAN 20019597806.95 11121820.57 8670867.91 8994786.04 10887248.09 14187671.36 9732598.41 19798897.07 21007842.34 21495513.67 34795669.17 33122997.94 NULLNULL NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL 21913594.59 32518476.51 24885662.72 25698343.86 33735910.61 35527031.58 25465193.48 53623238.66 51409986.76 54159173.9 92227043.25 83435390.84 Bad cards must make.621234 FairviewWilliamson County TN United States DHL,BARIAN 20019506753.46 8008140.33 6116769.63 11973045.15 
7756254.92 5352978.49 13733996.1 16418794.37 17212743.32 17042707.41 34304935.61 35324164.21 15.303015385507 12.89069871 9.846160432301 19.273003650798 12.485238927683 8.616686288902 22.107605346777 26.429323523825 27.707342676029 27.433635972918 55.220634430827 56.861286101534 30534943.77
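The DDL change described above replaces double with the decimal(7,2) precision that the TPC-DS specification assigns to monetary columns. A minimal sketch, assuming a web_sales-style table with an abbreviated, illustrative column list (not the full schema):
{code}
-- Before: monetary columns declared as double, which yields long binary-float aggregates
-- such as 6355232.185385704 instead of 6355232.31.
-- After: per the TPC-DS column definitions.
create table web_sales_dec (
  ws_sold_date_sk     int,
  ws_item_sk          int,
  ws_ext_sales_price  decimal(7,2),
  ws_net_paid         decimal(7,2)
  -- remaining columns omitted for brevity
);
{code}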
[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200160#comment-15200160 ] Xin Wu commented on SPARK-13863: In terms of the ordering. the only difference is that the row with Null value for the order by column (w_warehouse_name) is placed at the top for HIve and Spark SQL, while the expected result has it at the bottom. Other rows are OK. So the it seems the expected results have NULL row in the wrong place. > TPCDS query 66 returns wrong results compared to TPC official result set > - > > Key: SPARK-13863 > URL: https://issues.apache.org/jira/browse/SPARK-13863 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 66 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > Aggregations slightly off -- eg. JAN_SALES column of "Doors canno" row - > SparkSQL returns 6355232.185385704, expected 6355232.31 > Actual results: > {noformat} > [null,null,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7] > [Bad cards must make.,621234,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7] > [Conventional childr,977787,Fairview,Williamson County,TN,United > States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7] > [Doors canno,294242,Fairview,Williamson County,TN,United > 
States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7] > [Important issues liv,138504,Fairview,Williamson County,TN,United >
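The placement difference described in the comment that follows comes from the default ascending sort putting NULL first in both engines, while the official answer set lists the NULL row last. A minimal sketch, assuming the TPC-DS warehouse table; the explicit NULLS LAST syntax is an assumption about the reader's engine (it is not available in Spark 1.6):
{code}
-- Default ascending sort in Spark SQL and Hive: the row with a NULL w_warehouse_name sorts first.
select w_warehouse_name, w_warehouse_sq_ft
from warehouse
order by w_warehouse_name asc;

-- On engines that accept explicit null ordering, the official result-set order
-- (NULL row last) can be requested directly:
select w_warehouse_name, w_warehouse_sq_ft
from warehouse
order by w_warehouse_name asc nulls last;
{code}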
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202412#comment-15202412 ] Xin Wu edited comment on SPARK-13832 at 3/19/16 12:29 AM: -- Jesse, you are right.. With "grouping" function, the query hits the {code}com.esotericsoftware.kryo.KryoException{code}, even with text data file. So this case, we passed the analyzer. With grouping_id on column i_category, the query hits the analyzer issue. {code}Error in query: Columns of grouping_id...{code} I will continue digging in. was (Author: xwu0226): Jesse, you are right.. With "grouping" function, the query hits the {code}com.esotericsoftware.kryo.KryoException{code}, even thought with text data file. So this case, we passed the analyzer. With grouping_id on column i_category, the query hits the analyzer issue. {code}Error in query: Columns of grouping_id...{code} I will continue digging in. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202412#comment-15202412 ] Xin Wu commented on SPARK-13832: Jesse, you are right.. With "grouping" function, the query hits the {code}com.esotericsoftware.kryo.KryoException{code}, even thought with text data file. So this case, we passed the analyzer. With grouping_id on column i_category, the query hits the analyzer issue. {code}Error in query: Columns of grouping_id...{code} I will continue digging in. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202264#comment-15202264 ] Xin Wu commented on SPARK-13832: What I meant is that in Spark 2.0, it seems that "grouping__id" is deprecated and grouping_id() is used, so I needed to change this to proceed. But after the query is parsed, the AnalysisException you reported in this JIRA {code}"org.apache.spark.sql.AnalysisException: expression 'i_category'..."{code} is not reproducible. As for the later execution error, I am still validating whether it is related to the data or to a Spark SQL execution issue, but this is not a parser or analyzer error. In 1.6, the AnalysisException is reproducible; this is no longer the issue in 2.0. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
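A minimal before/after sketch of the change just described (the full query 36 is quoted in the issue text; the single-table input here is a hypothetical stand-in for the multi-table FROM clause):
{code}
-- Spark 1.6 / Hive accept the virtual column:
--   grouping__id as lochierarchy
-- Spark 2.0 expects the function form instead:
select i_category,
       i_class,
       grouping_id() as lochierarchy,
       sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin
from store_sales_joined   -- hypothetical pre-joined input
group by i_category, i_class with rollup;
{code}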
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/16/16 12:39 AM: -- Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error related to com.esotericsoftware.kryo.KryoException was (Author: xwu0226): Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error maybe related to spark-13862. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/15/16 10:43 PM: -- Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error related to spark-13862. was (Author: xwu0226): Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is gone.. Except that I saw execution error related to kryo.serializers.. that should be a different issue and maybe related to my setup. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13832) TPC-DS Query 36 fails with Parser error
[ https://issues.apache.org/jira/browse/SPARK-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196377#comment-15196377 ] Xin Wu edited comment on SPARK-13832 at 3/15/16 10:44 PM: -- Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error maybe related to spark-13862. was (Author: xwu0226): Trying this query in Spark 2.0 and I needed to change grouping__id to grouping_id() to pass the parser. The reported error is not reproducible in spark 2.0.. Except that I saw execution error related to spark-13862. > TPC-DS Query 36 fails with Parser error > --- > > Key: SPARK-13832 > URL: https://issues.apache.org/jira/browse/SPARK-13832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS query 36 fails with the following error > Analyzer error: 16/02/28 21:22:51 INFO parse.ParseDriver: Parse Completed > Exception in thread "main" org.apache.spark.sql.AnalysisException: expression > 'i_category' is neither present in the group by, nor is it an aggregate > function. Add to group by or wrap in first() (or first_value) if you don't > care which value you get.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > Query Text pasted here for quick reference. > select > sum(ss_net_profit)/sum(ss_ext_sales_price) as gross_margin >,i_category >,i_class >,grouping__id as lochierarchy >,rank() over ( > partition by grouping__id, > case when grouping__id = 0 then i_category end > order by sum(ss_net_profit)/sum(ss_ext_sales_price) asc) as > rank_within_parent > from > store_sales >,date_dim d1 >,item >,store > where > d1.d_year = 2001 > and d1.d_date_sk = ss_sold_date_sk > and i_item_sk = ss_item_sk > and s_store_sk = ss_store_sk > and s_state in ('TN','TN','TN','TN', > 'TN','TN','TN','TN') > group by i_category,i_class WITH ROLLUP > order by >lochierarchy desc > ,case when lochierarchy = 0 then i_category end > ,rank_within_parent > limit 100; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org