[jira] [Commented] (SPARK-43865) spark cluster deploy mode cannot initialize metastore java.sql.SQLException: No suitable driver found for jdbc:mysql
[ https://issues.apache.org/jira/browse/SPARK-43865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728234#comment-17728234 ]

pin_zhang commented on SPARK-43865:
-----------------------------------

It is not convenient to upload the jar to all worker nodes, because the jars are distributed dynamically according to configuration.

> spark cluster deploy mode cannot initialize metastore java.sql.SQLException:
> No suitable driver found for jdbc:mysql
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-43865
>                 URL: https://issues.apache.org/jira/browse/SPARK-43865
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: pin_zhang
>            Priority: Major
>
> 1. Test with JDK 11 + Spark 3.4.0:
>
>     object BugHS {
>       def main(args: Array[String]): Unit = {
>         val conf = new SparkConf()
>         conf.set("javax.jdo.option.ConnectionURL", "jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false")
>         conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
>         conf.set("javax.jdo.option.ConnectionUserName", "**")
>         conf.set("javax.jdo.option.ConnectionPassword", "**")
>         conf.set("spark.sql.hive.thriftServer.singleSession", "false")
>         conf.set("spark.sql.warehouse.dir", "hdfs://hadoop/warehouse_spark3")
>         import org.apache.spark.sql.SparkSession
>         val spark = SparkSession
>           .builder()
>           .appName("Test").config(conf).enableHiveSupport()
>           .getOrCreate()
>         HiveThriftServer2.startWithContext(spark.sqlContext)
>         spark.sql("create table IF NOT EXISTS test2 (id int) USING parquet")
>       }
>     }
>
> 2. Submit in cluster mode:
>    a. spark_config.properties:
>       spark.master=spark://master:6066
>       spark.jars=hdfs://hadoop/tmp/test_bug/mysql-connector-java-5.1.47.jar
>       spark.master.rest.enabled=true
>    b. spark-submit2.cmd --deploy-mode cluster --properties-file spark_config.properties --class com.test.BugHS "hdfs://hadoop/tmp/test_bug/bug_classloader.jar"
>
> 3. The "No suitable driver found" exception is thrown, because on JDK 11 the JDBC driver shipped via spark.jars and the metastore jars are loaded by different classloaders:
>
>    java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
>    java.sql.SQLException: No suitable driver found for jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false
>        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:702)
>        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
>        at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
>        at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
>        at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
>        at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:483)
>        at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:297)
>        at jdk.internal.reflect.GeneratedConstructorAccessor77.newInstance(Unknown Source)
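For context on the classloader mismatch above: java.sql.DriverManager only returns drivers whose class is visible to the calling code's classloader, so a driver jar shipped through spark.jars (loaded by a child classloader) can stay invisible to the metastore layer. A commonly used workaround is to register a delegating shim driver from application code before Hive initializes. The sketch below takes only the driver class name from the report; DriverShim is a hypothetical name, and this is a sketch of the well-known pattern, not a confirmed fix for this ticket:

    import java.sql.{Connection, Driver, DriverManager, DriverPropertyInfo}
    import java.util.Properties
    import java.util.logging.Logger

    // The shim class itself is loaded by a classloader DriverManager can see;
    // every call is forwarded to the real driver instance loaded from the jar
    // that arrived via spark.jars.
    class DriverShim(delegate: Driver) extends Driver {
      override def connect(url: String, info: Properties): Connection = delegate.connect(url, info)
      override def acceptsURL(url: String): Boolean = delegate.acceptsURL(url)
      override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
        delegate.getPropertyInfo(url, info)
      override def getMajorVersion(): Int = delegate.getMajorVersion
      override def getMinorVersion(): Int = delegate.getMinorVersion
      override def jdbcCompliant(): Boolean = delegate.jdbcCompliant()
      override def getParentLogger(): Logger = delegate.getParentLogger
    }

    // Load the real driver through the context classloader (which sees spark.jars
    // in cluster mode) and register the shim in its place, before
    // enableHiveSupport() triggers metastore initialization.
    val real = Class.forName("com.mysql.jdbc.Driver", true, Thread.currentThread.getContextClassLoader)
      .getDeclaredConstructor().newInstance().asInstanceOf[Driver]
    DriverManager.registerDriver(new DriverShim(real))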
[jira] [Updated] (SPARK-43865) spark cluster deploy mode cannot initialize metastore java.sql.SQLException: No suitable driver found for jdbc:mysql
[ https://issues.apache.org/jira/browse/SPARK-43865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-43865:
------------------------------
    Description:

1. Test with JDK 11 + Spark 3.4.0:

     object BugHS {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
         conf.set("javax.jdo.option.ConnectionURL", "jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false")
         conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
         conf.set("javax.jdo.option.ConnectionUserName", "**")
         conf.set("javax.jdo.option.ConnectionPassword", "**")
         conf.set("spark.sql.hive.thriftServer.singleSession", "false")
         conf.set("spark.sql.warehouse.dir", "hdfs://hadoop/warehouse_spark3")
         import org.apache.spark.sql.SparkSession
         val spark = SparkSession
           .builder()
           .appName("Test").config(conf).enableHiveSupport()
           .getOrCreate()
         HiveThriftServer2.startWithContext(spark.sqlContext)
         spark.sql("create table IF NOT EXISTS test2 (id int) USING parquet")
       }
     }

2. Submit in cluster mode:
   a. spark_config.properties:
      spark.master=spark://master:6066
      spark.jars=hdfs://hadoop/tmp/test_bug/mysql-connector-java-5.1.47.jar
      spark.master.rest.enabled=true
   b. spark-submit2.cmd --deploy-mode cluster --properties-file spark_config.properties --class com.test.BugHS "hdfs://hadoop/tmp/test_bug/bug_classloader.jar"

3. The "No suitable driver found" exception is thrown, because on JDK 11 the JDBC driver shipped via spark.jars and the metastore jars are loaded by different classloaders:

   java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
   java.sql.SQLException: No suitable driver found for jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false
       at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:702)
       at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
       at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
       at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
       at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
       at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:483)
       at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:297)
       at jdk.internal.reflect.GeneratedConstructorAccessor77.newInstance(Unknown Source)

  was:

1. Test with JDK 11 + Spark 3.4.0 (same BugHS code as above).

3. Submit in cluster mode:
   spark.master=spark\://10.111.7.150\:6066
   spark.jars=hdfs\://10.111.7.150\:8020/tmp/test_bug/mysql-connector-java-5.1.47.jar
   spark.master.rest.enabled=true
[jira] [Updated] (SPARK-43865) spark cluster deploy mode cannot initialize metastore java.sql.SQLException: No suitable driver found for jdbc:mysql
[ https://issues.apache.org/jira/browse/SPARK-43865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-43865:
------------------------------
    Description:

1. Test with JDK 11 + Spark 3.4.0:

     object BugHS {
       def main(args: Array[String]): Unit = {
         val conf = new SparkConf()
         conf.set("javax.jdo.option.ConnectionURL", "jdbc:mysql://mysql:3306/hive_ms_spark3?useSSL=false")
         conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
         conf.set("javax.jdo.option.ConnectionUserName", "**")
         conf.set("javax.jdo.option.ConnectionPassword", "**")
         conf.set("spark.sql.hive.thriftServer.singleSession", "false")
         conf.set("spark.sql.warehouse.dir", "hdfs://hadoop/warehouse_spark3")
         import org.apache.spark.sql.SparkSession
         val spark = SparkSession
           .builder()
           .appName("Test").config(conf).enableHiveSupport()
           .getOrCreate()
         HiveThriftServer2.startWithContext(spark.sqlContext)
         spark.sql("create table IF NOT EXISTS test2 (id int) USING parquet")
       }
     }

3. Submit in cluster mode:
   spark.master=spark\://10.111.7.150\:6066
   spark.jars=hdfs\://10.111.7.150\:8020/tmp/test_bug/mysql-connector-java-5.1.47.jar
   spark.master.rest.enabled=true
[jira] [Created] (SPARK-43865) spark cluster deploy mode cannot initialize metastore java.sql.SQLException: No suitable driver found for jdbc:mysql
pin_zhang created SPARK-43865:
------------------------------

             Summary: spark cluster deploy mode cannot initialize metastore java.sql.SQLException: No suitable driver found for jdbc:mysql
                 Key: SPARK-43865
                 URL: https://issues.apache.org/jira/browse/SPARK-43865
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: pin_zhang
[jira] [Updated] (SPARK-41168) Spark Master OOM when Worker No space left on device
[ https://issues.apache.org/jira/browse/SPARK-41168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-41168:
------------------------------
    Description:

Spark master + 2 Spark workers (one of them has no space left on device).

1. Submit an app with 2 instances.
2. Stop the good worker.
3. The Spark master launches executors continuously; the bad worker cannot create the executor folder.

This results in a large number of executors kept in Spark master memory:

2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93441 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93441 because it is FAILED
2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93442 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93442 because it is FAILED

  was:

Spark master + 2 Spark workers (one of them has no space left on device).

1. Submit an app with 2 instances.
2. Stop the good worker.
3. The Spark master launches executors continuously.

This results in a large number of executors kept in Spark master memory:
(same log lines as above)
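In standalone mode there is a master-side guard aimed at exactly this launch-and-fail loop: spark.deploy.maxExecutorRetries makes the master remove an application after that many back-to-back executor failures, but it only applies while the application has no running executor, which is why stopping the good worker in step 2 matters here. A properties sketch in the style of the repro's configuration; whether it fully prevents the reported memory growth is not confirmed:

    # Cap consecutive executor failures so the standalone master removes the
    # application instead of relaunching forever. 10 is the documented default,
    # written out explicitly for illustration.
    spark.deploy.maxExecutorRetries=10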
[jira] [Updated] (SPARK-41168) Spark Master OOM when Worker No space left on device
[ https://issues.apache.org/jira/browse/SPARK-41168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-41168:
------------------------------
    Description:

Spark master + 2 Spark workers (one of them has no space left on device).

1. Submit an app with 2 instances.
2. Stop the good worker.
3. The Spark master launches executors continuously.

This results in a large number of executors kept in Spark master memory:

2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93441 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93441 because it is FAILED
2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93442 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93442 because it is FAILED

  was:

Spark master + 2 Spark workers (one is )
Submit an app with 2 instances.
1. Spark worker SPARK-41168
Caused by a large number of executors kept in Spark master memory:
(same log lines as above)
[jira] [Updated] (SPARK-41168) Spark Master OOM when Worker No space left on device
[ https://issues.apache.org/jira/browse/SPARK-41168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-41168:
------------------------------
    Description:

Spark master + 2 Spark workers (one is )
Submit an app with 2 instances.
1. Spark worker SPARK-41168
Caused by a large number of executors kept in Spark master memory:

2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93441 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93441 because it is FAILED
2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93442 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93442 because it is FAILED

  was:

1. Spark worker
Caused by a large number of executors kept in Spark master memory:
(same log lines as above)
[jira] [Updated] (SPARK-41168) Spark Master OOM when Worker No space left on device
[ https://issues.apache.org/jira/browse/SPARK-41168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-41168:
------------------------------
    Description:

1. Spark worker
Caused by a large number of executors kept in Spark master memory:

2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93441 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93441 because it is FAILED
2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93442 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93442 because it is FAILED

  was:

Caused by a large number of executors kept in Spark master memory:
(same log lines as above)
[jira] [Created] (SPARK-41168) Spark Master OOM when Worker No space left on device
pin_zhang created SPARK-41168:
------------------------------

             Summary: Spark Master OOM when Worker No space left on device
                 Key: SPARK-41168
                 URL: https://issues.apache.org/jira/browse/SPARK-41168
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 2.3.1
            Reporter: pin_zhang

Caused by a large number of executors kept in Spark master memory:

2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93441 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93441 because it is FAILED
2022-11-15 22:21:43 INFO Master:54 - Launching executor app-20221115221952-0016/93442 on worker worker-20221115202400-10.111.1.10-40011
2022-11-15 22:21:43 INFO Master:54 - Removing executor app-20221115221952-0016/93442 because it is FAILED
[jira] [Updated] (SPARK-33946) Cannot connect to spark hive after session timeout
[ https://issues.apache.org/jira/browse/SPARK-33946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-33946:
------------------------------
    Description:

Test with the following Hive settings:
 * hive.server2.idle.session.timeout=60000
 * hive.server2.session.check.interval=1000
 * hive.server2.thrift.max.worker.threads=2
 * hive.server2.thrift.min.worker.threads=1

1. Connect with user test1 and hold the connection.
2. Wait 1 minute, then connect with user test2.
3. On the web UI http://localhost:4040/sqlserver/, wait until the session for test1 is finished.
4. Connect with user test3; the connection is refused:

Task has been rejected by ExecutorService 10 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess@71b1ea28 rejected from java.util.concurrent.ThreadPoolExecutor@716d9872[Running, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 2]

It seems the session was not removed, although the UI shows a finished time for the session.

  was:

Test with the following Hive settings:
 * hive.server2.idle.session.timeout=60000
 * hive.server2.session.check.interval=1000
 * hive.server2.thrift.max.worker.threads=2
 * hive.server2.thrift.min.worker.threads=1

1. Connect with user test1 and hold the connection.
2. Wait 1 minute, then connect with user test2.
3. On the web UI http://localhost:4040/sqlserver/, the session for test1 is finished.
4. Connect with user test3; the connection is refused:
(same rejection message as above)

It seems the session was not removed, although the UI shows a finished time for the session.
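For reference, the connect-and-hold sequence in steps 1-4 can be scripted directly against the thrift server. A minimal sketch, assuming the default HiveServer2 port 10000 and empty passwords (both assumptions):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Step 1: test1 connects and holds its connection open.
    val c1 = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "test1", "")

    // Step 2: after the idle-session timeout (1 minute here) has elapsed,
    // test2 connects, which lets the session checker time out test1's session.
    Thread.sleep(60 * 1000L)
    val c2 = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "test2", "")

    // Step 4: with hive.server2.thrift.max.worker.threads=2 and the timed-out
    // session's worker thread never released, this third connection is rejected
    // by the thread pool with the RejectedExecutionException quoted above.
    val c3 = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "test3", "")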
[jira] [Created] (SPARK-33946) Cannot connect to spark hive after session timeout
pin_zhang created SPARK-33946:
------------------------------

             Summary: Cannot connect to spark hive after session timeout
                 Key: SPARK-33946
                 URL: https://issues.apache.org/jira/browse/SPARK-33946
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: pin_zhang

Test with the following Hive settings:
 * hive.server2.idle.session.timeout=60000
 * hive.server2.session.check.interval=1000
 * hive.server2.thrift.max.worker.threads=2
 * hive.server2.thrift.min.worker.threads=1

1. Connect with user test1 and hold the connection.
2. Wait 1 minute, then connect with user test2.
3. On the web UI http://localhost:4040/sqlserver/, the session for test1 is finished.
4. Connect with user test3; the connection is refused:

Task has been rejected by ExecutorService 10 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess@71b1ea28 rejected from java.util.concurrent.ThreadPoolExecutor@716d9872[Running, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 2]

It seems the session was not removed, although the UI shows a finished time for the session.
[jira] [Commented] (SPARK-25804) JDOPersistenceManager leak when query via JDBC
[ https://issues.apache.org/jira/browse/SPARK-25804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070938#comment-17070938 ]

pin_zhang commented on SPARK-25804:
-----------------------------------

Any comments on this issue?

> JDOPersistenceManager leak when query via JDBC
> ----------------------------------------------
>
>                 Key: SPARK-25804
>                 URL: https://issues.apache.org/jira/browse/SPARK-25804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
>         Attachments: image-2018-10-27-01-44-07-972.png
>
> 1. start-thriftserver.sh under Spark 2.3.1
> 2. Create a table and insert values:
>      create table test_leak (id string, index int);
>      insert into test_leak values('id1',1);
> 3. Create a JDBC client that queries the table:
>
>      import java.sql.*;
>      public class HiveClient {
>          public static void main(String[] args) throws Exception {
>              String driverName = "org.apache.hive.jdbc.HiveDriver";
>              Class.forName(driverName);
>              Connection con = DriverManager.getConnection(
>                      "jdbc:hive2://localhost:10000/default", "test", "test");
>              Statement stmt = con.createStatement();
>              String sql = "select * from test_leak";
>              int loop = 100000;
>              while (loop-- > 0) {
>                  ResultSet rs = stmt.executeQuery(sql);
>                  rs.next();
>                  System.out.println(new java.sql.Timestamp(System.currentTimeMillis()) + " : " + rs.getString(1));
>                  rs.close();
>                  if (loop % 100 == 0) {
>                      Thread.sleep(1000);
>                  }
>              }
>              con.close();
>          }
>      }
>
> 4. Dump the HS2 heap: org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing.
[jira] [Commented] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
[ https://issues.apache.org/jira/browse/SPARK-29423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949187#comment-16949187 ]

pin_zhang commented on SPARK-29423:
-----------------------------------

The same result on Spark 2.4.3.

> leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29423
>                 URL: https://issues.apache.org/jira/browse/SPARK-29423
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
>
> 1. Start the server with start-thriftserver.sh
> 2. A JDBC client connects and disconnects to HiveServer2 repeatedly:
>
>      for (int i = 0; i < 10000; i++) {
>          Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "test", "");
>          conn.close();
>      }
>
> 3. Instances of org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep increasing.
[jira] [Created] (SPARK-29423) leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
pin_zhang created SPARK-29423:
------------------------------

             Summary: leak on org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus
                 Key: SPARK-29423
                 URL: https://issues.apache.org/jira/browse/SPARK-29423
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: pin_zhang

1. Start the server with start-thriftserver.sh
2. A JDBC client connects and disconnects to HiveServer2 repeatedly:

     for (int i = 0; i < 10000; i++) {
         Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "test", "");
         conn.close();
     }

3. Instances of org.apache.spark.sql.execution.streaming.StreamingQueryListenerBus keep increasing.
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863796#comment-16863796 ]

pin_zhang commented on SPARK-21067:
-----------------------------------

We also encounter this issue. Is there any plan to fix this bug?

> Thrift Server - CTAS fail with Unable to move source
> -----------------------------------------------------
>
>                 Key: SPARK-21067
>                 URL: https://issues.apache.org/jira/browse/SPARK-21067
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0, 2.4.0
>        Environment: Yarn
>                     Hive MetaStore
>                     HDFS (HA)
>            Reporter: Dominic Ricard
>            Priority: Major
>         Attachments: SPARK-21067.patch
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS would fail, sometimes...
> Most of the time, the CTAS would work only once after starting the thrift server. After that, dropping the table and re-issuing the same CTAS would fail with the following message (sometimes it fails right away, sometimes it works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-10000/part-00000 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-00000; (state=,code=0)
> {noformat}
> We have already found the following Jira (https://issues.apache.org/jira/browse/SPARK-11021) which states that the {{hive.exec.stagingdir}} had to be added in order for Spark to be able to handle CREATE TABLE properly as of 2.0. As you can see in the error, we have ours set to "/tmp/hive-staging/{user.name}".
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-10000/part-00000 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-00000; (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently on the cluster.
> SQL to reproduce the issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE;
> CREATE SCHEMA dricard;
> CREATE TABLE dricard.test (col1 int);
> INSERT INTO TABLE dricard.test SELECT 1;
> SELECT * from dricard.test;
> DROP TABLE dricard.test;
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> The thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't encounter the same issue.
> Full stack trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-10000/part-00000 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-00000;
>         at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>         at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
>         at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
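The ticket already points at hive.exec.stagingdir: when the staging directory lives under /tmp, the final commit is a cross-directory move on HDFS. A commonly used mitigation, offered here as a sketch and an assumption rather than a confirmed fix for this ticket, is to keep staging inside the target table path so the final move is a rename within the same directory tree:

    # spark-defaults.conf (or --conf on spark-submit); spark.hadoop.* entries are
    # forwarded into the Hadoop/Hive configuration. ".hive-staging" is resolved
    # relative to the table location and matches Hive's own default.
    spark.hadoop.hive.exec.stagingdir=.hive-staging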
[jira] [Reopened] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
[ https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang reopened SPARK-27600:
-------------------------------

The issue is not resolved.

> Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27600
>                 URL: https://issues.apache.org/jira/browse/SPARK-27600
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
[jira] [Comment Edited] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
[ https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835437#comment-16835437 ]

pin_zhang edited comment on SPARK-27600 at 5/8/19 9:02 AM:
-----------------------------------------------------------

[~hyukjin.kwon] I think this is related to the Hive bug https://issues.apache.org/jira/browse/HIVE-6113, which notes "The exception appears when there are several processes working with Hive concurrently." Hive's fix was to upgrade the third-party DataNucleus library. Is this a Spark bug, given that Spark uses Hive 1.2.1?

was (Author: pin_zhang):
I think this is related to the Hive bug https://issues.apache.org/jira/browse/HIVE-6113, which notes "The exception appears when there are several processes working with Hive concurrently." Hive's fix was to upgrade the third-party DataNucleus library. Is this a Spark bug, given that Spark uses Hive 1.2.1?

> Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27600
>                 URL: https://issues.apache.org/jira/browse/SPARK-27600
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
[jira] [Commented] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
[ https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835437#comment-16835437 ]

pin_zhang commented on SPARK-27600:
-----------------------------------

I think this is related to the Hive bug https://issues.apache.org/jira/browse/HIVE-6113, which notes "The exception appears when there are several processes working with Hive concurrently." Hive's fix was to upgrade the third-party DataNucleus library. Is this a Spark bug, given that Spark uses Hive 1.2.1?

> Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27600
>                 URL: https://issues.apache.org/jira/browse/SPARK-27600
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
[jira] [Commented] (SPARK-27553) Operation log is not closed when close session
[ https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833044#comment-16833044 ]

pin_zhang commented on SPARK-27553:
-----------------------------------

The operation log is not closed when the session is closed.

> Operation log is not closed when close session
> -----------------------------------------------
>
>                 Key: SPARK-27553
>                 URL: https://issues.apache.org/jira/browse/SPARK-27553
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: pin_zhang
>            Priority: Major
[jira] [Created] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
pin_zhang created SPARK-27600:
------------------------------

             Summary: Unable to start Spark Hive Thrift Server when multiple hive server server share the same metastore
                 Key: SPARK-27600
                 URL: https://issues.apache.org/jira/browse/SPARK-27600
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: pin_zhang

When starting ten or more Spark Hive thrift servers at the same time, more than one version row is saved to the VERSION table when the following exception occurs:

WARN [DataNucleus.Query] (main:) Query for candidates of org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in no possible candidates
Exception thrown obtaining schema column information from datastore
org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown obtaining schema column information from datastore
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'via_ms.deleteme1556239494724' doesn't exist
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.Util.getInstance(Util.java:408)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449)
    at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381)
    at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441)
    at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339)
    at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50)
    at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337)
    at org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218)
    at org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532)
    at org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921)

After that, the Hive server cannot start any more because of MetaException(message:Metastore contains multiple versions (2)).
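The race happens while concurrently starting servers each try to create or verify the metastore schema. A mitigation sketch, under the assumption that the schema is created once up front (for example with Hive's schematool: $HIVE_HOME/bin/schematool -dbType mysql -initSchema) so that no server ever auto-creates tables; the property names are standard Hive 1.2 settings, and treating them as a workaround for this ticket is an assumption, not a confirmed fix:

    # hive-site.xml equivalents, shown in properties form: refuse implicit schema
    # creation and verify the pre-created schema version instead of writing one.
    hive.metastore.schema.verification=true
    datanucleus.autoCreateSchema=false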
[jira] [Created] (SPARK-27553) Operation log is not closed when close session
pin_zhang created SPARK-27553: - Summary: Operation log is not closed when close session Key: SPARK-27553 URL: https://issues.apache.org/jira/browse/SPARK-27553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: pin_zhang On Window 1. start spark-shell 2. start hive server in shell by org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext) 3. beeline connect to hive server 3.1 connect beeline -u jdbc:hive2://localhost:1 3.2 Run SQL show tables; 3.3 quit beeline !quit Get exception log 19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0] java.io.IOException: Unable to delete file: C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4 1-a09bdefcf888 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671) at org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643) at org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) at com.sun.proxy.$Proxy19.close(Unknown Source) at org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280) at org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76) at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237) at org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397) at org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273) at org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
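A minimal standalone Scala sketch of the Windows behavior underneath this report (illustrative names, not Spark code): a file that a writer still holds open cannot be deleted, which appears to be what HiveSessionImpl.cleanupSessionLogDir runs into when an operation log stream is never closed before the session closes.

import java.io.{File, FileWriter}

object OpenFileDeleteDemo {
  def main(args: Array[String]): Unit = {
    val log = new File("operation.log")
    val writer = new FileWriter(log) // stands in for the unclosed operation log
    writer.write("query output")
    writer.flush()
    println("delete while open:  " + log.delete()) // false on Windows: the open handle locks the file
    writer.close()
    println("delete after close: " + log.delete()) // succeeds once the handle is released
  }
}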
[jira] [Updated] (SPARK-25804) JDOPersistenceManager leak when query via JDBC
[ https://issues.apache.org/jira/browse/SPARK-25804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-25804: -- Description: 1. start-thriftserver.sh under SPARK2.3.1 2. Create Table and insert values create table test_leak (id string, index int); insert into test_leak values('id1',1) 3. Create JDBC Client query the table import java.sql.*; public class HiveClient { public static void main(String[] args) throws Exception { String driverName = "org.apache.hive.jdbc.HiveDriver"; Class.forName(driverName); Connection con = DriverManager.getConnection( "jdbc:hive2://localhost:1/default", "test", "test"); Statement stmt = con.createStatement(); String sql = "select * from test_leak"; int loop = 100; while ( loop -- > 0) { ResultSet rs = stmt.executeQuery(sql); rs.next(); System.out.println(new java.sql.Timestamp(System.currentTimeMillis()) +" : " + rs.getString(1)); rs.close(); if( loop % 100 ==0){ Thread.sleep(1); } } con.close(); } } 4. Dump HS2 heap org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing. was: 1. start-thriftserver.sh under SPARK2.3.1 2. Create Table and insert values create table test_leak (id string, index int); insert into test_leak values('id1',1) 3. Create JDBC Client query the table import java.sql.*; public class HiveClient { public static void main(String[] args) throws Exception { String driverName = "org.apache.hive.jdbc.HiveDriver"; Class.forName(driverName); Connection con = DriverManager.getConnection( "jdbc:hive2://localhost:1/default", "test", "test"); Statement stmt = con.createStatement(); String sql = "select * from test_leak"; int loop = 100; while ( loop -- > 0) { ResultSet rs = stmt.executeQuery(sql); rs.next(); System.out.println(new java.sql.Timestamp(System.currentTimeMillis()) +" : " + rs.getString(1)); rs.close(); } con.close(); } } 4. Dump HS2 heap org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing. > JDOPersistenceManager leak when query via JDBC > -- > > Key: SPARK-25804 > URL: https://issues.apache.org/jira/browse/SPARK-25804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > 1. start-thriftserver.sh under SPARK2.3.1 > 2. Create Table and insert values > create table test_leak (id string, index int); > insert into test_leak values('id1',1) > 3. Create JDBC Client query the table > import java.sql.*; > public class HiveClient { > public static void main(String[] args) throws Exception { > String driverName = "org.apache.hive.jdbc.HiveDriver"; > Class.forName(driverName); > Connection con = DriverManager.getConnection( > "jdbc:hive2://localhost:1/default", "test", "test"); > Statement stmt = con.createStatement(); > String sql = "select * from test_leak"; > int loop = 100; > while ( loop -- > 0) { > ResultSet rs = stmt.executeQuery(sql); > rs.next(); > System.out.println(new java.sql.Timestamp(System.currentTimeMillis()) +" > : " + rs.getString(1)); > rs.close(); > if( loop % 100 ==0){ > Thread.sleep(1); > } > } > con.close(); > } > } > 4. Dump HS2 heap org.datanucleus.api.jdo.JDOPersistenceManager instances keep > increasing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25804) JDOPersistenceManager leak when query via JDBC
pin_zhang created SPARK-25804: - Summary: JDOPersistenceManager leak when query via JDBC Key: SPARK-25804 URL: https://issues.apache.org/jira/browse/SPARK-25804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: pin_zhang 1. start-thriftserver.sh under SPARK2.3.1 2. Create Table and insert values create table test_leak (id string, index int); insert into test_leak values('id1',1) 3. Create a JDBC client to query the table import java.sql.*; public class HiveClient { public static void main(String[] args) throws Exception { String driverName = "org.apache.hive.jdbc.HiveDriver"; Class.forName(driverName); Connection con = DriverManager.getConnection( "jdbc:hive2://localhost:1/default", "test", "test"); Statement stmt = con.createStatement(); String sql = "select * from test_leak"; int loop = 100; while ( loop -- > 0) { ResultSet rs = stmt.executeQuery(sql); rs.next(); System.out.println(new java.sql.Timestamp(System.currentTimeMillis()) +" : " + rs.getString(1)); rs.close(); } con.close(); } } 4. Dump the HS2 heap: org.datanucleus.api.jdo.JDOPersistenceManager instances keep increasing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
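A hypothetical client-side mitigation sketch, not a fix for the server-side leak: recycle the JDBC connection after each batch of queries, so HiveServer2 closes the session and gets a chance to release per-session state. The port in the URL is a placeholder (the URLs above are truncated), and whether session close actually frees the JDOPersistenceManager instances is an assumption to verify with another heap dump.

import java.sql.DriverManager

object RecyclingClient {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val url = "jdbc:hive2://localhost:10000/default" // placeholder port, assumed
    var done = 0
    while (done < 10000) {
      val con = DriverManager.getConnection(url, "test", "test")
      val stmt = con.createStatement()
      try {
        for (_ <- 1 to 100) { // reuse one session for a batch of queries
          val rs = stmt.executeQuery("select * from test_leak")
          rs.next()
          rs.close()
          done += 1
        }
      } finally {
        stmt.close()
        con.close() // closing the session lets HS2 release session-scoped state
      }
    }
  }
}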
[jira] [Updated] (SPARK-25169) Multiple DataFrames cannot write to the same folder concurrently
[ https://issues.apache.org/jira/browse/SPARK-25169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-25169: -- Component/s: (was: Spark Core) SQL > Multiple DataFrames cannot write to the same folder concurrently > > > Key: SPARK-25169 > URL: https://issues.apache.org/jira/browse/SPARK-25169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: pin_zhang >Priority: Major > > > Seems DataFrame writer cannot support write to the same folder concurrently. > Steps to reproduce > val sc = new SparkContext(conf) > val hiveContext = new HiveContext(sc) > val source="file:///G:/home/json" > val target ="file:///G:/home/oad" > new Thread(new Runnable { > override def run(): Unit = { > hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) > Thread.sleep(1000L) > } > }).start() > new Thread(new Runnable { > override def run(): Unit = { > hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) > Thread.sleep(1000L) > } > }).start() > new Thread(new Runnable { > override def run(): Unit = { > hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) > Thread.sleep(1000L) > } > }).start() > > Meet exceptions > java.io.FileNotFoundException: File > file:/G:/home/oad/_temporary/0/task_20180821151921_0004_m_01/.part-1-463ee671-0ef0-42ff-8968-1d960bc87996-c000.json.crc > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824) > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25169) Multiple DataFrames cannot write to the same folder concurrently
pin_zhang created SPARK-25169: - Summary: Multiple DataFrames cannot write to the same folder concurrently Key: SPARK-25169 URL: https://issues.apache.org/jira/browse/SPARK-25169 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: pin_zhang It seems the DataFrame writer cannot write to the same folder concurrently. Steps to reproduce val sc = new SparkContext(conf) val hiveContext = new HiveContext(sc) val source="file:///G:/home/json" val target ="file:///G:/home/oad" new Thread(new Runnable { override def run(): Unit = { hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) Thread.sleep(1000L) } }).start() new Thread(new Runnable { override def run(): Unit = { hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) Thread.sleep(1000L) } }).start() new Thread(new Runnable { override def run(): Unit = { hiveContext.jsonFile(source).write.mode(SaveMode.Append).json(target) Thread.sleep(1000L) } }).start() We then meet exceptions: java.io.FileNotFoundException: File file:/G:/home/oad/_temporary/0/task_20180821151921_0004_m_01/.part-1-463ee671-0ef0-42ff-8968-1d960bc87996-c000.json.crc does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
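A workaround sketch under the assumption that the clash comes from the concurrent jobs sharing the same <target>/_temporary/0 directory of the output committer: serialize the appends behind one JVM-wide lock. SerializedAppend and appendJson are invented names.

import org.apache.spark.sql.{DataFrame, SaveMode}

object SerializedAppend {
  private val writeLock = new Object
  // df would be the frame read from `source`; only one append to `target` runs at a time
  def appendJson(df: DataFrame, target: String): Unit = writeLock.synchronized {
    df.write.mode(SaveMode.Append).json(target)
  }
}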
[jira] [Created] (SPARK-24749) Cannot filter array with named_struct
pin_zhang created SPARK-24749: - Summary: Cannot filter array with named_struct Key: SPARK-24749 URL: https://issues.apache.org/jira/browse/SPARK-24749 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: pin_zhang 1. Create Table create table arr__int( arr array<struct<a:int>> ) stored as parquet; 2. Insert data insert into arr__int values( array(named_struct('a', 1))); 3. Filter with struct data select * from arr__int where array_contains (arr, named_struct('a', 1)); Error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(arr__int.`arr`, named_struct('a', 1))' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members; line 1 pos 29; 'Project [*] +- 'Filter array_contains(arr#6, named_struct(a, 1)) +- SubqueryAlias arr__int +- Relation[arr#6] parquet (state=,code=0) Caused by: nullable is always false in the schema produced by named_struct -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
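A hypothetical workaround sketch that avoids array_contains and filters on the exploded struct field instead (assumes a SparkSession named spark, e.g. in spark-shell); note a row may repeat if several array elements match.

val matches = spark.sql(
  """select t.*
    |from arr__int t
    |lateral view explode(t.arr) e as item
    |where item.a = 1""".stripMargin)
matches.show()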
[jira] [Commented] (SPARK-23371) Parquet Footer data is wrong on Windows in parquet format partition table
[ https://issues.apache.org/jira/browse/SPARK-23371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373862#comment-16373862 ] pin_zhang commented on SPARK-23371: --- 1. It's Spark that bundles two versions (1.6 and 1.8) of the parquet jars in the classpath. 2. Data is written with parquet 1.6 and read with 1.8 with the steps. 3. parquet 1.6 writes a wrong footer in Spark, as it cannot load the version info on Windows OS. > Parquet Footer data is wrong on Windows in parquet format partition table > - > > Key: SPARK-23371 > URL: https://issues.apache.org/jira/browse/SPARK-23371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2 >Reporter: pin_zhang >Priority: Major > > On Windows > Run SQL in spark shell > spark.sql("create table part_test (id string )partitioned by( index int) > stored as parquet") > spark.sql("insert into part_test partition (index =1) values ('1')") > Get exception when query spark.sql("select * from part_test ").show() > For the parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the > version info in spark on Windows. Classloader try to get version in the > parquet-format-2.3.0-incubating.jar > 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because > created_by > could not be parsed (see PARQUET-251): parquet-mr > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*)) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) > at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) > at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) > at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23371) Parquet Footer data is wrong on Windows in parquet format partition table
pin_zhang created SPARK-23371: - Summary: Parquet Footer data is wrong on Windows in parquet format partition table Key: SPARK-23371 URL: https://issues.apache.org/jira/browse/SPARK-23371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.2, 2.1.1 Reporter: pin_zhang On Windows Run SQL in spark shell spark.sql("create table part_test (id string )partitioned by( index int) stored as parquet") spark.sql("insert into part_test partition (index =1) values ('1')") Get an exception when querying: spark.sql("select * from part_test ").show() The parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the version info in Spark on Windows, so the classloader tries to get the version in the parquet-format-2.3.0-incubating.jar 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*)) at org.apache.parquet.VersionParser.parse(VersionParser.java:112) at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
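A small sketch of why the parse fails, with the format taken from the warning above (the escaping of the literal parentheses is an assumption about how the logger printed the pattern): a bare "parquet-mr" created_by carries no " version ... (build ...)" tail, so nothing matches and the statistics are ignored.

object CreatedByCheck {
  def main(args: Array[String]): Unit = {
    val Format = """(.+) version ((.*) )?\(build ?(.*)\)""".r
    val createdBys = Seq(
      "parquet-mr",                            // what the 1.6-written footer carries
      "parquet-mr version 1.8.1 (build abcd)") // illustrative well-formed value
    createdBys.foreach {
      case s @ Format(_*) => println("parses: " + s)
      case s              => println("fails : " + s + " -> statistics ignored")
    }
  }
}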
[jira] [Updated] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-23086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-23086: -- Description: * Hive metastore is mysql * Set hive.server2.thrift.max.worker.threads=500 create table test (id string ) partitioned by (index int) stored as parquet; insert into test partition (index=1) values('id1'); * 100 clients run the SQL "select * from table" on the table * Many clients (97%) blocked at HiveExternalCatalog.withClient * Is synchronization expected when only running queries against tables? "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 waiting for monitor entry [0x4e19a000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) - waiting to lock <0xc06a3ba8> (a org.apache.spark.sql.hive.HiveExternalCatalog) at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667) - locked <0xc41ab748> (a org.apache.spark.sql.hive.HiveSessionCatalog) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69) - locked <0xff491c48> (a org.apache.spark.sql.execution.QueryExecution) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:231) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) at
[jira] [Created] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
pin_zhang created SPARK-23086: - Summary: Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog Key: SPARK-23086 URL: https://issues.apache.org/jira/browse/SPARK-23086 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Environment: * Spark 2.2.1 Reporter: pin_zhang * Hive metastore is mysql * Set hive.server2.thrift.max.worker.threads=500 create table test (id string ) partitioned by (index int) stored as parquet; insert into test partition (index=1) values('id1'); * 100 clients run the SQL "select * from table" on the cached table * Many clients (97%) blocked at HiveExternalCatalog.withClient * Is synchronization expected when only running queries against tables? "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 waiting for monitor entry [0x4e19a000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) - waiting to lock <0xc06a3ba8> (a org.apache.spark.sql.hive.HiveExternalCatalog) at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667) - locked <0xc41ab748> (a org.apache.spark.sql.hive.HiveSessionCatalog) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at
scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69) - locked <0xff491c48> (a org.apache.spark.sql.execution.QueryExecution) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:231) at
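A minimal sketch of the contention pattern visible in the dump (the names are illustrative, not Spark's code): every lookup funnels through one synchronized block on a shared catalog object, so even read-only queries from many threads serialize on a single monitor.

class ToyCatalog {
  private def withClient[T](body: => T): T = this.synchronized { body } // one shared monitor
  def getTable(name: String): String = withClient {
    Thread.sleep(50) // stands in for a metastore round trip
    "table[" + name + "]"
  }
}

object ContentionDemo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    val threads = (1 to 100).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = { catalog.getTable("test") } // 99 threads block here at a time
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}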
[jira] [Created] (SPARK-22420) Spark SQL return invalid json string for struct with date/datetime field
pin_zhang created SPARK-22420: - Summary: Spark SQL return invalid json string for struct with date/datetime field Key: SPARK-22420 URL: https://issues.apache.org/jira/browse/SPARK-22420 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: pin_zhang Priority: Normal Run SQL with JDBC client in spark hiveserver2 select named_struct ( 'b',current_timestamp) from test; +---+--+ | named_struct(b, current_timestamp()) | +---+--+ | {"b":2017-11-01 23:18:40.988} | The json string is invalid; the date/time value should be quoted. If the sql is run in Apache hiveserver2, the expected json string is returned select named_struct ( 'b',current_timestamp) from dummy_table ; +--+--+ | _c0| +--+--+ | {"b":"2017-11-01 23:21:24.168"} | +--+--+ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
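A sketch that feeds both outputs to a JSON parser (Jackson is assumed available, since Spark ships with it): the unquoted timestamp is rejected, the quoted one parses.

import com.fasterxml.jackson.databind.ObjectMapper

object JsonValidity {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    val sparkOutput = """{"b":2017-11-01 23:18:40.988}"""   // what Spark returns
    val hiveOutput  = """{"b":"2017-11-01 23:21:24.168"}""" // what Hive returns
    for (s <- Seq(sparkOutput, hiveOutput)) {
      try { mapper.readTree(s); println("valid   : " + s) }
      catch { case e: Exception => println("invalid : " + s + " (" + e.getClass.getSimpleName + ")") }
    }
  }
}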
[jira] [Commented] (SPARK-21437) Java Keyword cannot be used in table schema
[ https://issues.apache.org/jira/browse/SPARK-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091100#comment-16091100 ] pin_zhang commented on SPARK-21437: --- Hive doesn't have such a limitation; we can create a table with the sql "create table `long` ( `long` long)". Isn't this a Spark bug? > Java Keyword cannot be used in table schema > --- > > Key: SPARK-21437 > URL: https://issues.apache.org/jira/browse/SPARK-21437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: pin_zhang > > Java keywords doesn't work in spark211 that works in spark 201 > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.SparkSession > case class a(`const`: Int) > case class b(aa: a) > object KeyworkdsTest { > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > val sc = new SparkContext(conf) > val spark = > SparkSession.builder().enableHiveSupport().config(conf).getOrCreate() > val q = Seq(b(a(1))) > val rdd = sc.makeRDD(q) > val d = spark.createDataFrame(rdd) > } > } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21437) Java Keyword cannot be used in table schema
pin_zhang created SPARK-21437: - Summary: Java Keyword cannot be used in table schema Key: SPARK-21437 URL: https://issues.apache.org/jira/browse/SPARK-21437 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: pin_zhang Java keywords don't work in Spark 2.1.1 though they work in Spark 2.0.1 import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SparkSession case class a(`const`: Int) case class b(aa: a) object KeyworkdsTest { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("scala").setMaster("local[2]") val sc = new SparkContext(conf) val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate() val q = Seq(b(a(1))) val rdd = sc.makeRDD(q) val d = spark.createDataFrame(rdd) } } -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
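A hypothetical workaround sketch, reusing sc and spark from the snippet above: build the DataFrame from Rows with an explicit schema, so no encoder is derived by reflection from a case class whose field name is a Java keyword. Whether this sidesteps the 2.1.1 failure is an assumption.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// same shape as b(a(1)): a struct column "aa" with an int field named "const"
val schema = StructType(Seq(
  StructField("aa", StructType(Seq(StructField("const", IntegerType))))))
val rows = sc.makeRDD(Seq(Row(Row(1))))
val d = spark.createDataFrame(rows, schema)
d.printSchema()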
[jira] [Created] (SPARK-21105) Useless empty files in hive table
pin_zhang created SPARK-21105: - Summary: Useless empty files in hive table Key: SPARK-21105 URL: https://issues.apache.org/jira/browse/SPARK-21105 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1 Reporter: pin_zhang case class Base(v: Option[Double]) object EmptyFiles { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("scala").setMaster("local[12]") val ctx = new SparkContext(conf) val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate() val seq = Seq(Base(Some(1D)), Base(Some(1D))); val rdd = ctx.makeRDD[Base](seq) import spark.implicits._ rdd.toDS().write.format("json").mode(SaveMode.Append).saveAsTable("EmptyFiles") } } // The DataSet write creates a useless empty file for each empty partition. // If a small RDD is inserted into the table many times, this results in too many empty files, which slow down the query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
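A hypothetical mitigation sketch, reusing rdd and spark from the code above: collapse the mostly-empty partitions before the append, so a 2-element dataset on local[12] does not emit one file per empty partition.

import org.apache.spark.sql.SaveMode
import spark.implicits._

rdd.toDS()
  .coalesce(1) // a 2-element Seq spread over 12 local partitions would otherwise yield empty files
  .write.format("json").mode(SaveMode.Append).saveAsTable("EmptyFiles")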
[jira] [Created] (SPARK-18536) Failed to save to hive table when case class with empty field
pin_zhang created SPARK-18536: - Summary: Failed to save to hive table when case class with empty field Key: SPARK-18536 URL: https://issues.apache.org/jira/browse/SPARK-18536 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: pin_zhang import scala.collection.mutable.Queue import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SaveMode import org.apache.spark.sql.SparkSession import org.apache.spark.streaming.Seconds import org.apache.spark.streaming.StreamingContext 1. Test code case class EmptyC() case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long) object EmptyTest { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("scala").setMaster("local[2]") val ctx = new SparkContext(conf) val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate() val seq = Seq(EmptyCTable(EmptyC(), 100L)) val rdd = ctx.makeRDD[EmptyCTable](seq) val ssc = new StreamingContext(ctx, Seconds(1)) val queue = Queue(rdd) val s = ssc.queueStream(queue, false); s.foreachRDD((rdd, time) => { if (!rdd.isEmpty) { import spark.sqlContext.implicits._ rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table") } }) ssc.start() ssc.awaitTermination() } } 2. Exception Caused by: java.lang.IllegalStateException: Cannot build an empty group at org.apache.parquet.Preconditions.checkState(Preconditions.java:91) at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554) at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426) at org.apache.parquet.schema.Types$Builder.named(Types.java:228) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at org.apache.spark.sql.types.StructType.map(StructType.scala:95) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139) at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) at
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) ... 3 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang closed SPARK-17398. - Resolution: Fixed Fix Version/s: 2.0.1 > Failed to query on external JSon Partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang > Fix For: 2.0.1 > > > 1. Create External Json partitioned table > with SerDe in hive-hcatalog-core-1.2.1.jar, download from > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Query table meet exception, which works in spark1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3. Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object JsonBugs { > def main(args: Array[String]): Unit = { > val table = "test_json" > val location = "file:///g:/home/test/json" > val create = s"""CREATE EXTERNAL TABLE ${table} > (id string, seq string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE ${table} ADD > PARTITION (index=1)LOCATION '${location}/index=1' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) > if (!exist) { > hctx.sql(create) > hctx.sql(add_part) > } else { > hctx.sql("show partitions " + table).show() > } > hctx.sql("select * from test_json").show() > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-17932: -- Description: SQL "show table extended like table_name " doesn't work in spark 2.0.0 that works in spark1.5.2 Error: org.apache.spark.sql.catalyst.parser.ParseException: missing 'FUNCTIONS' at 'extended'(line 1, pos 11) == SQL == show table extended like test ---^^^ (state=,code=0) was: SQL "show table extended like table_name " doesn't work in spark 2.0.0 that works in spark1.5.2 > Failed to run SQL "show table extended like table_name" in Spark2.0.0 > --- > > Key: SPARK-17932 > URL: https://issues.apache.org/jira/browse/SPARK-17932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang > > SQL "show table extended like table_name " doesn't work in spark 2.0.0 > that works in spark1.5.2 > Error: org.apache.spark.sql.catalyst.parser.ParseException: > missing 'FUNCTIONS' at 'extended'(line 1, pos 11) > == SQL == > show table extended like test > ---^^^ (state=,code=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
pin_zhang created SPARK-17932: - Summary: Failed to run SQL "show table extended like table_name" in Spark2.0.0 Key: SPARK-17932 URL: https://issues.apache.org/jira/browse/SPARK-17932 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: pin_zhang SQL "show table extended like table_name " doesn't work in spark 2.0.0 that works in spark1.5.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
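Two alternatives that do parse in 2.0.0, as a sketch (assuming a spark-shell style SparkSession named spark); neither reproduces the full EXTENDED output.

spark.sql("show tables like 'test'").show()
spark.sql("describe extended test").show(false)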
[jira] [Commented] (SPARK-12008) Spark hive security authorization doesn't work as Apache hive's
[ https://issues.apache.org/jira/browse/SPARK-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15482828#comment-15482828 ] pin_zhang commented on SPARK-12008: --- Does Spark SQL have any plan to support authorization in the near future? > Spark hive security authorization doesn't work as Apache hive's > --- > > Key: SPARK-12008 > URL: https://issues.apache.org/jira/browse/SPARK-12008 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: pin_zhang > > Spark hive security authorization doesn't consistent with apache hive > The same hive-site.xml > > hive.security.authorization.enabled > true > > > hive.security.authorization.manager > org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory > > > hive.security.authenticator.manager > org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator > > > hive.server2.enable.doAs > true > > 1. Run spark start-thriftserver.sh, Will meet exception when run sql. >SQL standards based authorization should not be enabled from hive > cli. Instead the use of storage based authorization in hive metastore is > reccomended. >Set hive.security.authorization.enabled=false to disable authz within cli > 2. Change to start start-thriftserver.sh with hive configurations > ./start-thriftserver.sh --conf > hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory > --conf > hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator > > 3. Beeline connect with userA and create table tableA. > 4. Beeline connect with userB to truncate tableA > A) In Apache hive, truncate table get exception > Error while compiling statement: FAILED: HiveAccessControlException > Permission denied: Principal [name=userB, type=USER] does not have following > privileges for operation TRUNCATETABLE [[OBJECT OWNERSHIP] on Object > [type=TABLE_OR_VIEW, name=default.tablea]] (state=42000,code=4) > B) In Spark hive, any user that can connect to the hive, can truncate, as > long as the spark user has privileges. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466180#comment-15466180 ] pin_zhang commented on SPARK-17396: --- "Thread-1902" daemon prio=6 tid=0x14078800 nid=0x3a6c runnable [0x38d5e000] "Thread-1901" daemon prio=6 tid=0x0c64f800 nid=0x32fc runnable [0x191ef000] "Thread-1900" daemon prio=6 tid=0x14249800 nid=0x263c runnable [0x4c73e000] "Thread-1899" daemon prio=6 tid=0x14244000 nid=0x189c runnable [0x17c7e000] "Thread-1898" daemon prio=6 tid=0x0d96a800 nid=0x3e54 runnable [0x4c5ef000] "ForkJoinPool-120-worker-1" daemon prio=6 tid=0x1407d000 nid=0x2234 waiting for monitor entry [0x4c31e000] "ForkJoinPool-120-worker-3" daemon prio=6 tid=0x13a64000 nid=0x1f0c waiting for monitor entry [0x4c0de000] "ForkJoinPool-120-worker-5" daemon prio=6 tid=0x13a75800 nid=0x1660 waiting for monitor entry [0x4241e000] "ForkJoinPool-120-worker-7" daemon prio=6 tid=0x13d6c000 nid=0x117c waiting for monitor entry [0x4bece000] "ForkJoinPool-120-worker-9" daemon prio=6 tid=0x14233800 nid=0x2a20 waiting for monitor entry [0x4bd3e000] "ForkJoinPool-120-worker-11" daemon prio=6 tid=0x1423f800 nid=0x3568 waiting for monitor entry [0x4afae000] "ForkJoinPool-120-worker-13" daemon prio=6 tid=0x1424e000 nid=0x378c waiting for monitor entry [0x4bc0e000] "ForkJoinPool-120-worker-15" daemon prio=6 tid=0x14238000 nid=0x1b8c waiting for monitor entry [0x18dfd000] "ForkJoinPool-119-worker-1" daemon prio=6 tid=0x13d74800 nid=0x29a0 waiting for monitor entry [0x4bade000] "ForkJoinPool-119-worker-3" daemon prio=6 tid=0x12cd4000 nid=0x18a0 in Object.wait() [0x4b9ae000] "ForkJoinPool-119-worker-7" daemon prio=6 tid=0x12cd3000 nid=0x15ec waiting for monitor entry [0x4b87d000] "ForkJoinPool-119-worker-5" daemon prio=6 tid=0x13bbd800 nid=0x2c24 waiting for monitor entry [0x4b76d000] "ForkJoinPool-119-worker-9" daemon prio=6 tid=0x13bc9800 nid=0x3d78 waiting for monitor entry [0x2acae000] "ForkJoinPool-119-worker-11" daemon prio=6 tid=0x0d9eb000 nid=0x3f40 waiting for monitor entry [0x4b57e000] "ForkJoinPool-119-worker-13" daemon prio=6 tid=0x0d9e4800 nid=0x286c waiting for monitor entry [0x4b40e000] "ForkJoinPool-119-worker-15" daemon prio=6 tid=0x0d9e9000 nid=0x2304 in Object.wait() [0x194de000] "ForkJoinPool-118-worker-1" daemon prio=6 tid=0x14077000 nid=0x3a50 runnable [0x393dd000] "ForkJoinPool-118-worker-3" daemon prio=6 tid=0x1407a000 nid=0x1dc0 runnable [0x2331d000] "ForkJoinPool-118-worker-5" daemon prio=6 tid=0x0d2f9000 nid=0x2990 runnable [0x1b6fd000] "ForkJoinPool-118-worker-7" daemon prio=6 tid=0x0d2df800 nid=0x3bb4 runnable [0x4a9dd000] "ForkJoinPool-118-worker-9" daemon prio=6 tid=0x0d2f7800 nid=0x37e4 waiting for monitor entry [0x2bf5e000] "ForkJoinPool-118-worker-11" daemon prio=6 tid=0x12648000 nid=0x2878 runnable [0x2b26d000] "ForkJoinPool-118-worker-13" daemon prio=6 tid=0x12646000 nid=0x4cc waiting for monitor entry [0x183de000] "ForkJoinPool-118-worker-15" daemon prio=6 tid=0x12647800 nid=0x30c8 waiting for monitor entry [0x2bd3d000] "ForkJoinPool-117-worker-5" daemon prio=6 tid=0x12b5c800 nid=0x3510 waiting for monitor entry [0x4b2be000] "ForkJoinPool-117-worker-1" daemon prio=6 tid=0x12b5d000 nid=0x36b8 waiting for monitor entry [0x4b11e000] "ForkJoinPool-117-worker-3" daemon prio=6 tid=0x12eac800 nid=0x32d4 in Object.wait() [0x4acae000] "ForkJoinPool-117-worker-7" daemon prio=6 tid=0x12ea9800 nid=0x16c4 waiting for monitor entry [0x4ab1e000] "ForkJoinPool-117-worker-9" daemon prio=6 tid=0x12e9b000 nid=0x1e44 waiting for monitor 
entry [0x2162e000] "ForkJoinPool-117-worker-11" daemon prio=6 tid=0x13bcc000 nid=0x37f4 waiting for monitor entry [0x40dee000] "ForkJoinPool-117-worker-13" daemon prio=6 tid=0x13bcb000 nid=0x361c in Object.wait() [0x35dbe000] "ForkJoinPool-117-worker-15" daemon prio=6 tid=0x13bca800 nid=0x3344 in Object.wait() [0x2c0ce000] "ForkJoinPool-116-worker-1" daemon prio=6 tid=0x13bc9000 nid=0x3a34 runnable [0x4867d000] "ForkJoinPool-116-worker-3" daemon prio=6 tid=0x13bc8000 nid=0x1c10 in Object.wait() [0x4a8be000] "ForkJoinPool-116-worker-7" daemon prio=6 tid=0x13bc7800 nid=0x2910 waiting on condition [0x45e7f000] "ForkJoinPool-116-worker-5" daemon prio=6 tid=0x13bc6800 nid=0x3b1c waiting for monitor entry
[jira] [Commented] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464602#comment-15464602 ] pin_zhang commented on SPARK-17396: --- 1. Thousands of threads are created that look like: "ForkJoinPool-20-worker-9" #329 daemon prio=5 os_prio=0 tid=0x0ac87000 nid=0x3d43 waiting on condition [0x5069f000] "ForkJoinPool-19-worker-3" #324 daemon prio=5 os_prio=0 tid=0x0ae6 nid=0x3c2a waiting on condition [0x5039c000] 2. The threads should be created by UnionRDD > Threads number keep increasing when query on external CSV partitioned table > --- > > Key: SPARK-17396 > URL: https://issues.apache.org/jira/browse/SPARK-17396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: pin_zhang > > 1. Create a external partitioned table row format CSV > 2. Add 16 partitions to the table > 3. Run SQL "select count(*) from test_csv" > 4. ForkJoinThread number keep increasing > This happened when table partitions number greater than 10. > 5. Test Code > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.sql.hive.HiveContext > object Bugs { > def main(args: Array[String]): Unit = { > val location = "file:///g:/home/test/csv" > val create = s"""CREATE EXTERNAL TABLE test_csv > (ID string, SEQ string ) > PARTITIONED BY(index int) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > LOCATION "${location}" > """ > val add_part = s""" > ALTER TABLE test_csv ADD > PARTITION (index=1)LOCATION '${location}/index=1' > PARTITION (index=2)LOCATION '${location}/index=2' > PARTITION (index=3)LOCATION '${location}/index=3' > PARTITION (index=4)LOCATION '${location}/index=4' > PARTITION (index=5)LOCATION '${location}/index=5' > PARTITION (index=6)LOCATION '${location}/index=6' > PARTITION (index=7)LOCATION '${location}/index=7' > PARTITION (index=8)LOCATION '${location}/index=8' > PARTITION (index=9)LOCATION '${location}/index=9' > PARTITION (index=10)LOCATION '${location}/index=10' > PARTITION (index=11)LOCATION '${location}/index=11' > PARTITION (index=12)LOCATION '${location}/index=12' > PARTITION (index=13)LOCATION '${location}/index=13' > PARTITION (index=14)LOCATION '${location}/index=14' > PARTITION (index=15)LOCATION '${location}/index=15' > PARTITION (index=16)LOCATION '${location}/index=16' > """ > val conf = new SparkConf().setAppName("scala").setMaster("local[2]") > conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") > val ctx = new SparkContext(conf) > val hctx = new HiveContext(ctx) > hctx.sql(create) > hctx.sql(add_part) > for (i <- 1 to 6) { > new Query(hctx).start() > } > } > class Query(htcx: HiveContext) extends Thread { > setName("Query-Thread") > override def run = { > while (true) { > htcx.sql("select count(*) from test_csv").show() > Thread.sleep(100) > } > } > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
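A sketch of one way to test the UnionRDD theory above: UnionRDD of that era switches to a ForkJoinPool for partition listing once it unions more than spark.rdd.parallelListingThreshold (default 10) RDDs, which would match the "greater than 10 partitions" observation. The property name is an assumption recalled from the Spark 2.x source; raising it should keep the listing on the calling thread.

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
conf.set("spark.rdd.parallelListingThreshold", "1000") // keep partition listing sequential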
[jira] [Created] (SPARK-17398) Failed to query on external JSon Partitioned table
pin_zhang created SPARK-17398: - Summary: Failed to query on external JSon Partitioned table Key: SPARK-17398 URL: https://issues.apache.org/jira/browse/SPARK-17398 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: pin_zhang 1. Create External Json partitioned table with SerDe in hive-hcatalog-core-1.2.1.jar, download from https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 2. Querying the table meets an exception; the same query works in spark1.5.2 Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord at org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) 3. Test Code import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.hive.HiveContext object JsonBugs { def main(args: Array[String]): Unit = { val table = "test_json" val location = "file:///g:/home/test/json" val create = s"""CREATE EXTERNAL TABLE ${table} (id string, seq string ) PARTITIONED BY(index int) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION "${location}" """ val add_part = s""" ALTER TABLE ${table} ADD PARTITION (index=1)LOCATION '${location}/index=1' """ val conf = new SparkConf().setAppName("scala").setMaster("local[2]") conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") val ctx = new SparkContext(conf) val hctx = new HiveContext(ctx) val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table) if (!exist) { hctx.sql(create) hctx.sql(add_part) } else { hctx.sql("show partitions " + table).show() } hctx.sql("select * from test_json").show() } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table
pin_zhang created SPARK-17396: - Summary: Threads number keep increasing when query on external CSV partitioned table Key: SPARK-17396 URL: https://issues.apache.org/jira/browse/SPARK-17396 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: pin_zhang 1. Create a external partitioned table row format CSV 2. Add 16 partitions to the table 3. Run SQL "select count(*) from test_csv" 4. The ForkJoinPool thread number keeps increasing. This happened when the table partition count is greater than 10. 5. Test Code import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.hive.HiveContext object Bugs { def main(args: Array[String]): Unit = { val location = "file:///g:/home/test/csv" val create = s"""CREATE EXTERNAL TABLE test_csv (ID string, SEQ string ) PARTITIONED BY(index int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION "${location}" """ val add_part = s""" ALTER TABLE test_csv ADD PARTITION (index=1)LOCATION '${location}/index=1' PARTITION (index=2)LOCATION '${location}/index=2' PARTITION (index=3)LOCATION '${location}/index=3' PARTITION (index=4)LOCATION '${location}/index=4' PARTITION (index=5)LOCATION '${location}/index=5' PARTITION (index=6)LOCATION '${location}/index=6' PARTITION (index=7)LOCATION '${location}/index=7' PARTITION (index=8)LOCATION '${location}/index=8' PARTITION (index=9)LOCATION '${location}/index=9' PARTITION (index=10)LOCATION '${location}/index=10' PARTITION (index=11)LOCATION '${location}/index=11' PARTITION (index=12)LOCATION '${location}/index=12' PARTITION (index=13)LOCATION '${location}/index=13' PARTITION (index=14)LOCATION '${location}/index=14' PARTITION (index=15)LOCATION '${location}/index=15' PARTITION (index=16)LOCATION '${location}/index=16' """ val conf = new SparkConf().setAppName("scala").setMaster("local[2]") conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse") val ctx = new SparkContext(conf) val hctx = new HiveContext(ctx) hctx.sql(create) hctx.sql(add_part) for (i <- 1 to 6) { new Query(hctx).start() } } class Query(htcx: HiveContext) extends Thread { setName("Query-Thread") override def run = { while (true) { htcx.sql("select count(*) from test_csv").show() Thread.sleep(100) } } } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17395) Queries on CSV partition table result in frequent GC
pin_zhang created SPARK-17395: - Summary: Queries on CSV partition table result in frequent GC Key: SPARK-17395 URL: https://issues.apache.org/jira/browse/SPARK-17395 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.2, 1.5.2 Reporter: pin_zhang 1. Create external partitioned table and run sqls against the table 2. Run the queries for a while; the driver JVM does frequent GC, and increasing the heap size won't resolve this issue. 3. Test code import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 object Bugs { def main(args: Array[String]): Unit = { val location = "file:///g:/home/test/csv" val create = s"""CREATE EXTERNAL TABLE test_csv (ID string, SEQ string ) PARTITIONED BY(index int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION "${location}" """ val add_part = s""" ALTER TABLE test_csv ADD PARTITION (index=1)LOCATION '${location}/index=1' PARTITION (index=2)LOCATION '${location}/index=2' PARTITION (index=3)LOCATION '${location}/index=3' PARTITION (index=4)LOCATION '${location}/index=4' PARTITION (index=5)LOCATION '${location}/index=5' PARTITION (index=6)LOCATION '${location}/index=6' PARTITION (index=7)LOCATION '${location}/index=7' PARTITION (index=8)LOCATION '${location}/index=8' PARTITION (index=9)LOCATION '${location}/index=9' PARTITION (index=10)LOCATION '${location}/index=10' PARTITION (index=11)LOCATION '${location}/index=11' PARTITION (index=12)LOCATION '${location}/index=12' PARTITION (index=13)LOCATION '${location}/index=13' PARTITION (index=14)LOCATION '${location}/index=14' PARTITION (index=15)LOCATION '${location}/index=15' PARTITION (index=16)LOCATION '${location}/index=16' """ val conf = new SparkConf().setAppName("scala").setMaster("local[2]") val ctx = new SparkContext(conf) val hctx = new HiveContext(ctx) hctx.sql(create) hctx.sql(add_part) for (i <- 1 to 6) { new Query(hctx).start() } } class Query(htcx: HiveContext) extends Thread { setName("Query-Thread") override def run = { while (true) { htcx.sql("select count(*) from test_csv").show() Thread.sleep(100) } } } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358916#comment-15358916 ]

pin_zhang commented on SPARK-9686:
----------------------------------

Any plan to fix this bug?

> Spark Thrift server doesn't return correct JDBC metadata
> ---------------------------------------------------------
>
>                 Key: SPARK-9686
>                 URL: https://issues.apache.org/jira/browse/SPARK-9686
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>            Reporter: pin_zhang
>            Assignee: Cheng Lian
>            Priority: Critical
>         Attachments: SPARK-9686.1.patch.txt
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5. Run:
>     Class.forName("org.apache.hive.jdbc.HiveDriver");
>     String URL = "jdbc:hive2://localhost:1/default";
>     Properties info = new Properties();
>     Connection conn = DriverManager.getConnection(URL, info);
>     ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, null, null);
> Problem:
>     No tables are returned by this API, which works in Spark 1.3.
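Until getTables() behaves, a possible workaround is to go through a plain Statement. A sketch under the assumption that SHOW TABLES still works over the Thrift server; the host and port are placeholders, not values from the report.

{code:scala}
import java.sql.DriverManager

object ShowTablesWorkaround {
  def main(args: Array[String]): Unit = {
    // Placeholder connection settings; adjust to the actual Thrift server.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      // SHOW TABLES goes through query execution, not the broken
      // DatabaseMetaData.getTables() path.
      val rs = conn.createStatement().executeQuery("show tables")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}
{code}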
[jira] [Created] (SPARK-12262) describe extended doesn't return detailed info for tables stored as PARQUET format
pin_zhang created SPARK-12262:
----------------------------------

             Summary: describe extended doesn't return detailed info for tables stored as PARQUET format
                 Key: SPARK-12262
                 URL: https://issues.apache.org/jira/browse/SPARK-12262
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: pin_zhang

1. Start hive server with start-thriftserver.sh
2. create table table1 (id int); create table table2 (id int) STORED AS PARQUET;
3. describe extended table1; returns detailed info
4. describe extended table2; the result has no detailed info
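A cross-check that may help triage: run both DESCRIBE variants over JDBC and compare the output. A sketch with placeholder connection settings; whether DESCRIBE FORMATTED fares any better than DESCRIBE EXTENDED on the Parquet table in 1.5.2 is untested here.

{code:scala}
import java.sql.DriverManager

object DescribeCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder connection settings; table names match the repro above.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    try {
      for (sql <- Seq("describe extended table2", "describe formatted table2")) {
        println(s"== $sql ==")
        val rs = conn.createStatement().executeQuery(sql)
        while (rs.next()) println(s"${rs.getString(1)}\t${rs.getString(2)}")
      }
    } finally {
      conn.close()
    }
  }
}
{code}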
[jira] [Closed] (SPARK-10290) Spark can register temp table and hive table with the same table name
[ https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang closed SPARK-10290.
-----------------------------

not a bug

> Spark can register temp table and hive table with the same table name
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10290
>                 URL: https://issues.apache.org/jira/browse/SPARK-10290
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: pin_zhang
>
> Spark SQL allows creating a hive table and registering a temp table with the same name;
> there is then no way to query the hive table, as in the following code:
>
>     // register hive table
>     DataFrame df = hctx_.read().json("test.json");
>     df.write().mode(SaveMode.Overwrite).saveAsTable("test");
>     // register temp table
>     hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test");
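A sketch of the reasoning behind the not-a-bug resolution: the temp table shadows the unqualified name, while the hive table should stay reachable through its database-qualified name. The qualified-lookup behavior is an assumption about the 1.4.x catalog, not verified here.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Disambiguate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disambiguate").setMaster("local[2]"))
    val hctx = new HiveContext(sc)
    // Shadow the hive table "test" with a temp table of the same name.
    hctx.sql("select id from test").registerTempTable("test")
    hctx.sql("select id from test").show()          // resolves to the temp table
    hctx.sql("select id from default.test").show()  // assumed to reach the hive table
  }
}
{code}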
[jira] [Commented] (SPARK-12008) Spark hive security authorization doesn't work as Apache hive's
[ https://issues.apache.org/jira/browse/SPARK-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035159#comment-15035159 ]

pin_zhang commented on SPARK-12008:
-----------------------------------

Any comments?

> Spark hive security authorization doesn't work as Apache hive's
> ----------------------------------------------------------------
>
>                 Key: SPARK-12008
>                 URL: https://issues.apache.org/jira/browse/SPARK-12008
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: pin_zhang
>
> Spark's hive security authorization isn't consistent with Apache Hive's, given the same hive-site.xml:
>
>   <property>
>     <name>hive.security.authorization.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.security.authorization.manager</name>
>     <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
>   </property>
>   <property>
>     <name>hive.security.authenticator.manager</name>
>     <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
>   </property>
>   <property>
>     <name>hive.server2.enable.doAs</name>
>     <value>true</value>
>   </property>
>
> 1. Run spark start-thriftserver.sh; you will get an exception when running SQL:
>    SQL standards based authorization should not be enabled from hive cli. Instead the use of storage based authorization in hive metastore is recommended. Set hive.security.authorization.enabled=false to disable authz within cli
> 2. Change to start start-thriftserver.sh with hive configurations:
>    ./start-thriftserver.sh --conf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --conf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
> 3. Connect with beeline as userA and create table tableA.
> 4. Connect with beeline as userB and truncate tableA:
>    A) In Apache Hive, truncate table gets an exception:
>       Error while compiling statement: FAILED: HiveAccessControlException Permission denied: Principal [name=userB, type=USER] does not have following privileges for operation TRUNCATETABLE [[OBJECT OWNERSHIP] on Object [type=TABLE_OR_VIEW, name=default.tablea]] (state=42000,code=4)
>    B) In Spark hive, any user that can connect can truncate, as long as the spark user has privileges.
[jira] [Created] (SPARK-12008) Spark hive security authorization doesn't work as Apache hive's
pin_zhang created SPARK-12008:
----------------------------------

             Summary: Spark hive security authorization doesn't work as Apache hive's
                 Key: SPARK-12008
                 URL: https://issues.apache.org/jira/browse/SPARK-12008
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
            Reporter: pin_zhang

Spark's hive security authorization isn't consistent with Apache Hive's, given the same hive-site.xml:

  <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.security.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
  </property>
  <property>
    <name>hive.security.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

1. Run spark start-thriftserver.sh; you will get an exception when running SQL:
   SQL standards based authorization should not be enabled from hive cli. Instead the use of storage based authorization in hive metastore is recommended. Set hive.security.authorization.enabled=false to disable authz within cli
2. Change to start start-thriftserver.sh with hive configurations:
   ./start-thriftserver.sh --conf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --conf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
3. Connect with beeline as userA and create table tableA.
4. Connect with beeline as userB and truncate tableA:
   A) In Apache Hive, truncate table gets an exception:
      Error while compiling statement: FAILED: HiveAccessControlException Permission denied: Principal [name=userB, type=USER] does not have following privileges for operation TRUNCATETABLE [[OBJECT OWNERSHIP] on Object [type=TABLE_OR_VIEW, name=default.tablea]] (state=42000,code=4)
   B) In Spark hive, any user that can connect can truncate, as long as the spark user has privileges.
[jira] [Commented] (SPARK-11748) Result is null after alter column name of table stored as Parquet
[ https://issues.apache.org/jira/browse/SPARK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013030#comment-15013030 ]

pin_zhang commented on SPARK-11748:
-----------------------------------

Apache Hive 0.14 added support for Parquet column rename (https://issues.apache.org/jira/browse/HIVE-6938), but that doesn't work in Spark hive.

> Result is null after alter column name of table stored as Parquet
> ------------------------------------------------------------------
>
>                 Key: SPARK-11748
>                 URL: https://issues.apache.org/jira/browse/SPARK-11748
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: pin_zhang
>
> 1. Test with the following code
>     hctx.sql("create table " + table + " (id int, str string) STORED AS PARQUET")
>     val df = hctx.jsonFile("g:/vip.json")
>     df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table)
>     hctx.sql("select * from " + table).show()
>     // alter table
>     val alter = "alter table " + table + " CHANGE id i_d int"
>     hctx.sql(alter)
>     hctx.sql("select * from " + table).show()
> 2. Result
> After changing the table column name, data is null for the changed column.
> Result before alter table
> +---+---+
> | id|str|
> +---+---+
> |  1| s1|
> |  2| s2|
> +---+---+
> Result after alter table
> +----+---+
> | i_d|str|
> +----+---+
> |null| s1|
> |null| s2|
> +----+---+
[jira] [Created] (SPARK-11748) Result is null after alter column name of table stored as Parquet
pin_zhang created SPARK-11748:
----------------------------------

             Summary: Result is null after alter column name of table stored as Parquet
                 Key: SPARK-11748
                 URL: https://issues.apache.org/jira/browse/SPARK-11748
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.1
            Reporter: pin_zhang

1. Test with the following code

    hctx.sql("create table " + table + " (id int, str string) STORED AS PARQUET")
    val df = hctx.jsonFile("g:/vip.json")
    df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table)
    hctx.sql("select * from " + table).show()
    // alter table
    val alter = "alter table " + table + " CHANGE id i_d int"
    hctx.sql(alter)
    hctx.sql("select * from " + table).show()

2. Result
After changing the table column name, data is null for the changed column.
Result before alter table
+---+---+
| id|str|
+---+---+
|  1| s1|
|  2| s2|
+---+---+
Result after alter table
+----+---+
| i_d|str|
+----+---+
|null| s1|
|null| s2|
+----+---+
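A possible workaround sketch, untested against 1.5.1: since the Parquet footers still carry the original column name, read the files directly and rename the column on the DataFrame. The warehouse path is a placeholder.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RenameWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rename").setMaster("local[2]"))
    val hctx = new HiveContext(sc)
    // Placeholder path to the table's data directory. The footers still
    // store the original column name "id", so read with it and rename.
    val df = hctx.read.parquet("/user/hive/warehouse/mytable")
    df.withColumnRenamed("id", "i_d").show()
  }
}
{code}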
[jira] [Created] (SPARK-10290) Spark can register temp table and hive table with the same table name
pin_zhang created SPARK-10290:
----------------------------------

             Summary: Spark can register temp table and hive table with the same table name
                 Key: SPARK-10290
                 URL: https://issues.apache.org/jira/browse/SPARK-10290
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: pin_zhang

Spark SQL allows creating a hive table and registering a temp table with the same name;
there is then no way to query the hive table, as in the following code:

    // register hive table
    DataFrame df = hctx_.read().json("test.json");
    df.write().mode(SaveMode.Overwrite).saveAsTable("test");
    // register temp table
    hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test");
[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704374#comment-14704374 ]

pin_zhang commented on SPARK-9686:
----------------------------------

What's the status of this bug? Will it be fixed in 1.4.x?

Spark hive jdbc client cannot get table from metadata store
-----------------------------------------------------------

                Key: SPARK-9686
                URL: https://issues.apache.org/jira/browse/SPARK-9686
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.4.0, 1.4.1
           Reporter: pin_zhang
           Assignee: Cheng Lian

1. Start start-thriftserver.sh
2. Connect with beeline
3. Create a table
4. Show tables; the newly created table is returned
5. Run:
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String URL = "jdbc:hive2://localhost:1/default";
    Properties info = new Properties();
    Connection conn = DriverManager.getConnection(URL, info);
    ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, null, null);
Problem:
    No tables are returned by this API, which works in Spark 1.3.
[jira] [Created] (SPARK-9686) Spark hive jdbc client cannot get table from metadata
pin_zhang created SPARK-9686:
---------------------------------

             Summary: Spark hive jdbc client cannot get table from metadata
                 Key: SPARK-9686
                 URL: https://issues.apache.org/jira/browse/SPARK-9686
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1, 1.4.0
            Reporter: pin_zhang

1. Start start-thriftserver.sh
2. Connect with beeline
3. Create a table
4. Show tables; the newly created table is returned
5. Run:
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String URL = "jdbc:hive2://localhost:1/default";
    Properties info = new Properties();
    Connection conn = DriverManager.getConnection(URL, info);
    ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, null, null);
Problem:
    No tables are returned by this API, which works in Spark 1.3.
[jira] [Updated] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-9686:
-----------------------------
    Summary: Spark hive jdbc client cannot get table from metadata store (was: Spark hive jdbc client cannot get table from metadata)

Spark hive jdbc client cannot get table from metadata store
-----------------------------------------------------------

                Key: SPARK-9686
                URL: https://issues.apache.org/jira/browse/SPARK-9686
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.4.0, 1.4.1
           Reporter: pin_zhang

1. Start start-thriftserver.sh
2. Connect with beeline
3. Create a table
4. Show tables; the newly created table is returned
5. Run:
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String URL = "jdbc:hive2://localhost:1/default";
    Properties info = new Properties();
    Connection conn = DriverManager.getConnection(URL, info);
    ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, null, null);
Problem:
    No tables are returned by this API, which works in Spark 1.3.
[jira] [Created] (SPARK-7480) Get exception when DataFrame saveAsTable and run sql on the same table at the same time
pin_zhang created SPARK-7480:
---------------------------------

             Summary: Get exception when DataFrame saveAsTable and run sql on the same table at the same time
                 Key: SPARK-7480
                 URL: https://issues.apache.org/jira/browse/SPARK-7480
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1, 1.3.0
            Reporter: pin_zhang

There is a case:
1) In the main thread, call DataFrame.saveAsTable(table, SaveMode.Overwrite) to save a json RDD to a hive table.
2) In another thread, run SQL on the table simultaneously.
You can see many exceptions indicating that the table does not exist or that the table is not complete.
Does Spark SQL support such usage? Thanks

[Main Thread]
    DataFrame df = hiveContext_.jsonFile("test.json");
    String table = "UNIT_TEST";
    while (true) {
        df = hiveContext_.jsonFile("test.json");
        df.saveAsTable(table, SaveMode.Overwrite);
        System.out.println(new Timestamp(System.currentTimeMillis()) + " ["
                + Thread.currentThread().getName() + "] override table");
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

[Query Thread]
    DataFrame query = hiveContext_.sql("select * from UNIT_TEST");
    Row[] rows = query.collect();
    System.out.println(new Timestamp(System.currentTimeMillis()) + " ["
            + Thread.currentThread().getName() + "] [query result count:] " + rows.length);

[Exceptions in log]
15/05/08 16:05:49 ERROR Hive: NoSuchObjectException(message:default.unit_test table not found)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
    at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
    at com.sun.proxy.$Proxy20.get_table(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
    at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy21.getTable(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:201)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:262)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:262)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:174)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
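Pending an answer on whether Spark SQL supports this, one defensive option is to serialize the Overwrite against the readers inside the application. A minimal sketch, assuming both threads run in the same driver JVM as in the repro above; it does not help across processes.

{code:scala}
import java.util.concurrent.locks.ReentrantReadWriteLock

object TableLock {
  private val lock = new ReentrantReadWriteLock()

  // The writer takes the exclusive lock around the Overwrite...
  def withOverwrite[T](body: => T): T = {
    lock.writeLock().lock()
    try body finally lock.writeLock().unlock()
  }

  // ...and readers share the read lock, so no query runs while the table
  // is dropped but not yet recreated.
  def withQuery[T](body: => T): T = {
    lock.readLock().lock()
    try body finally lock.readLock().unlock()
  }
}
{code}

The main thread would wrap saveAsTable(...) in TableLock.withOverwrite { ... } and the query thread would wrap collect() in TableLock.withQuery { ... }, closing the window in which the table exists only partially.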
[jira] [Commented] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14521277#comment-14521277 ]

pin_zhang commented on SPARK-6923:
----------------------------------

Hi, Cheng Hao, thanks for your reply!
Do you mean that if a wrapper is provided for the datasource API, the Hive Storage Handler can get the datasource table schema correctly for external applications via the Hive API?
If so, can it be fixed in Spark 1.3.x?

Spark SQL CLI does not read Data Source schema correctly
--------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang
           Priority: Blocker

{code:java}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code}

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Issue Comment Deleted] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pin_zhang updated SPARK-6923:
-----------------------------
    Comment: was deleted

(was: Hi, Cheng Hao, thanks for your reply!
Do you mean that if a wrapper is provided for the datasource API, the Hive Storage Handler can get the datasource table schema correctly for external applications via the Hive API?
If so, can it be fixed in Spark 1.3.x?)

Spark SQL CLI does not read Data Source schema correctly
--------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang
           Priority: Blocker

{code:java}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code}

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Comment Edited] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510381#comment-14510381 ]

pin_zhang edited comment on SPARK-6923 at 4/27/15 9:43 AM:
-----------------------------------------------------------

Hi, Michael
Is it possible for this CLI bug to be fixed in Spark 1.3? Please help to comment. Thanks

was (Author: pin_zhang):
Hi, Michael
Can this CLI bug be fixed in Spark 1.3?

Spark SQL CLI does not read Data Source schema correctly
--------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Commented] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510381#comment-14510381 ]

pin_zhang commented on SPARK-6923:
----------------------------------

Hi, Michael
Can this CLI bug be fixed in Spark 1.3?

Spark SQL CLI does not read Data Source schema correctly
--------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507182#comment-14507182 ]

pin_zhang commented on SPARK-6923:
----------------------------------

Hi, Michael
Can you help to comment? We have a usage pattern of querying a hive table that is generated from a DataFrame.

Get invalid hive table columns after save DataFrame to hive table
------------------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504409#comment-14504409 ]

pin_zhang commented on SPARK-6923:
----------------------------------

Hi, Michael
We run the spark app on Spark 1.3 and use the CLIService in HiveServer2 to get the table schema; the call stack to get the schema is as below:

HiveMetaStore$HMSHandler.get_fields(String, String) line: 2873
HiveMetaStore$HMSHandler.get_schema(String, String) line: 2946
NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
NativeMethodAccessorImpl.invoke(Object, Object[]) line: 57
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 43
Method.invoke(Object, Object...) line: 606
RetryingHMSHandler.invoke(Object, Method, Object[]) line: 105
$Proxy9.get_schema(String, String) line: not available
HiveMetaStoreClient.getSchema(String, String) line: 1269
GetColumnsOperation.run() line: 139
HiveSessionImplwithUGI(HiveSessionImpl).getColumns(String, String, String, String) line: 359
NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
NativeMethodAccessorImpl.invoke(Object, Object[]) line: 57
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 43
Method.invoke(Object, Object...) line: 606
HiveSessionProxy.invoke(Method, Object[]) line: 79
HiveSessionProxy.access$000(HiveSessionProxy, Method, Object[]) line: 37
HiveSessionProxy$1.run() line: 64
AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
Hadoop23Shims(HadoopShimsSecure).doAs(UserGroupInformation, PrivilegedExceptionAction<T>) line: 493
HiveSessionProxy.invoke(Object, Method, Object[]) line: 60
$Proxy17.getColumns(String, String, String, String) line: not available
SparkSQLCLIService(CLIService).getColumns(SessionHandle, String, String, String, String) line: 309
ThriftBinaryCLIService(ThriftCLIService).GetColumns(TGetColumnsReq) line: 433
TCLIService$Processor$GetColumns<I>.getResult(I, GetColumns_args) line: 1433
TCLIService$Processor$GetColumns<I>.getResult(Object, TBase) line: 1418
TCLIService$Processor$GetColumns<I>(ProcessFunction<I,T>).process(int, TProtocol, TProtocol, I) line: 39
TSetIpAddressProcessor<I>(TBaseProcessor<I>).process(TProtocol, TProtocol) line: 39
TSetIpAddressProcessor<I>.process(TProtocol, TProtocol) line: 55
TThreadPoolServer$WorkerProcess.run() line: 206
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
ThreadPoolExecutor$Worker.run() line: 615
Thread.run() line: 745

Don't you think this method should return the same table schema as the hctx.table(tableName).schema that you mentioned?

Get invalid hive table columns after save DataFrame to hive table
------------------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
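For contrast with the metastore route in the stack above, the in-process route mentioned in the comment does return the real fields. A minimal sketch, assuming a hive table named "test" created as in the report:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SchemaRoutes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema").setMaster("local[2]"))
    val hctx = new HiveContext(sc)
    // In-process catalyst schema: prints the real fields (id, age), unlike
    // the metastore route, which returns the single col/array<string> column.
    hctx.table("test").schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
  }
}
{code}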
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497701#comment-14497701 ]

pin_zhang commented on SPARK-6923:
----------------------------------

In Spark 1.1.0, the client gets the table schema age(bigint), id(string) with the JDBC API, while in Spark 1.3.0 it gets {name=col, type=array<string>}. That's not expected.

    List<Map<String, String>> results = new ArrayList<Map<String, String>>();
    DatabaseMetaData meta = cnn.getMetaData();
    ResultSet rsColumns = meta.getColumns(database, null, table, null);
    while (rsColumns.next()) {
        Map<String, String> col = new HashMap<String, String>();
        col.put("name", rsColumns.getString("COLUMN_NAME"));
        String typeName = rsColumns.getString("TYPE_NAME");
        col.put("type", typeName);
        results.add(col);
    }
    rsColumns.close();

Get invalid hive table columns after save DataFrame to hive table
------------------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499141#comment-14499141 ]

pin_zhang commented on SPARK-6923:
----------------------------------

Do you mean that if the data frame is saved to a table created with the new datasource API, the hive table won't support the JDBC API DatabaseMetaData.getColumns(database, null, table, null) for getting the table columns corresponding to the data frame fields?

Get invalid hive table columns after save DataFrame to hive table
------------------------------------------------------------------

                Key: SPARK-6923
                URL: https://issues.apache.org/jira/browse/SPARK-6923
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.3.0
           Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.
[jira] [Created] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
pin_zhang created SPARK-6923:
---------------------------------

             Summary: Get invalid hive table columns after save DataFrame to hive table
                 Key: SPARK-6923
                 URL: https://issues.apache.org/jira/browse/SPARK-6923
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: pin_zhang

HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());

With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col':
[FieldSchema(name:col, type:array<string>, comment:from deserializer)]
The expected return is the fields schema id, age.
This means the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern),
but the resultset metadata for the query "select * from test" contains the fields id, age.