Bruce Robbins created SPARK-43149:
-------------------------------------

             Summary: When CREATE USING fails to store metadata in metastore, data gets left around
                 Key: SPARK-43149
                 URL: https://issues.apache.org/jira/browse/SPARK-43149
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Bruce Robbins
For example:
{noformat}
drop table if exists parquet_ds1;

-- try creating table with invalid column name
-- use 'using parquet' to designate the data source
create table parquet_ds1
using parquet
as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)

-- show that table did not get created
show tables;

-- try again with valid column name
-- spark will complain that directory already exists
create table parquet_ds1
using parquet
as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);

[LOCATION_ALREADY_EXISTS] Cannot name the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already exists. Please pick a different table name, or remove the existing location first.
org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already exists. Please pick a different table name, or remove the existing location first.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
...
{noformat}
One must manually remove the directory {{spark-warehouse/parquet_ds1}} before the {{create table}} command will succeed.

It seems that datasource table creation runs the data-creation job first, then stores the metadata in the metastore.
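As a toy illustration (plain Python, not Spark internals — the function names below are hypothetical stand-ins), the difference between writing the data before registering metadata and the reverse ordering can be sketched like this: when the metadata step rejects the schema, the data-first ordering leaves an orphaned directory that blocks a retry.

```python
import os
import shutil
import tempfile

def register_metadata(columns):
    """Stand-in for the metastore call, which rejects column names with commas."""
    for col in columns:
        if "," in col:
            raise ValueError(
                "Cannot create a table having a column whose name contains commas")

def write_data(location):
    """Stand-in for the data-creation job: writes files under the table location."""
    os.makedirs(location)
    with open(os.path.join(location, "part-00000"), "w") as f:
        f.write("data")

def ctas_data_first(location, columns):
    # datasource ('using parquet') ordering: job runs first, metadata second
    write_data(location)
    register_metadata(columns)  # fails, leaving the directory behind

def ctas_metadata_first(location, columns):
    # metadata-first ordering: schema is validated before any data is written
    register_metadata(columns)  # fails before write_data ever runs
    write_data(location)

warehouse = tempfile.mkdtemp()
bad_columns = ["id", "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)"]

loc1 = os.path.join(warehouse, "parquet_ds1")
try:
    ctas_data_first(loc1, bad_columns)
except ValueError:
    pass
ds_left_behind = os.path.exists(loc1)
print(ds_left_behind)    # True: orphaned directory blocks a later retry

loc2 = os.path.join(warehouse, "parquet_hive1")
try:
    ctas_metadata_first(loc2, bad_columns)
except ValueError:
    pass
hive_left_behind = os.path.exists(loc2)
print(hive_left_behind)  # False: a retry with a valid schema would succeed

shutil.rmtree(warehouse)
```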
When using Spark to create Hive tables, the issue does not occur:
{noformat}
drop table if exists parquet_hive1;

-- try creating table with invalid column name,
-- but use 'stored as parquet' instead of 'using'
create table parquet_hive1
stored as parquet
as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)

-- try again with valid column name. This will succeed.
create table parquet_hive1
stored as parquet
as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);
{noformat}
It seems that Hive table creation stores the metadata in the metastore first, then runs the data-creation job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)