Bruce Robbins created SPARK-43149:
-------------------------------------

             Summary: When CREATE USING fails to store metadata in metastore, data gets left around
                 Key: SPARK-43149
                 URL: https://issues.apache.org/jira/browse/SPARK-43149
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Bruce Robbins


For example:
{noformat}
drop table if exists parquet_ds1;

-- try creating table with invalid column name
-- use 'using parquet' to designate the data source
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive 
metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE 
'2018-01-01' + make_dt_interval(0, id, 0, 0.000000)

-- show that table did not get created
show tables;


-- try again with valid column name
-- spark will complain that directory already exists
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);

[LOCATION_ALREADY_EXISTS] Cannot name the managed table as 
`spark_catalog`.`default`.`parquet_ds1`, as its associated location 
'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
exists. Please pick a different table name, or remove the existing location 
first.
org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name 
the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated 
location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' 
already exists. Please pick a different table name, or remove the existing 
location first.
        at 
org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
        at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
...
{noformat}
One must manually remove the directory {{spark-warehouse/parquet_ds1}} before 
the {{create table}} command will succeed.

It seems that datasource table creation (CREATE ... USING) runs the data-creation job first and only then stores the metadata in the metastore, so a metastore failure leaves the data directory behind.
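The failure mode can be illustrated with a toy model of that ordering (all names here — {{FakeMetastore}}, {{create_datasource_table}} — are hypothetical; this is not Spark's actual code):

```python
import os
import tempfile


class FakeMetastore:
    """Toy stand-in for the Hive metastore: rejects column names containing commas."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, columns):
        for col in columns:
            if "," in col:
                raise ValueError(f"column name contains commas: {col}")
        self.tables[name] = columns


def create_datasource_table(metastore, warehouse, name, columns):
    """Data-first ordering, as observed for CREATE ... USING:
    write the table location first, then register the metadata."""
    location = os.path.join(warehouse, name)
    if os.path.exists(location):
        raise RuntimeError(f"[LOCATION_ALREADY_EXISTS] {location}")
    os.makedirs(location)                  # data-creation job runs first
    metastore.create_table(name, columns)  # metadata stored second; may fail


warehouse = tempfile.mkdtemp()
ms = FakeMetastore()

# First attempt: invalid column name -> metastore rejects, but the directory remains.
try:
    create_datasource_table(ms, warehouse, "parquet_ds1",
                            ["id", "DATE '2018-01-01' + make_dt_interval(0, id)"])
except ValueError:
    pass
print(os.path.exists(os.path.join(warehouse, "parquet_ds1")))  # True: orphan directory

# Second attempt: valid columns, but the leftover directory blocks the create.
try:
    create_datasource_table(ms, warehouse, "parquet_ds1", ["id", "ts"])
except RuntimeError as e:
    print(e)
```

As in the reproduction above, the retry with valid columns fails with a location-already-exists error even though the table was never registered.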

When Spark creates Hive-format tables (CREATE ... STORED AS), the issue does not occur:
{noformat}
drop table if exists parquet_hive1;

-- try creating table with invalid column name,
-- but use 'stored as parquet' instead of 'using'
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);

Cannot create a table having a column whose name contains commas in Hive 
metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE 
'2018-01-01' + make_dt_interval(0, id, 0, 0.000000)

-- try again with valid column name. This will succeed.
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);
{noformat}

It seems that Hive table creation stores the metadata in the metastore first and then runs the data-creation job, so a metastore failure leaves no data behind.
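In the same toy style as before, a metadata-first sketch (hypothetical helper names, not Spark's actual code) shows why nothing is left behind on failure:

```python
import os
import tempfile


def create_hive_table(tables, warehouse, name, columns):
    """Metadata-first ordering, as observed for CREATE ... STORED AS:
    validate and register the metadata before writing any data."""
    for col in columns:
        if "," in col:  # toy version of the metastore's comma check
            raise ValueError(f"column name contains commas: {col}")
    tables[name] = columns                      # metadata stored first
    os.makedirs(os.path.join(warehouse, name))  # data-creation job runs second


warehouse = tempfile.mkdtemp()
tables = {}

# First attempt fails in the metastore step, before any directory is created.
try:
    create_hive_table(tables, warehouse, "parquet_hive1",
                      ["id", "DATE '2018-01-01' + make_dt_interval(0, id)"])
except ValueError:
    pass
print(os.path.exists(os.path.join(warehouse, "parquet_hive1")))  # False: no orphan

# The retry with a valid column name succeeds with no manual cleanup.
create_hive_table(tables, warehouse, "parquet_hive1", ["id", "ts"])
print("parquet_hive1" in tables)  # True
```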



