xsys created HIVE-26533:
---------------------------

             Summary: Column data type is lost when an Avro table with a BYTE column is written through spark-sql
                 Key: HIVE-26533
                 URL: https://issues.apache.org/jira/browse/HIVE-26533
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 3.1.2
            Reporter: xsys
h3. Describe the bug

We are trying to store a table through the {{spark-sql}} interface with the {{Avro}} file format. The table's schema contains a column with the {{BYTE}} data type, and the column's name contains an uppercase letter. When we {{INSERT}} a valid value (e.g. {{-128}}), we see the message below:
{code:java}
WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.{code}
Finally, when we perform a {{DESC}} on the table, we observe that the {{BYTE}} data type has been converted to {{int}} and the case of the column name has been lost (it is converted to lowercase).

h3. Steps to reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), start {{spark-sql}} with the Avro package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.359 seconds
spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
Time taken: 1.605 seconds
spark-sql> desc hive_tinyint_avro;
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
c0	int
c1	int	// Data type and case sensitivity lost
Time taken: 0.068 seconds, Fetched 2 row(s){code}

h3. Expected behavior

We expect both the data type and the case of the column name to be preserved. Other formats, such as Parquet and ORC, behave consistently with this expectation.
Here are the logs from our attempt at doing the same with Parquet:
{noformat}
spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
Time taken: 0.134 seconds
spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
Time taken: 0.995 seconds
spark-sql> desc hive_tinyint_parquet;
c0	int
C1	tinyint	// Data type and case sensitivity preserved
Time taken: 0.092 seconds, Fetched 2 row(s){noformat}

h3. Root Cause

[TypeInfoToSchema|https://github.com/apache/hive/blob/8190d2be7b7165effa62bd21b7d60ef81fb0e4af/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L41]'s [createAvroPrimitive|https://github.com/apache/hive/blob/rel/release-3.1.2/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L124-L132] is where Hive's BYTE, SHORT, and INT are all converted to Avro's INT:
{code:java}
case BYTE:
  schema = Schema.create(Schema.Type.INT);
  break;
case SHORT:
  schema = Schema.create(Schema.Type.INT);
  break;
case INT:
  schema = Schema.create(Schema.Type.INT);
  break;
{code}
Once the table schema has been converted to an Avro schema, the original Hive schema specified by the user is no longer recoverable: after TINYINT/BYTE is mapped to Avro's INT, the distinction is lost inside the AvroSerDe instance.
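The loss can be demonstrated outside Spark by driving {{TypeInfoToSchema}} directly. Below is a minimal sketch written against the Hive 3.1.2 serde API as we read it (the record name and namespace passed to {{convert}} are hypothetical placeholders). Both columns come back as nullable Avro {{int}} fields, so the TINYINT declaration for {{C1}} is already gone before any data is written:
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

public class TinyintSchemaLossDemo {
  public static void main(String[] args) {
    // Hive schema equivalent to: (c0 INT, C1 BYTE)
    List<String> columnNames = Arrays.asList("c0", "C1");
    List<TypeInfo> columnTypes = Arrays.<TypeInfo>asList(
        TypeInfoFactory.intTypeInfo,
        TypeInfoFactory.byteTypeInfo);
    List<String> columnComments = Arrays.asList("", "");

    // The same conversion AvroSerDe performs when it derives an Avro
    // schema from the table's columns (Hive 3.1.2 signature).
    Schema avroSchema = new TypeInfoToSchema().convert(
        columnNames, columnTypes, columnComments,
        "com.example",        // namespace (hypothetical)
        "hive_tinyint_avro",  // record name (hypothetical)
        null);                // doc

    // Prints both fields as ["null","int"] unions: the Avro schema
    // carries no trace that C1 was declared as a Hive TINYINT.
    System.out.println(avroSchema.toString(true));
  }
}
{code}
A possible direction for a fix (not implemented here) would be to record the original Hive type on the generated schema, e.g. via {{Schema.addProp}}, so that the read path could map the Avro INT back to TINYINT/SMALLINT instead of collapsing all three Hive types into INT.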