[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419338#comment-16419338
 ] 

Zoltan Ivanfi commented on SPARK-20297:
---

Sorry, I commented on the wrong JIRA.

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: integration
>
> While trying to load some data using Spark 2.1, I realized that decimal(12,2) 
> columns stored in Parquet files written by Spark are not readable by Hive or Impala.
> Repro:
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception 
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File 
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
>  has an incompatible Parquet schema for column 
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: 
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info 
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name  data_type   comment
> c_acctbal   decimal(12,2)
> # Detailed Table Information
> Database:   tpch_nested_3000_parquet
> Owner:  mmokhtar
> CreateTime: Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:  0
> Location:   
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   true
> numFiles                1
> numRows                 0
> rawDataSize             0
> totalSize               120
> transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed: No
> Num Buckets:-1
> Bucket Columns: []
> Sort Columns:   []
> Storage Desc Params:
> serialization.format    1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}






[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419335#comment-16419335
 ] 

Zoltan Ivanfi commented on SPARK-20297:
---

Could you please clarify how those DECIMALs were written in the first place?
 * If some manual configuration was done to allow Spark to choose this 
representation, then we are fine.
 * If an upstream Spark version wrote data using this representation by 
default, that's a valid reason to feel mildly uncomfortable.
 * If a downstream Spark version wrote data using this representation by 
default, then we should open a JIRA to prevent CDH Spark from doing so until 
Hive and Impala support it.

Thanks!




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975763#comment-15975763
 ] 

Hyukjin Kwon commented on SPARK-20297:
--

Yeah, then it looks like both representations are still valid, whether newer or 
older. Spark supports both when writing, via the option I described above.

It looks like the other systems mentioned here are not able to read int-based 
decimals that comply with the Parquet specification.

If the point of leaving this open is to keep the discussion going, we could 
leave this resolved here and open a JIRA on the Parquet side instead.

I am resolving this. Please reopen it if I misunderstood.






[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-19 Thread Tim Armstrong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975559#comment-15975559
 ] 

Tim Armstrong commented on SPARK-20297:
---

The standard doesn't say that smaller decimals *have* to be stored in 
int32/int64; it is just an option for a subset of decimal types. int32 and 
int64 are valid representations for a subset of decimal types, while 
fixed_len_byte_array and binary are valid representations of any decimal type.

The int32/int64 options were present in the original version of the decimal 
spec, they just weren't widely implemented: 
https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289
 . So it's not a new/old version thing; it was just an alternative 
representation that many systems didn't implement.

I'm not really sure what my point is regarding Spark, but I wanted to leave 
this here so future readers of this JIRA don't misunderstand what the Parquet 
spec says.
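
To make that concrete, here is an illustrative sketch (not copied from the 
spec) of two spec-valid schemas for this issue's column. Both encode the same 
logical type, storing 7539.95 as the unscaled integer 753995; 6 bytes is the 
minimum fixed width that can hold 12 decimal digits with a sign:
{code}
optional int64 c_acctbal (DECIMAL(12,2));
optional fixed_len_byte_array(6) c_acctbal (DECIMAL(12,2));
{code}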




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273
 ] 

Hyukjin Kwon commented on SPARK-20297:
--

Let me leave some pointers to related PRs - 
https://github.com/apache/spark/pull/8566 and 
https://github.com/apache/spark/pull/8566. cc [~liancheng]




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965268#comment-15965268
 ] 

Hyukjin Kwon commented on SPARK-20297:
--

Oh wait, I am sorry. It does follow the newer standard - 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal - 
I missed that documentation.
Wouldn't it then be a bug in Impala or Hive?




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965265#comment-15965265
 ] 

Hyukjin Kwon commented on SPARK-20297:
--

Thank you so much for trying it out, [~mmokhtar]. Do you think this JIRA is 
resolvable?

To my knowledge, this option makes Spark follow Parquet's specification rather 
than Spark's current default layout. So, if other implementations follow 
Parquet's specification, I guess this is the correct option for compatibility.




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-11 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965255#comment-15965255
 ] 

Mostafa Mokhtar commented on SPARK-20297:
-

[~hyukjin.kwon]
Data written by Spark is readable by Hive and Impala when 
spark.sql.parquet.writeLegacyFormat is enabled. 
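
A minimal sketch of that workaround, assuming a Spark SQL session (e.g. the 
spark-sql shell) and the repro table from this issue:
{code}
-- spark.sql.parquet.writeLegacyFormat defaults to false in Spark 2.1.
SET spark.sql.parquet.writeLegacyFormat=true;

-- Rows written from this point on store DECIMAL(12,2) in the legacy
-- fixed_len_byte_array layout instead of int64, which Hive and Impala can read.
INSERT INTO customer_acctbal VALUES (7539.95);
{code}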




[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2017-04-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965245#comment-15965245
 ] 

Hyukjin Kwon commented on SPARK-20297:
--

For me, this sounds related to {{spark.sql.parquet.writeLegacyFormat}}. I 
haven't tested and double-checked it myself, but I assume Hive expects the 
decimal as fixed-length bytes, while Spark actually writes it out as INT32 for 
1 <= precision <= 9 and INT64 for 10 <= precision <= 18.

Do you mind trying it out with {{spark.sql.parquet.writeLegacyFormat}} 
enabled?
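
As an illustrative sketch of that mapping (the table and column names are made 
up; the precision-above-18 case is added from the Parquet decimal rules):
{code}
-- With spark.sql.parquet.writeLegacyFormat=false (the Spark 2.1 default),
-- each column below should land in a different physical Parquet type:
CREATE TABLE decimal_layouts (
  small_dec DECIMAL(9,2),   -- precision  1..9  -> int32
  mid_dec   DECIMAL(12,2),  -- precision 10..18 -> int64 (this issue's column)
  big_dec   DECIMAL(20,2)   -- precision  > 18  -> fixed_len_byte_array
) STORED AS PARQUET;
{code}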
