Robert V created DRILL-6209:
-------------------------------

             Summary: Spark-generated Parquet file reading fails when 'store.parquet.reader.int96_as_timestamp' is used
                 Key: DRILL-6209
                 URL: https://issues.apache.org/jira/browse/DRILL-6209
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.13.0
         Environment: * The Parquet files that fail to query were generated with Apache Spark 2.2.1 on AWS EMR, using the Spark SQL library.
 * Drill was set up on a Mac OS X El Capitan system running Java 8.
            Reporter: Robert V
         Attachments: error-stacktrace.txt, successful-log.txt

Querying Parquet files generated by Apache Spark 2.2.1 that contain a Timestamp column may fail when the 'store.parquet.reader.int96_as_timestamp' option is enabled.

Query that fails:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;

SELECT t.*
FROM dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet` t;
{code}


Query that succeeds:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;

SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.*
FROM dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet` t;
{code}


See the attached logs.
I'm not able to upload sample Parquet files because they contain sensitive information.
The failing Parquet files were generated by an Apache Spark job on AWS EMR; their sizes are in the range of hundreds of megabytes.
Parquet files generated by a local Spark installation worked, however. They contained only a few rows, so it is not an accurate comparison with the larger data set.
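
For anyone trying to reproduce this, here is a minimal Spark SQL sketch of how such a file can be produced (the output path, ids, and timestamp values are made up; only the 'date_time' column name matches the queries above). Spark 2.2.x writes TimestampType columns to Parquet as INT96, which is what the 'store.parquet.reader.int96_as_timestamp' option acts on:
{code:scala}
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession

object Int96TimestampRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("int96-timestamp-repro")
      .getOrCreate()
    import spark.implicits._

    // 'date_time' matches the column queried in Drill; the values are illustrative only.
    val df = Seq(
      (1L, Timestamp.valueOf("2018-02-27 10:15:30")),
      (2L, Timestamp.valueOf("2018-02-27 11:45:00"))
    ).toDF("id", "date_time")

    // Spark 2.2.x stores TimestampType as INT96 in Parquet; Snappy compression
    // matches the '.snappy.parquet' file name referenced above.
    df.write
      .option("compression", "snappy")
      .parquet("/Workspace/Data/int96-sample")

    spark.stop()
  }
}
{code}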
 
The bug is present in the current master branch (1.13.0 candidate version).
 
This issue is related to [DRILL-5097|https://issues.apache.org/jira/browse/DRILL-5097].


