[jira] [Resolved] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2017-07-03 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec resolved DRILL-4519.
-
Resolution: Fixed

Probably fixed as part of some other fix. Haven't observed it since 1.6.0...

> File system directory-based partition pruning doesn't work correctly with 
> parquet metadata
> --
>
> Key: DRILL-4519
> URL: https://issues.apache.org/jira/browse/DRILL-4519
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Miroslav Holubec
>
> We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
> Without Drill's parquet metadata, directory pruning works seamlessly.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
> {noformat}
> After creating metadata and executing the same query, dir0 contains the HH 
> folder name instead of the yearly folder name. dir1...3 are null.
> {noformat}
> refresh table metadata hdfs.test.indexed;
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
> {noformat}





[jira] [Updated] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-03 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4464:

Affects Version/s: 1.4.0
  Description: 
When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
which contains a REQUIRED INT64 field, I'm not able to read this column in Drill, 
but I'm able to read the full content using parquet-tools cat/dump. This doesn't 
happen every time; it is input-data dependent (so probably a different encoding 
is chosen by parquet for the given column?).
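For reference, a minimal standalone sketch of a writer that produces this kind 
of file (an assumption for illustration, not the reporter's actual MapReduce 
job: it uses parquet-mr's example Group API with one of the deprecated 
ParquetWriter constructors from 1.8.1; the class name WriteNat, the trimmed 
two-field schema, and the generated values are made up):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteNat {
  public static void main(String[] args) throws Exception {
    // Trimmed-down version of the "nat" schema from the error log below.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message nat { required int64 ts; required int32 dr; }");
    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);
    ParquetWriter<Group> writer = new ParquetWriter<>(
        new Path("/tmp/tmp.gz.parquet"), new GroupWriteSupport(),
        CompressionCodecName.GZIP,   // GZIP, as in the column metadata below
        ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE,
        ParquetWriter.DEFAULT_PAGE_SIZE,
        true,   // dictionary encoding on, so PLAIN_DICTIONARY pages can appear
        false,  // no write-time validation
        conf);
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    for (long i = 0; i < 1000000; i++) {
      writer.write(factory.newGroup().append("ts", i).append("dr", (int) (i % 7)));
    }
    writer.close();
  }
}
{noformat}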

Error reported by drill:
{noformat}
2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on drssc9a4:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
IllegalArgumentException: Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on drssc9a4:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
 ~[drill-common-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.4.0.jar:1.4.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_40]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
parquet record reader.
Message:
Hadoop path: /tmp/tmp.gz.parquet
Total records read: 131070
Mock records read: 0
Records to read: 21845
Row group index: 0
Records in row group: 2418197
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
  required int64 ts;
  required int32 dr;
  optional binary ui (UTF8);
  optional int32 up;
  optional binary ri (UTF8);
  optional int32 rp;
  optional binary di (UTF8);
  optional int32 dp;
  required int32 pr;
  optional int64 ob;
  optional int64 ib;
}
, metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 [ColumnMetaData{GZIP 
[ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, ColumnMetaData{GZIP [dr] 
INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, ColumnMetaData{GZIP [ui] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2964867}, ColumnMetaData{GZIP [up] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
[RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
[PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:93)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
  

[jira] [Comment Edited] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-03 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175566#comment-15175566
 ] 

Miroslav Holubec edited comment on DRILL-4464 at 3/3/16 9:01 AM:
-

Output from parquet-mr's parquet-tools meta. The TS column is causing the issue:
{noformat}
$ java -jar c:\devel\parquet-mr\parquet-tools\target\parquet-tools-1.8.1.jar 
meta tmp.gz.parquet
file:file:/tmp/tmp.gz.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

file schema: nat

ts:  REQUIRED INT64 R:0 D:0
dr:  REQUIRED INT32 R:0 D:0
ui:  OPTIONAL BINARY O:UTF8 R:0 D:1
up:  OPTIONAL INT32 R:0 D:1
ri:  OPTIONAL BINARY O:UTF8 R:0 D:1
rp:  OPTIONAL INT32 R:0 D:1
di:  OPTIONAL BINARY O:UTF8 R:0 D:1
dp:  OPTIONAL INT32 R:0 D:1
pr:  REQUIRED INT32 R:0 D:0
ob:  OPTIONAL INT64 R:0 D:1
ib:  OPTIONAL INT64 R:0 D:1

row group 1: RC:2418197 TS:30601003 OFFSET:4

ts:   INT64 GZIP DO:0 FPO:4 SZ:2630987/19172128/7.29 VC:2418197 
ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY
dr:   INT32 GZIP DO:0 FPO:2630991 SZ:333876/1197646/3.59 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ui:   BINARY GZIP DO:0 FPO:2964867 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
up:   INT32 GZIP DO:0 FPO:2966955 SZ:4514663/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ri:   BINARY GZIP DO:0 FPO:7481618 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
rp:   INT32 GZIP DO:0 FPO:7483706 SZ:4511485/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
di:   BINARY GZIP DO:0 FPO:11995191 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
dp:   INT32 GZIP DO:0 FPO:11995247 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
pr:   INT32 GZIP DO:0 FPO:11995303 SZ:627/407/0.65 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ob:   INT64 GZIP DO:0 FPO:11995930 SZ:3597/3998/1.11 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ib:   INT64 GZIP DO:0 FPO:11999527 SZ:292939/918674/3.14 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{noformat}


was (Author: myroch):
Output from parquet-mr's parquet-tools meta. The TS column is causing the issue:
{noformat}
java -jar c:\devel\parquet-mr\parquet-tools\target\parquet-tools-1.8.1.jar meta 
tmp.gz.parquet
file:file:/C:/smaz/tmp.gz.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

file schema: nat

ts:  REQUIRED INT64 R:0 D:0
dr:  REQUIRED INT32 R:0 D:0
ui:  OPTIONAL BINARY O:UTF8 R:0 D:1
up:  OPTIONAL INT32 R:0 D:1
ri:  OPTIONAL BINARY O:UTF8 R:0 D:1
rp:  OPTIONAL INT32 R:0 D:1
di:  OPTIONAL BINARY O:UTF8 R:0 D:1
dp:  OPTIONAL INT32 R:0 D:1
pr:  REQUIRED INT32 R:0 D:0
ob:  OPTIONAL INT64 R:0 D:1
ib:  OPTIONAL INT64 R:0 D:1

row group 1: RC:2418197 TS:30601003 OFFSET:4

ts:   INT64 GZIP DO:0 FPO:4 SZ:2630987/19172128/7.29 VC:2418197 
ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY
dr:   INT32 GZIP DO:0 FPO:2630991 SZ:333876/1197646/3.59 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ui:   BINARY GZIP DO:0 FPO:2964867 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
up:   INT32 GZIP DO:0 FPO:2966955 SZ:4514663/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ri:   BINARY GZIP DO:0 FPO:7481618 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
rp:   INT32 GZIP DO:0 FPO:7483706 SZ:4511485/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
di:   BINARY GZIP DO:0 FPO:11995191 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
dp:   INT32 GZIP DO:0 FPO:11995247 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
pr:   INT32 GZIP DO:0 FPO:11995303 SZ:627/407/0.65 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ob:   INT64 GZIP DO:0 FPO:11995930 SZ:3597/3998/1.11 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ib:   INT64 GZIP DO:0 FPO:11999527 SZ:292939/918674/3.14 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{noformat}

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>

[jira] [Created] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2016-03-18 Thread Miroslav Holubec (JIRA)
Miroslav Holubec created DRILL-4519:
---

 Summary: File system directory-based partition pruning doesn't 
work correctly with parquet metadata
 Key: DRILL-4519
 URL: https://issues.apache.org/jira/browse/DRILL-4519
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.0
Reporter: Miroslav Holubec


We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and querying the root folder, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...4 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}
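For illustration, a minimal sketch of the expected dirN semantics (a 
hypothetical helper with made-up paths, not Drill's actual code): dir0..dirN 
should be the directory names of each file relative to the table root, which is 
what the non-metadata code path returns:
{noformat}
import java.nio.file.Path;
import java.nio.file.Paths;

class DirColumns {
  /** dir0..dirN = directory names of `file` relative to the table root. */
  static String[] dirValues(Path tableRoot, Path file) {
    Path rel = tableRoot.relativize(file.getParent());
    String[] dirs = new String[rel.getNameCount()];
    for (int i = 0; i < dirs.length; i++) {
      dirs[i] = rel.getName(i).toString();   // dir0 = YYYY, dir1 = MM, ...
    }
    return dirs;
  }

  public static void main(String[] args) {
    Path root = Paths.get("/data/indexed");                            // made up
    Path file = Paths.get("/data/indexed/2016/03/18/09/part-0.parquet");
    // Prints [2016, 03, 18, 09], the expected dir0..dir3 values.
    System.out.println(java.util.Arrays.toString(dirValues(root, file)));
  }
}
{noformat}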







[jira] [Updated] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2016-03-19 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4519:

Description: 
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and executing the same query, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...3 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}



  was:
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and executing the same query, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...4 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}




> File system directory-based partition pruning doesn't work correctly with 
> parquet metadata
> --
>
> Key: DRILL-4519
> URL: https://issues.apache.org/jira/browse/DRILL-4519
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Miroslav Holubec
>
> We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
> Without Drill's parquet metadata, directory pruning works seamlessly.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
> {noformat}
> After creating metadata and executing the same query, dir0 contains the HH 
> folder name instead of the yearly folder name. dir1...3 are null.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
> {noformat}





[jira] [Updated] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2016-03-19 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4519:

Description: 
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and executing the same query, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...3 are null.
{noformat}
refresh table metadata hdfs.test.indexed;
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}



  was:
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and executing the same query, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...3 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}




> File system directory-based partition pruning doesn't work correctly with 
> parquet metadata
> --
>
> Key: DRILL-4519
> URL: https://issues.apache.org/jira/browse/DRILL-4519
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Miroslav Holubec
>
> We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
> Without Drill's parquet metadata, directory pruning works seamlessly.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
> {noformat}
> After creating metadata and executing the same query, dir0 contains the HH 
> folder name instead of the yearly folder name. dir1...3 are null.
> {noformat}
> refresh table metadata hdfs.test.indexed;
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
> {noformat}





[jira] [Updated] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2016-03-20 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4519:

Description: 
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and executing the same query, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...4 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}



  was:
We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
Without Drill's parquet metadata, directory pruning works seamlessly.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
{noformat}
After creating metadata and querying the root folder, dir0 contains the HH folder 
name instead of the yearly folder name. dir1...4 are null.
{noformat}
select dir0, dir1, dir2 from hdfs.test.indexed;
dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
{noformat}




> File system directory-based partition pruning doesn't work correctly with 
> parquet metadata
> --
>
> Key: DRILL-4519
> URL: https://issues.apache.org/jira/browse/DRILL-4519
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Miroslav Holubec
>
> We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
> Without Drill's parquet metadata, directory pruning works seamlessly.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
> {noformat}
> After creating metadata and executing the same query, dir0 contains the HH 
> folder name instead of the yearly folder name. dir1...4 are null.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
> {noformat}





[jira] [Updated] (DRILL-4519) File system directory-based partition pruning doesn't work correctly with parquet metadata

2016-04-11 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4519:

Affects Version/s: 1.6.0

> File system directory-based partition pruning doesn't work correctly with 
> parquet metadata
> --
>
> Key: DRILL-4519
> URL: https://issues.apache.org/jira/browse/DRILL-4519
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Miroslav Holubec
>
> We have parquet files in folders with the following convention /YYYY/MM/DD/HH.
> Without Drill's parquet metadata, directory pruning works seamlessly.
> {noformat}
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = YYYY,  dir1 = MM, dir2 = DD, dir3 = HH
> {noformat}
> After creating metadata and executing the same query, dir0 contains the HH 
> folder name instead of the yearly folder name. dir1...3 are null.
> {noformat}
> refresh table metadata hdfs.test.indexed;
> select dir0, dir1, dir2 from hdfs.test.indexed;
> dir0 = HH,  dir1 = null, dir2 = null, dir3 = null
> {noformat}





[jira] [Updated] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-04-11 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4464:

Affects Version/s: 1.6.0

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Miroslav Holubec
> Attachments: tmp.gz.parquet
>
>
> When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
> which contains a REQUIRED INT64 field, I'm not able to read this column in 
> Drill, but I'm able to read the full content using parquet-tools cat/dump. 
> This doesn't happen every time; it is input-data dependent (so probably a 
> different encoding is chosen by parquet for the given column?).
> Error reported by drill:
> {noformat}
> 2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
> Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on drssc9a4:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException: Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on drssc9a4:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_40]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
> parquet record reader.
> Message:
> Hadoop path: /tmp/tmp.gz.parquet
> Total records read: 131070
> Mock records read: 0
> Records to read: 21845
> Row group index: 0
> Records in row group: 2418197
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
>   required int64 ts;
>   required int32 dr;
>   optional binary ui (UTF8);
>   optional int32 up;
>   optional binary ri (UTF8);
>   optional int32 rp;
>   optional binary di (UTF8);
>   optional int32 dp;
>   required int32 pr;
>   optional int64 ob;
>   optional int64 ib;
> }
> , metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 
> [ColumnMetaData{GZIP [ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, 
> ColumnMetaData{GZIP [dr] INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, 
> ColumnMetaData{GZIP [ui] BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 
> 2964867}, ColumnMetaData{GZIP [up] INT32  [PLAIN_DICTIONARY, RLE, 
> BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] BINARY  [PLAIN_DICTIONARY, 
> RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] INT32  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
> BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
> [RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
> [PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] 
> INT64  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
> ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(Ab

[jira] [Commented] (DRILL-1950) Implement filter pushdown for Parquet

2016-04-11 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234935#comment-15234935
 ] 

Miroslav Holubec commented on DRILL-1950:
-

Any update on this?

> Implement filter pushdown for Parquet
> -
>
> Key: DRILL-1950
> URL: https://issues.apache.org/jira/browse/DRILL-1950
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Jason Altekruse
>Assignee: Jacques Nadeau
>Priority: Critical
> Fix For: 1.7.0
>
> Attachments: DRILL-1950.1.patch.txt
>
>
> The parquet reader currently supports project pushdown for limiting the 
> number of columns read; however, it does not use filter pushdown to read only 
> a subset of the requested rows. Filter pushdown is particularly useful with 
> parquet files that contain statistics, most importantly min and max values on 
> pages. Evaluating predicates against these values could save some major 
> reading and decoding time.
> The largest barrier to implementing this is the current design of the reader. 
> Firstly, we currently have two separate parquet readers, one for reading flat 
> files very quickly and another for reading complex data. There are 
> enhancements we can make to the flat reader to make it support nested data 
> in a much more efficient manner. However, the speed of the flat file reader 
> currently comes from being able to make vectorized copies out of the parquet 
> file. This design is somewhat at odds with filter pushdown, as we can only 
> make useful vectorized copies if the filter matches a large run of values 
> within the file. This might not be too rare a case, assuming files are often 
> somewhat sorted on a primary field like date or a numeric key, and these are 
> often the fields used to limit the query to a subset of the data. However, 
> for cases where we are filtering out a few records here and there, we should 
> just make individual copies.
> We need to do more design work on the best way to balance performance with 
> these use cases in mind.
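As a concrete illustration of the statistics idea (a hypothetical sketch, not 
Drill's reader API): for a predicate such as col > threshold, a page whose max 
value does not exceed the threshold can be skipped without decompressing or 
decoding it:
{noformat}
// Hypothetical sketch, not Drill's actual reader API.
class PageStats {
  final long min;
  final long max;
  PageStats(long min, long max) { this.min = min; this.max = max; }
}

class PagePruning {
  /** For the predicate "col > threshold": the page cannot contain a
   *  matching value when its max is <= threshold, so skip it entirely. */
  static boolean canSkipPage(PageStats stats, long threshold) {
    return stats.max <= threshold;
  }
}
{noformat}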





[jira] [Created] (DRILL-4601) Partitioning based on the parquet statistics

2016-04-13 Thread Miroslav Holubec (JIRA)
Miroslav Holubec created DRILL-4601:
---

 Summary: Partitioning based on the parquet statistics
 Key: DRILL-4601
 URL: https://issues.apache.org/jira/browse/DRILL-4601
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning & Optimization
Reporter: Miroslav Holubec


It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea.
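As an illustration of the idea, a minimal sketch of the proposed file-level 
pruning (hypothetical names and a made-up FileStats holder, not Drill's planner 
API): for a filter of the form ts = constant, every file whose [min, max] 
statistics exclude the constant can be dropped from the scan:
{noformat}
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposed file-level pruning, not Drill's API.
class FileStats {
  final String path;
  final long tsMin;   // per-file min/max statistics for the "ts" column
  final long tsMax;
  FileStats(String path, long tsMin, long tsMax) {
    this.path = path; this.tsMin = tsMin; this.tsMax = tsMax;
  }
}

class StatsPruning {
  /** Keep only files whose statistics could contain "ts = constant". */
  static List<FileStats> prune(List<FileStats> files, long constant) {
    return files.stream()
        .filter(f -> f.tsMin <= constant && constant <= f.tsMax)
        .collect(Collectors.toList());
  }
}
{noformat}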





[jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics

2016-04-13 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4601:

Description: 
It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it is reusing 
metadata v2).

  was:
It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea.


> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it is reusing 
> metadata v2).





[jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics

2016-04-13 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4601:

Attachment: DRILL-4601.1.patch

> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it is reusing 
> metadata v2).





[jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics

2016-04-13 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4601:

Description: 
It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it changes 
metadata v2, which is not fine, and it currently supports only the equals 
operation).

  was:
It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it is reusing 
metadata v2).


> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it changes 
> metadata v2, which is not fine, and it currently supports only the equals 
> operation).





[jira] [Commented] (DRILL-4601) Partitioning based on the parquet statistics

2016-06-21 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341441#comment-15341441
 ] 

Miroslav Holubec commented on DRILL-4601:
-

Current patch on GitHub: https://github.com/myroch/drill

> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it changes 
> metadata v2, which is not fine, and it currently supports only the equals 
> operation).





[jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics

2016-06-21 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4601:

Description: 
It can really help performance to extend current partitioning idea implemented 
in DRILL- even further.
Currently partitioning is based on statistics, when min value equals to max 
value for whole file. Based on this, files are removed from scan in planning 
phase. Problem is, that it leads to many small parquet files, which is not fine 
in HDFS world. Also only few columns are partitioned.

I would like to extend this idea to use all statistics for all columns. So if 
value should equal to constant, remove all files from plan which have 
statistics off. This will really help performance for scans over many parquet 
files.

I have initial patch ready, currently just to give an idea. (it changes 
metadata v2, which is not fine).

  was:
It can really help performance to extend the current partitioning idea 
implemented in DRILL- even further.
Currently partitioning is based on statistics: a column can be used for pruning 
when its min value equals its max value for the whole file. Based on this, files 
are removed from the scan in the planning phase. The problem is that this leads 
to many small parquet files, which is not fine in the HDFS world. Also, only a 
few columns are partitioned.

I would like to extend this idea to use all statistics for all columns: if a 
value must equal a constant, remove from the plan all files whose statistics 
exclude it. This will really help performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it changes 
metadata v2, which is not fine, and it currently supports only the equals 
operation).


> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it changes 
> metadata v2, which is not fine).





[jira] [Commented] (DRILL-4601) Partitioning based on the parquet statistics

2016-06-21 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341447#comment-15341447
 ] 

Miroslav Holubec commented on DRILL-4601:
-

[~jacq...@dremio.com], [~jaltekruse], [~sphillips]: any inputs?

> Partitioning based on the parquet statistics
> 
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Miroslav Holubec
>  Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> It can really help performance to extend the current partitioning idea 
> implemented in DRILL- even further.
> Currently partitioning is based on statistics: a column can be used for 
> pruning when its min value equals its max value for the whole file. Based on 
> this, files are removed from the scan in the planning phase. The problem is 
> that this leads to many small parquet files, which is not fine in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a 
> value must equal a constant, remove from the plan all files whose statistics 
> exclude it. This will really help performance for scans over many parquet 
> files.
> I have an initial patch ready, currently just to give an idea (it changes 
> metadata v2, which is not fine).





[jira] [Created] (DRILL-4849) Refresh table metadata performance: read only new/updated parquet files

2016-08-17 Thread Miroslav Holubec (JIRA)
Miroslav Holubec created DRILL-4849:
---

 Summary: Refresh table metadata performance: read only new/updated 
parquet files
 Key: DRILL-4849
 URL: https://issues.apache.org/jira/browse/DRILL-4849
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning & Optimization
Affects Versions: 1.7.0
Reporter: Miroslav Holubec


Currently REFRESH TABLE METADATA takes a serious amount of time for many small 
parquet files. We could instead read only those parquet files which are new or 
changed. This will require adding modificationTime to the file metadata.
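A minimal sketch of the idea (hypothetical names, not Drill's actual 
metadata-cache code): with a modificationTime stored per file in the metadata 
cache, the refresh re-reads a footer only when the file is new or its 
modification time has changed:
{noformat}
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;

// Hypothetical sketch, not Drill's actual metadata-cache code.
class IncrementalRefresh {
  /** True if the file's footer must be re-read: it is new, or its
   *  modification time is newer than what the metadata cache recorded. */
  static boolean needsReread(Map<String, Long> cachedModTimes, FileStatus file) {
    Long cached = cachedModTimes.get(file.getPath().toString());
    return cached == null || file.getModificationTime() > cached;
  }
}
{noformat}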





[jira] [Commented] (DRILL-4849) Refresh table metadata performance: read only new/updated parquet files

2016-08-17 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424170#comment-15424170
 ] 

Miroslav Holubec commented on DRILL-4849:
-

patch on branch including DRILL-4601:
https://github.com/myroch/drill/commit/14c499ab74538f618952b0578a72b9f23785fe5d

m.

> Refresh table metadata performance: read only new/updated parquet files
> ---
>
> Key: DRILL-4849
> URL: https://issues.apache.org/jira/browse/DRILL-4849
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Affects Versions: 1.7.0
>Reporter: Miroslav Holubec
>  Labels: metadata, parquet, performance, planning
>
> Currently REFRESH TABLE METADATA takes a serious amount of time for many small 
> parquet files. We could instead read only those parquet files which are new or 
> changed. This will require adding modificationTime to the file metadata.





[jira] [Updated] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill

2016-03-02 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4464:

Description: 
When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
which contains a REQUIRED INT64 field, I'm not able to read this column in Drill, 
but I'm able to read the full content using parquet-tools cat/dump. This doesn't 
happen every time; it is input-data dependent (so probably a different encoding 
is chosen by parquet for the given column?).

Error reported by drill:
{noformat}
2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
drssc9a4.st.ishisystems.com:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
IllegalArgumentException: Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
drssc9a4.st.ishisystems.com:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
 ~[drill-common-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.4.0.jar:1.4.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_40]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
parquet record reader.
Message:
Hadoop path: /tmp/tmp.gz.parquet
Total records read: 131070
Mock records read: 0
Records to read: 21845
Row group index: 0
Records in row group: 2418197
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
  required int64 ts;
  required int32 dr;
  optional binary ui (UTF8);
  optional int32 up;
  optional binary ri (UTF8);
  optional int32 rp;
  optional binary di (UTF8);
  optional int32 dp;
  required int32 pr;
  optional int64 ob;
  optional int64 ib;
}
, metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 [ColumnMetaData{GZIP 
[ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, ColumnMetaData{GZIP [dr] 
INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, ColumnMetaData{GZIP [ui] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2964867}, ColumnMetaData{GZIP [up] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
[RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
[PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:93)
 ~[drill-java-exec-1.4.0.jar:1.4.

[jira] [Created] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill

2016-03-02 Thread Miroslav Holubec (JIRA)
Miroslav Holubec created DRILL-4464:
---

 Summary: Apache Drill cannot read parquet generated outside Drill
 Key: DRILL-4464
 URL: https://issues.apache.org/jira/browse/DRILL-4464
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Miroslav Holubec


When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
which contains some REQUIRED INT64 fields, I'm not able to read such a column in 
Drill, but I'm able to read the full content using parquet-tools cat/dump. This 
doesn't happen every time; it is input-data dependent (so probably a different 
encoding is chosen by parquet for the given column?).

Error reported by drill:
{noformat}
2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
drssc9a4.st.ishisystems.com:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
IllegalArgumentException: Reading past RLE/BitPacking stream.

Fragment 3:0

[Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
drssc9a4.st.ishisystems.com:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
 ~[drill-common-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
 [drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.4.0.jar:1.4.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_40]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
parquet record reader.
Message:
Hadoop path: /tmp/tmp.gz.parquet
Total records read: 131070
Mock records read: 0
Records to read: 21845
Row group index: 0
Records in row group: 2418197
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
  required int64 ts;
  required int32 dr;
  optional binary ui (UTF8);
  optional int32 up;
  optional binary ri (UTF8);
  optional int32 rp;
  optional binary di (UTF8);
  optional int32 dp;
  required int32 pr;
  optional int64 ob;
  optional int64 ib;
}
, metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 [ColumnMetaData{GZIP 
[ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, ColumnMetaData{GZIP [dr] 
INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, ColumnMetaData{GZIP [ui] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2964867}, ColumnMetaData{GZIP [up] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] 
BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] 
INT32  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
[RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
[PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] INT64  
[PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 ~[drill-java-exec-1.4.0.jar:1.4.0]
at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
~[drill-java-exec-1

[jira] [Updated] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-02 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4464:

Summary: Apache Drill cannot read parquet generated outside Drill: Reading 
past RLE/BitPacking stream  (was: Apache Drill cannot read parquet generated 
outside Drill)

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Miroslav Holubec
>
> When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
> which contains a REQUIRED INT64 field, I'm not able to read this column in 
> Drill, but I'm able to read the full content using parquet-tools cat/dump. 
> This doesn't happen every time; it is input-data dependent (so probably a 
> different encoding is chosen by parquet for the given column?).
> Error reported by drill:
> {noformat}
> 2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
> Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException: Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_40]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
> parquet record reader.
> Message:
> Hadoop path: /tmp/tmp.gz.parquet
> Total records read: 131070
> Mock records read: 0
> Records to read: 21845
> Row group index: 0
> Records in row group: 2418197
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
>   required int64 ts;
>   required int32 dr;
>   optional binary ui (UTF8);
>   optional int32 up;
>   optional binary ri (UTF8);
>   optional int32 rp;
>   optional binary di (UTF8);
>   optional int32 dp;
>   required int32 pr;
>   optional int64 ob;
>   optional int64 ib;
> }
> , metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 
> [ColumnMetaData{GZIP [ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, 
> ColumnMetaData{GZIP [dr] INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, 
> ColumnMetaData{GZIP [ui] BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 
> 2964867}, ColumnMetaData{GZIP [up] INT32  [PLAIN_DICTIONARY, RLE, 
> BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] BINARY  [PLAIN_DICTIONARY, 
> RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] INT32  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
> BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
> [RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
> [PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] 
> INT64  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
> ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBat
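
For reference, a minimal sketch of how such a file can be produced with the 
parquet-mr 1.8.x example Group API. The schema is trimmed to three of the 
columns from the ParquetMetaData quoted above; the GZIP codec and enabled 
dictionary encoding match the reported column metadata, but the writer setup 
and the generated values are assumptions, since the original MapReduce job is 
not attached:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteNat {
  public static void main(String[] args) throws Exception {
    // Schema copied (abridged) from the error's ParquetMetaData; ts is REQUIRED INT64.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message nat { required int64 ts; required int32 dr; required int32 pr; }");
    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    // GZIP + dictionary encoding, matching the column metadata in the report.
    try (ParquetWriter<Group> writer = new ParquetWriter<>(
        new Path("/tmp/tmp.gz.parquet"), new GroupWriteSupport(),
        CompressionCodecName.GZIP,
        ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE,
        ParquetWriter.DEFAULT_PAGE_SIZE, true, false,
        ParquetWriter.DEFAULT_WRITER_VERSION, conf)) {
      for (long i = 0; i < 3_000_000; i++) {
        // Many distinct ts values overflow the dictionary page so the column
        // falls back to PLAIN encoding mid row group -- the suspected trigger.
        writer.write(factory.newGroup()
            .append("ts", System.currentTimeMillis() + i)
            .append("dr", (int) (i % 100))
            .append("pr", 6));
      }
    }
  }
}
{noformat}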

[jira] [Comment Edited] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-02 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175566#comment-15175566
 ] 

Miroslav Holubec edited comment on DRILL-4464 at 3/2/16 1:14 PM:
-

Output from the parquet-tools meta command; the ts column is causing the issue:
{noformat}
java -jar c:\devel\parquet-mr\parquet-tools\target\parquet-tools-1.8.1.jar meta 
tmp.gz.parquet
file:file:/C:/smaz/tmp.gz.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

file schema: nat

ts:  REQUIRED INT64 R:0 D:0
dr:  REQUIRED INT32 R:0 D:0
ui:  OPTIONAL BINARY O:UTF8 R:0 D:1
up:  OPTIONAL INT32 R:0 D:1
ri:  OPTIONAL BINARY O:UTF8 R:0 D:1
rp:  OPTIONAL INT32 R:0 D:1
di:  OPTIONAL BINARY O:UTF8 R:0 D:1
dp:  OPTIONAL INT32 R:0 D:1
pr:  REQUIRED INT32 R:0 D:0
ob:  OPTIONAL INT64 R:0 D:1
ib:  OPTIONAL INT64 R:0 D:1

row group 1: RC:2418197 TS:30601003 OFFSET:4

ts:   INT64 GZIP DO:0 FPO:4 SZ:2630987/19172128/7.29 VC:2418197 
ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY
dr:   INT32 GZIP DO:0 FPO:2630991 SZ:333876/1197646/3.59 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ui:   BINARY GZIP DO:0 FPO:2964867 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
up:   INT32 GZIP DO:0 FPO:2966955 SZ:4514663/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ri:   BINARY GZIP DO:0 FPO:7481618 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
rp:   INT32 GZIP DO:0 FPO:7483706 SZ:4511485/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
di:   BINARY GZIP DO:0 FPO:11995191 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
dp:   INT32 GZIP DO:0 FPO:11995247 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
pr:   INT32 GZIP DO:0 FPO:11995303 SZ:627/407/0.65 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ob:   INT64 GZIP DO:0 FPO:11995930 SZ:3597/3998/1.11 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ib:   INT64 GZIP DO:0 FPO:11999527 SZ:292939/918674/3.14 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{noformat}


was (Author: myroch):
Output from the parquet-tools meta command; the ts column is causing the issue:
{noformat}
C:\smaz>java -jar 
c:\devel\parquet-mr\parquet-tools\target\parquet-tools-1.8.1.jar meta 
tmp.gz.parquet
file:file:/C:/smaz/tmp.gz.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

file schema: nat

ts:  REQUIRED INT64 R:0 D:0
dr:  REQUIRED INT32 R:0 D:0
ui:  OPTIONAL BINARY O:UTF8 R:0 D:1
up:  OPTIONAL INT32 R:0 D:1
ri:  OPTIONAL BINARY O:UTF8 R:0 D:1
rp:  OPTIONAL INT32 R:0 D:1
di:  OPTIONAL BINARY O:UTF8 R:0 D:1
dp:  OPTIONAL INT32 R:0 D:1
pr:  REQUIRED INT32 R:0 D:0
ob:  OPTIONAL INT64 R:0 D:1
ib:  OPTIONAL INT64 R:0 D:1

row group 1: RC:2418197 TS:30601003 OFFSET:4

ts:   INT64 GZIP DO:0 FPO:4 SZ:2630987/19172128/7.29 VC:2418197 
ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY
dr:   INT32 GZIP DO:0 FPO:2630991 SZ:333876/1197646/3.59 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ui:   BINARY GZIP DO:0 FPO:2964867 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
up:   INT32 GZIP DO:0 FPO:2966955 SZ:4514663/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ri:   BINARY GZIP DO:0 FPO:7481618 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
rp:   INT32 GZIP DO:0 FPO:7483706 SZ:4511485/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
di:   BINARY GZIP DO:0 FPO:11995191 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
dp:   INT32 GZIP DO:0 FPO:11995247 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
pr:   INT32 GZIP DO:0 FPO:11995303 SZ:627/407/0.65 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ob:   INT64 GZIP DO:0 FPO:11995930 SZ:3597/3998/1.11 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ib:   INT64 GZIP DO:0 FPO:11999527 SZ:292939/918674/3.14 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{noformat}

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.5.0
>

[jira] [Commented] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-02 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175566#comment-15175566
 ] 

Miroslav Holubec commented on DRILL-4464:
-

Output from the parquet-tools meta command; the ts column is causing the issue:
{noformat}
C:\smaz>java -jar 
c:\devel\parquet-mr\parquet-tools\target\parquet-tools-1.8.1.jar meta 
tmp.gz.parquet
file:file:/C:/smaz/tmp.gz.parquet
creator: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

file schema: nat

ts:  REQUIRED INT64 R:0 D:0
dr:  REQUIRED INT32 R:0 D:0
ui:  OPTIONAL BINARY O:UTF8 R:0 D:1
up:  OPTIONAL INT32 R:0 D:1
ri:  OPTIONAL BINARY O:UTF8 R:0 D:1
rp:  OPTIONAL INT32 R:0 D:1
di:  OPTIONAL BINARY O:UTF8 R:0 D:1
dp:  OPTIONAL INT32 R:0 D:1
pr:  REQUIRED INT32 R:0 D:0
ob:  OPTIONAL INT64 R:0 D:1
ib:  OPTIONAL INT64 R:0 D:1

row group 1: RC:2418197 TS:30601003 OFFSET:4

ts:   INT64 GZIP DO:0 FPO:4 SZ:2630987/19172128/7.29 VC:2418197 
ENC:BIT_PACKED,PLAIN,PLAIN_DICTIONARY
dr:   INT32 GZIP DO:0 FPO:2630991 SZ:333876/1197646/3.59 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ui:   BINARY GZIP DO:0 FPO:2964867 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
up:   INT32 GZIP DO:0 FPO:2966955 SZ:4514663/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ri:   BINARY GZIP DO:0 FPO:7481618 SZ:2088/1565/0.75 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
rp:   INT32 GZIP DO:0 FPO:7483706 SZ:4511485/4652474/1.03 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
di:   BINARY GZIP DO:0 FPO:11995191 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
dp:   INT32 GZIP DO:0 FPO:11995247 SZ:56/36/0.64 VC:2418197 
ENC:BIT_PACKED,PLAIN,RLE
pr:   INT32 GZIP DO:0 FPO:11995303 SZ:627/407/0.65 VC:2418197 
ENC:BIT_PACKED,PLAIN_DICTIONARY
ob:   INT64 GZIP DO:0 FPO:11995930 SZ:3597/3998/1.11 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
ib:   INT64 GZIP DO:0 FPO:11999527 SZ:292939/918674/3.14 VC:2418197 
ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{noformat}
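
Worth noting in the dump above: ts is the only column whose encoding list 
contains both PLAIN_DICTIONARY and PLAIN, i.e. the writer's dictionary 
overflowed and fell back to plain encoding within the row group, which is 
consistent with the reader failing on this column only. The same encoding 
lists can be read programmatically; a sketch against the parquet-mr 1.8.x 
footer API, assuming the attached file sits at /tmp/tmp.gz.parquet:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintEncodings {
  public static void main(String[] args) throws Exception {
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/tmp.gz.parquet"));
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData col : block.getColumns()) {
        // A column listing both PLAIN_DICTIONARY and PLAIN has hit dictionary fallback.
        System.out.println(col.getPath() + " " + col.getEncodings());
      }
    }
  }
}
{noformat}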

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Miroslav Holubec
>
> When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
> that contains a REQUIRED INT64 field, I'm not able to read this column in 
> Drill, but I am able to read the full content using parquet-tools cat/dump. 
> This doesn't happen every time; it is input-data dependent (so probably a 
> different encoding is chosen by parquet for the given column?).
> Error reported by drill:
> {noformat}
> 2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
> Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException: Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_40]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
> parquet record reader.
> Message:
> Hadoop path: /tmp/tmp.gz.parquet
> Total records read: 131070
> Mock records read: 0
> Records to read: 2184

[jira] [Updated] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-02 Thread Miroslav Holubec (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miroslav Holubec updated DRILL-4464:

Attachment: tmp.gz.parquet

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Miroslav Holubec
> Attachments: tmp.gz.parquet
>
>
> When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
> that contains a REQUIRED INT64 field, I'm not able to read this column in 
> Drill, but I am able to read the full content using parquet-tools cat/dump. 
> This doesn't happen every time; it is input-data dependent (so probably a 
> different encoding is chosen by parquet for the given column?).
> Error reported by drill:
> {noformat}
> 2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
> Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException: Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_40]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
> parquet record reader.
> Message:
> Hadoop path: /tmp/tmp.gz.parquet
> Total records read: 131070
> Mock records read: 0
> Records to read: 21845
> Row group index: 0
> Records in row group: 2418197
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
>   required int64 ts;
>   required int32 dr;
>   optional binary ui (UTF8);
>   optional int32 up;
>   optional binary ri (UTF8);
>   optional int32 rp;
>   optional binary di (UTF8);
>   optional int32 dp;
>   required int32 pr;
>   optional int64 ob;
>   optional int64 ib;
> }
> , metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 
> [ColumnMetaData{GZIP [ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, 
> ColumnMetaData{GZIP [dr] INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, 
> ColumnMetaData{GZIP [ui] BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 
> 2964867}, ColumnMetaData{GZIP [up] INT32  [PLAIN_DICTIONARY, RLE, 
> BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] BINARY  [PLAIN_DICTIONARY, 
> RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] INT32  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
> BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
> [RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
> [PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] 
> INT64  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
> ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.Abstra

[jira] [Commented] (DRILL-4464) Apache Drill cannot read parquet generated outside Drill: Reading past RLE/BitPacking stream

2016-03-02 Thread Miroslav Holubec (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175571#comment-15175571
 ] 

Miroslav Holubec commented on DRILL-4464:
-

Uploaded the file. The query failing on it:
select sum(ts) from tmp.gz.parquet
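
For anyone reproducing this, the same query can be issued over JDBC against a 
running drillbit; a sketch, where the dfs workspace path and the drillbit host 
are assumptions:
{noformat}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReproDrill4464 {
  public static void main(String[] args) throws Exception {
    // Assumes the attached file was copied to /tmp on a node visible to the dfs plugin.
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010");
         Statement stmt = conn.createStatement();
         // This is the query that triggers "Reading past RLE/BitPacking stream".
         ResultSet rs = stmt.executeQuery(
             "SELECT SUM(ts) FROM dfs.`/tmp/tmp.gz.parquet`")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
{noformat}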

> Apache Drill cannot read parquet generated outside Drill: Reading past 
> RLE/BitPacking stream
> 
>
> Key: DRILL-4464
> URL: https://issues.apache.org/jira/browse/DRILL-4464
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Miroslav Holubec
> Attachments: tmp.gz.parquet
>
>
> When I generate a file using MapReduce and parquet 1.8.1 (or 1.8.1-drill-r0) 
> that contains a REQUIRED INT64 field, I'm not able to read this column in 
> Drill, but I am able to read the full content using parquet-tools cat/dump. 
> This doesn't happen every time; it is input-data dependent (so probably a 
> different encoding is chosen by parquet for the given column?).
> Error reported by drill:
> {noformat}
> 2016-03-02 03:01:16,354 [29296305-abe2-f4bd-ded0-27bb53f631f0:frag:3:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalArgumentException: 
> Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException: Reading past RLE/BitPacking stream.
> Fragment 3:0
> [Error Id: e2d02152-1b67-4c9f-9cb1-bd2b9ff302d8 on 
> drssc9a4.st.ishisystems.com:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0.jar:1.4.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_40]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in 
> parquet record reader.
> Message:
> Hadoop path: /tmp/tmp.gz.parquet
> Total records read: 131070
> Mock records read: 0
> Records to read: 21845
> Row group index: 0
> Records in row group: 2418197
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message nat {
>   required int64 ts;
>   required int32 dr;
>   optional binary ui (UTF8);
>   optional int32 up;
>   optional binary ri (UTF8);
>   optional int32 rp;
>   optional binary di (UTF8);
>   optional int32 dp;
>   required int32 pr;
>   optional int64 ob;
>   optional int64 ib;
> }
> , metadata: {}}, blocks: [BlockMetaData{2418197, 30601003 
> [ColumnMetaData{GZIP [ts] INT64  [PLAIN_DICTIONARY, BIT_PACKED, PLAIN], 4}, 
> ColumnMetaData{GZIP [dr] INT32  [PLAIN_DICTIONARY, BIT_PACKED], 2630991}, 
> ColumnMetaData{GZIP [ui] BINARY  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 
> 2964867}, ColumnMetaData{GZIP [up] INT32  [PLAIN_DICTIONARY, RLE, 
> BIT_PACKED], 2966955}, ColumnMetaData{GZIP [ri] BINARY  [PLAIN_DICTIONARY, 
> RLE, BIT_PACKED], 7481618}, ColumnMetaData{GZIP [rp] INT32  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 7483706}, ColumnMetaData{GZIP [di] 
> BINARY  [RLE, BIT_PACKED, PLAIN], 11995191}, ColumnMetaData{GZIP [dp] INT32  
> [RLE, BIT_PACKED, PLAIN], 11995247}, ColumnMetaData{GZIP [pr] INT32  
> [PLAIN_DICTIONARY, BIT_PACKED], 11995303}, ColumnMetaData{GZIP [ob] INT64  
> [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11995930}, ColumnMetaData{GZIP [ib] 
> INT64  [PLAIN_DICTIONARY, RLE, BIT_PACKED], 11999527}]}]}
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:345)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:447)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:191) 
> ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0.jar:1.4.0]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.j