[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays
[ https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677297#comment-17677297 ]

Antoine Pitrou commented on PARQUET-152:
----------------------------------------

It would be nice if the encodings spec were updated as well, because it still says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY columns, not FIXED_LEN_BYTE_ARRAY. See PARQUET-2231.

> Encoding issue with fixed length byte arrays
> ---------------------------------------------
>
>                 Key: PARQUET-152
>                 URL: https://issues.apache.org/jira/browse/PARQUET-152
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Nezih Yigitbasi
>            Assignee: Sergio Peña
>            Priority: Minor
>             Fix For: 1.8.0
>
> While running some tests against the master branch I hit an encoding issue
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is
> > dictionaryPageSize (in my test it was 512), the encoding falls back to
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore:
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is
> only supported for type BINARY
>     at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>     at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>     at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>     at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>     at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>     at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>     at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>     at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>     at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>     at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>     at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>     at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>     at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>     at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>     at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>     at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>     at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>     at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>     ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore:
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw,
> 1B comp}
> {noformat}
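For readers trying to reproduce this, a minimal write-side sketch follows. It is an illustration, not code from the report: the file path and class name are made up, the modern org.apache.parquet package names are used (the report predates the rename from the parquet.* packages), and PARQUET_2_0 pages are assumed since the stack trace goes through readPageV2. With 1024-byte fixed values and a 512-byte dictionary page cap, the writer should fall back to DELTA_BYTE_ARRAY as described above.

{code}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class FlbaFallbackRepro {
  public static void main(String[] args) throws Exception {
    // 1024-byte fixed values with the dictionary page capped at 512B: the
    // dictionary cannot hold even one entry, so the v2 writer falls back
    // to DELTA_BYTE_ARRAY for this FIXED_LEN_BYTE_ARRAY column.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required fixed_len_byte_array(1024) flba_field; }");
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/flba.parquet"))
            .withType(schema)
            .withDictionaryPageSize(512)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
      byte[] value = new byte[1024];
      for (int i = 0; i < 5000; i++) {
        Group group = factory.newGroup();
        group.add("flba_field", Binary.fromConstantByteArray(value));
        writer.write(group);
      }
    }
    // Reading this file back with a pre-fix reader fails with the
    // ParquetDecodingException shown in the stack trace above.
  }
}
{code}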
[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays
[ https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596763#comment-14596763 ]

Sergio Peña commented on PARQUET-152:
-------------------------------------

A fixed length byte array is still written as a BINARY, isn't it? If so, then I think we should allow FIXED_LEN_BYTE_ARRAY to be decoded by Encoding.DELTA_BYTE_ARRAY. I tested something like this, and it worked:

{code}
DELTA_BYTE_ARRAY {
  @Override
  public ValuesReader getValuesReader(ColumnDescriptor descriptor, ValuesType valuesType) {
    if (descriptor.getType() != BINARY && descriptor.getType() != FIXED_LEN_BYTE_ARRAY) {
      throw new ParquetDecodingException("Encoding DELTA_BYTE_ARRAY is only supported for type BINARY and FIXED_LEN_BYTE_ARRAY");
    }
    return new DeltaByteArrayReader();
  }
},
{code}

I'll create a PR and run more tests to check this scenario.
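Not part of Sergio's patch, but a small JUnit-style check of the intended behavior could look like the sketch below. The 16-byte type length and column path are arbitrary choices for illustration, and modern org.apache.parquet package names are assumed.

{code}
import static org.junit.Assert.assertTrue;

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.ValuesType;
import org.apache.parquet.column.values.ValuesReader;
import org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.junit.Test;

public class TestDeltaByteArrayForFlba {
  @Test
  public void flbaColumnGetsDeltaByteArrayReader() {
    // A FIXED_LEN_BYTE_ARRAY(16) leaf column with no repetition/definition levels.
    ColumnDescriptor flba = new ColumnDescriptor(
        new String[] {"flba_field"},
        PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY, 16, 0, 0);
    // Before the fix this call threw ParquetDecodingException
    // ("Encoding DELTA_BYTE_ARRAY is only supported for type BINARY").
    ValuesReader reader =
        Encoding.DELTA_BYTE_ARRAY.getValuesReader(flba, ValuesType.VALUES);
    assertTrue(reader instanceof DeltaByteArrayReader);
  }
}
{code}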
[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays
[ https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592474#comment-14592474 ]

Ryan Blue commented on PARQUET-152:
-----------------------------------

I think the RLE_DICTIONARY behavior is probably because the dictionary page is written with PLAIN encoding rather than delta byte array, so the read path never hits the BINARY-only check.
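One way to see which encodings were actually used is to dump the footer metadata of a file written with values small enough for the dictionary path; in that case the column chunk lists both RLE_DICTIONARY (data pages) and PLAIN (the dictionary page itself), matching the second log line in the report. A hedged sketch, reusing the hypothetical file path from the repro above:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class PrintColumnEncodings {
  public static void main(String[] args) throws Exception {
    try (ParquetFileReader reader =
        ParquetFileReader.open(new Configuration(), new Path("/tmp/flba.parquet"))) {
      // Walk the first row group's column chunks and print the encoding set
      // recorded in the footer, e.g. "[flba_field] -> [PLAIN, RLE_DICTIONARY]".
      for (ColumnChunkMetaData column :
          reader.getFooter().getBlocks().get(0).getColumns()) {
        System.out.println(column.getPath() + " -> " + column.getEncodings());
      }
    }
  }
}
{code}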