[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677297#comment-17677297
 ] 

Antoine Pitrou commented on PARQUET-152:
----------------------------------------

It would be nice if the encodings spec had been updated as well: as it stands, 
it still says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY columns, 
not FIXED_LEN_BYTE_ARRAY. See PARQUET-2231.

> Encoding issue with fixed length byte arrays
> --------------------------------------------
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nezih Yigitbasi
>Assignee: Sergio Peña
>Priority: Minor
> Fix For: 1.8.0
>
>
> While running some tests against the master branch I hit an encoding issue 
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is 
> > dictionaryPageSize (in my test it was 512), the encoding falls back to 
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
> only supported for type BINARY
>   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>   at 
> parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>   at 
> parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>   at 
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>   at 
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>   at 
> parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>   at 
> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>   at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>   ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
> 1B comp}
> {noformat}
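For anyone trying to reproduce this, here is a minimal write-then-read sketch. 
It assumes the current org.apache.parquet API and the example-module helpers 
(the original report predates the rename from the parquet.* packages); the 
path, value size, and writer settings are illustrative. The stack trace above 
shows readPageV2, so V2 data pages are requested here as well.

{code}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class Parquet152Repro {
  public static void main(String[] args) throws Exception {
    // 1024-byte values exceed the 512-byte dictionary page size below, so the
    // writer abandons dictionary encoding and falls back to DELTA_BYTE_ARRAY.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { required fixed_len_byte_array(1024) flba_field; }");
    Path file = new Path("/tmp/parquet-152-repro.parquet");

    try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
        .withType(schema)
        .withDictionaryEncoding(true)
        .withDictionaryPageSize(512)                  // smaller than one value
        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 pages, as reported
        .build()) {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      byte[] value = new byte[1024]; // any fixed-size payload
      for (int i = 0; i < 5000; i++) {
        writer.write(factory.newGroup()
            .append("flba_field", Binary.fromConstantByteArray(value)));
      }
    }

    // Without the reader-side change proposed below, this read fails with
    // "Encoding DELTA_BYTE_ARRAY is only supported for type BINARY".
    try (ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), file).build()) {
      while (reader.read() != null) {
        // drain the file; the decoding exception, if any, surfaces here
      }
    }
  }
}
{code}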



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2015-06-22 Thread Sergio Peña (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596763#comment-14596763
 ] 

Sergio Peña commented on PARQUET-152:
-------------------------------------

A fixed length byte array is still written as a BINARY, isn't it? If so, then I 
think we should allow FIXED_LEN_BYTE_ARRAY to be decoded by 
Encoding.DELTA_BYTE_ARRAY. 

I tested something like this, and it worked:

{code}
DELTA_BYTE_ARRAY {
    @Override
    public ValuesReader getValuesReader(ColumnDescriptor descriptor,
                                        ValuesType valuesType) {
      // Allow FIXED_LEN_BYTE_ARRAY in addition to BINARY.
      if (descriptor.getType() != BINARY
          && descriptor.getType() != FIXED_LEN_BYTE_ARRAY) {
        throw new ParquetDecodingException(
            "Encoding DELTA_BYTE_ARRAY is only supported for type BINARY "
                + "and FIXED_LEN_BYTE_ARRAY");
      }
      return new DeltaByteArrayReader();
    }
  },
{code}
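
For what it's worth, the read side should be safe with this change: parquet-mr 
hands FIXED_LEN_BYTE_ARRAY values to converters as Binary, just as it does 
BYTE_ARRAY, and DeltaByteArrayReader already produces Binary values, so the 
type check above appears to be the only thing blocking FLBA columns.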

I'll create a PR and run more tests to cover this scenario.



[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2015-06-18 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592474#comment-14592474
 ] 

Ryan Blue commented on PARQUET-152:
-----------------------------------

I think the RLE_DICTIONARY case works because the dictionary page itself is 
PLAIN-encoded rather than delta-byte-array-encoded (note the [RLE_DICTIONARY, 
PLAIN] encodings in the log above), so the read path never hits the 
DELTA_BYTE_ARRAY type check.
