[ https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-2027.
---------------------------------------
    Fix Version/s: 1.12.1
       Resolution: Fixed

> Merging parquet files created in 1.11.1 not possible using 1.12.0 
> ------------------------------------------------------------------
>
>                 Key: PARQUET-2027
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2027
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Matthew M
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.12.1
>
>
> I have parquet files created using 1.11.1. In the process, I join two files 
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
>     new Path(outputPath.toString),
>     new Configuration()
>   ),
>   outputSchema,
>   Mode.OVERWRITE,
>   8 * 1024 * 1024, // row-group (block) size in bytes
>   2097152,         // max padding size in bytes
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
>     hadoopWriter.start()
> {code}
> and try to append one file into another:
> {code:scala}
> hadoopWriter.appendFile(
>   HadoopInputFile.fromPath(new Path(file), new Configuration())
> )
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with 
> this error:
> {code}
> STDERR: Exception in thread "main" java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.Util.read(Util.java:365)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>  at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>  at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>  at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>  at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>  at org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>  at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>  at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>  at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>  at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>  at org.apache.parquet.format.Util.read(Util.java:362)
>  ... 14 more
> {code}
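For reference, the merge flow quoted above can be pulled together into one self-contained sketch, including the `end()` call that writes the footer (which the quoted snippets omit). This is a minimal sketch assuming the parquet-mr 1.11.x API: the file names and the schema string are hypothetical, and the `DEFAULT_*` constants are assumed to come from `org.apache.parquet.column.ParquetProperties`.

{code:scala}
import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.ParquetProperties._
import org.apache.parquet.hadoop.ParquetFileWriter
import org.apache.parquet.hadoop.ParquetFileWriter.Mode
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageTypeParser

val conf = new Configuration()

// Hypothetical inputs: all files must share the same schema.
val inputFiles = Seq("part-0.parquet", "part-1.parquet")
val outputSchema = MessageTypeParser.parseMessageType(
  "message example { required int64 id; }")

val writer = new ParquetFileWriter(
  HadoopOutputFile.fromPath(new Path("merged.parquet"), conf),
  outputSchema,
  Mode.OVERWRITE,
  8 * 1024 * 1024, // row-group (block) size in bytes
  2097152,         // max padding size in bytes
  DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
  DEFAULT_STATISTICS_TRUNCATE_LENGTH,
  DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
)

writer.start()
// appendFile copies the row groups of each input without re-encoding pages.
inputFiles.foreach { f =>
  writer.appendFile(HadoopInputFile.fromPath(new Path(f), conf))
}
// end() writes the combined footer; without it the output file is unreadable.
writer.end(Collections.emptyMap[String, String]())
{code}

On 1.12.0 the `appendFile` call is where the reported `PageHeader` read fails; per the resolution above, the fix lands in 1.12.1.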



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
