[ https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-2027.
---------------------------------------
    Fix Version/s: 1.12.1
       Resolution: Fixed

> Merging parquet files created in 1.11.1 not possible using 1.12.0
> ------------------------------------------------------------------
>
>                 Key: PARQUET-2027
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2027
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>   Affects Versions: 1.12.0
>            Reporter: Matthew M
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.12.1
>
>
> I have parquet files created using 1.11.1. In my process I join two files
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
>     new Path(outputPath.toString),
>     new Configuration()
>   ), outputSchema, Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and then try to append one of the input files to it:
> {code:scala}
> hadoopWriter.appendFile(
>   HadoopInputFile.fromPath(new Path(file), new Configuration()))
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with
> this error:
> {code:scala}
> STDERR: Exception in thread "main" java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>       at org.apache.parquet.format.Util.read(Util.java:365)
>       at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>       at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>       at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>       at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>       at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>       at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>       at org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>       at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>       at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>       at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>       at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>       at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>       at org.apache.parquet.format.Util.read(Util.java:362)
>       ... 14 more
> {code}



--
This message was sent by Atlassian Jira (v8.3.4#803005)