Tbh mate, I reckon it would be quicker if you progress your PR.

Cheers,
David

________________________________
From: Bashir Sadjad <[email protected]>
Sent: 12 March 2021 16:29
To: [email protected] <[email protected]>
Subject: Re: Setting rowGroupSize in ParquetIO

Thanks David. Yes, I looked at passing it through the HadoopConfiguration, but 
it seems the row-group size is not picked up from there; ParquetWriter.Builder 
appears to set it directly from its own rowGroupSize field. I filed 
BEAM-11969<https://issues.apache.org/jira/browse/BEAM-11969> for this, so if you 
can contribute your patch plumbing this through, that would be great. Otherwise, 
I can send a PR.

Regards

-B

On Fri, Mar 12, 2021 at 8:25 AM David Hollands 
<[email protected]<mailto:[email protected]>> wrote:

Hi Bashir,



I think it is just a case of somebody bothering to plumb it in explicitly, 
e.g.



/** Specifies row group size. By default, DEFAULT_BLOCK_SIZE. */
public Sink withRowGroupSize(int rowGroupSize) {
  return toBuilder().setRowGroupSize(rowGroupSize).build();
}



and



this.writer =
    AvroParquetWriter.<GenericRecord>builder(beamParquetOutputFile)
        .withRowGroupSize(getRowGroupSize()) // Ze patch to set RowGroupSize
        .withSchema(schema)
        .withCompressionCodec(getCompressionCodec())
        .withWriteMode(OVERWRITE)
        .build();



Etc.



However, it might be worth exploring whether it can be set via the 
HadoopConfiguration “parquet.block.size” property, but I’m not sure that it 
actually can.
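For reference, a minimal sketch of what that experiment might look like with the raw parquet-mr writer (assumptions: parquet-avro and hadoop-common on the classpath; `outputFile` and `schema` are hypothetical placeholders):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch only: try to set the row-group size through a Hadoop Configuration.
Configuration conf = new Configuration();
conf.setInt("parquet.block.size", 64 * 1024 * 1024); // key used by ParquetOutputFormat

ParquetWriter<GenericRecord> writer =
    AvroParquetWriter.<GenericRecord>builder(outputFile) // outputFile: placeholder Path
        .withConf(conf)     // conf is passed through to the builder...
        .withSchema(schema) // schema: placeholder Avro schema
        .build();
// ...but ParquetWriter.Builder keeps its own rowGroupSize field (default
// 128MB) and does not obviously read parquet.block.size from the conf,
// which matches what Bashir reported above.
```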



We patched in something explicitly last year but didn’t contribute upstream as 
there was quite a bit of activity on the ParquetIO (e.g. conversion to SDF) at 
the time.



The use case we had at the time was that some downstream consumers of the 
parquet output (AWS S3 Select) couldn’t handle row group sizes > 64MB 
uncompressed. I’m sure there are other use cases out there that need this 
fine-grained control.



Cheers, David



David Hollands

BBC Broadcast Centre, London, W12

Email: [email protected]<mailto:[email protected]>





From: Bashir Sadjad <[email protected]>
Reply to: "[email protected]" <[email protected]>
Date: Friday, 12 March 2021 at 07:58
To: "[email protected]" <[email protected]>
Subject: Setting rowGroupSize in ParquetIO



Hi all,



I wonder how I can set the row group size for files generated by 
ParquetIO.Sink<https://beam.apache.org/releases/javadoc/2.20.0/org/apache/beam/sdk/io/parquet/ParquetIO.Sink.html>.
 It doesn't seem to provide the option for setting that and IIUC from the 
code<https://github.com/apache/beam/blob/fffb85a35df6ae3bdb2934c077856f6b27559aa7/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L1117>,
 it uses the default value in 
ParquetWriter.Builder<https://github.com/apache/parquet-mr/blob/bdf935a43bd377c8052840a4328cf5b7603aa70a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L636>
 which is 128MB. Is there any reason not to expose this parameter in ParquetIO?
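If the parameter were exposed, usage might look something like the sketch below (hypothetical: `withRowGroupSize` does not exist on ParquetIO.Sink today; the output path is a placeholder):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical: how a pipeline could set the row-group size if the
// setter were plumbed through (withRowGroupSize is NOT in upstream Beam).
FileIO.Write<Void, GenericRecord> write =
    FileIO.<GenericRecord>write()
        .via(
            ParquetIO.sink(schema) // schema: placeholder Avro Schema
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withRowGroupSize(64 * 1024 * 1024)) // hypothetical setter
        .to("/tmp/output"); // placeholder path
```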



Thanks



-B

