Hi Weston,

Thank you for the clarification! The 512MB default and the slightly smaller writes match what I've been seeing, and after using SetMaxRowGroupSize to lower the limit, I see the expected behavior with smaller values.
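To make sure I'm picturing the mechanism correctly, here is a toy model of size-triggered row group flushing. This is only a sketch of my mental model, not Arrow's implementation: `ToyRowGroupWriter`, `WriteRow`, `Flush`, and `CountRowGroups` are hypothetical names, the "estimate" is simply the byte length of each row, and the model assumes the buffer is released on every flush.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT Arrow's StreamWriter. All names here are hypothetical.
class ToyRowGroupWriter {
 public:
  explicit ToyRowGroupWriter(std::int64_t max_row_group_bytes)
      : max_bytes_(max_row_group_bytes) {}

  // Buffer one row, then flush the row group once the size estimate
  // reaches the configured limit.
  void WriteRow(std::string row) {
    buffered_bytes_ += static_cast<std::int64_t>(row.size());
    rows_.push_back(std::move(row));
    peak_bytes_ = std::max(peak_bytes_, buffered_bytes_);
    if (buffered_bytes_ >= max_bytes_) Flush();
  }

  // "Persist" the current row group and release its buffer.
  void Flush() {
    if (rows_.empty()) return;
    ++flushed_groups_;
    // Swapping with an empty vector releases the capacity, which models
    // the assumption that memory is freed after a flush.
    std::vector<std::string>().swap(rows_);
    buffered_bytes_ = 0;
  }

  std::int64_t buffered_bytes() const { return buffered_bytes_; }
  std::int64_t peak_bytes() const { return peak_bytes_; }
  std::int64_t flushed_groups() const { return flushed_groups_; }

 private:
  std::int64_t max_bytes_;
  std::int64_t buffered_bytes_ = 0;
  std::int64_t peak_bytes_ = 0;
  std::int64_t flushed_groups_ = 0;
  std::vector<std::string> rows_;
};

// Writes `num_rows` rows of `row_bytes` bytes each and returns how many
// row groups were flushed, including the final partial one.
inline std::int64_t CountRowGroups(std::int64_t max_bytes, int num_rows,
                                   std::size_t row_bytes) {
  ToyRowGroupWriter writer(max_bytes);
  for (int i = 0; i < num_rows; ++i) {
    writer.WriteRow(std::string(row_bytes, 'x'));
  }
  writer.Flush();  // flush whatever is left in the last group
  return writer.flushed_groups();
}
```

In this model, a smaller limit means more (and smaller) flushes and a lower peak buffer size, which is the trade-off I'm asking about below.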
In terms of the implications of setting a smaller value for MaxRowGroupSize, is it mainly the increased number of syscalls required to persist to disk, or are there other side effects? I am particularly interested in keeping my memory usage down, so I'm trying to get a better sense of the memory "landscape" that parquet utilizes. Once a row group is persisted to disk, the memory it previously occupied should be freed for reuse, right?

Thank You,
Arun Joseph

On Fri, Aug 26, 2022 at 9:36 AM Weston Pace <[email protected]> wrote:

> The constant DEFAULT_MAX_ROW_GROUP_LENGTH is for
> parquet::WriterProperties::max_row_group_length and the unit here is #
> of rows.  This is used by parquet::ParquetFileWriter.  The
> parquet::StreamWriter class wraps an instance of a file writer and
> adds the property MaxRowGroupSize.  This units for MaxRowGroupSize is
> indeed bytes.
>
> The max_row_group_length property is only applied when calling
> ParquetFileWriter::WriteTable.  The stream writer operates at a lower
> level and never calls this method.  So the stream writer should never
> be affected by the max_row_group_length property.
>
> One thing to keep in mind is that MaxRowGroupSize is an estimate only.
> With certain encodings it can be rather difficult to know ahead of
> time how many bytes you will end up writing unless you separate the
> encoding step from the write step (which would require an extra memcpy
> I think).  In practice I think the estimators are conservative so you
> will usually end up with something slightly smaller than 512MB.  If it
> is significantly smaller you may need to investigate how effective
> your encodings are and see if that is the cause.
>
> On Fri, Aug 26, 2022 at 4:51 AM Arun Joseph <[email protected]> wrote:
> >
> > Hi all,
> >
> > My understanding of the StreamWriter class is that it would persist Row
> > Groups to disk once they exceed a certain size. In the documentation, it
> > seems like this size is 512MB, but if I look at
> > arrow/include/parquet/properties.h, the DEFAULT_MAX_ROW_GROUP_LENGTH seems
> > to be 64MB. Is this reset to 512MB elsewhere? My parquet version is
> >
> > #define CREATED_BY_VERSION "parquet-cpp-arrow version 9.0.0-SNAPSHOT
> >
> > Thank You,
> > Arun Joseph
>

--
Arun Joseph
