Hi Bryan, I'm planning to add these generated Parquet files to an Impala S3 table. I noticed that Impala-written Parquet files contain only one row group; that's why I'm trying to write one row group per file.
However, I first tried creating small Parquet files (Snappy compressed) and then merging them with a MergeRecord processor using a ParquetRecordSetWriter whose row group size is set to 256 MB, so that each output file contains one row group. The configuration I used:

1. Merge Strategy: Bin-Packing Algorithm
2. Minimum Number of Records: 1
3. Maximum Number of Records: 2500000 (2.5 million)
4. Minimum Bin Size: 230 MB
5. Maximum Bin Size: 256 MB
6. Max Bin Age: 20 minutes

Note that the small Parquet files mentioned above usually contain 200,000 records each and are about 21-22 MB in size, so roughly 12 of them should be merged to produce one output file. But when I run the processor, it always merges 19 files and generates files of 415-417 MB. I'm using NiFi 1.13.1.

Could you please let me know how to resolve this issue?

Thanks & Regards

*Vibhath Ileperuma*

On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <[email protected]> wrote:

> Hello,
>
> What would the reason be to need only one row group per file? Parquet
> files by design can have many row groups.
>
> The ParquetRecordSetWriter won't be able to do this since it is just
> given an output stream to write all the records to, which happens to
> be the output stream for one flow file.
>
> -Bryan
>
> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
> <[email protected]> wrote:
> >
> > Hi all,
> >
> > I'm developing a NiFi flow to convert a set of CSV data to Parquet
> > format and upload it to an S3 bucket. I use a 'ConvertRecord' processor
> > with a CSV reader and a Parquet record set writer to convert the data, and a
> > 'PutS3Object' processor to send it to the S3 bucket.
> >
> > When converting, I need to make sure the Parquet row group size is 256
> > MB and each Parquet file contains only one row group.
> > Even though it is
> > possible to set the row group size in the ParquetRecordSetWriter, I couldn't
> > find a way to make sure each Parquet file contains only one row group (if a
> > CSV file contains more data than needed for a 256 MB row group, multiple
> > Parquet files should be generated).
> >
> > I would be grateful if you could suggest a way to do this.
> >
> > Thanks & Regards
> >
> > Vibhath Ileperuma
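For reference, the expectation in the thread above (about 12 input files per merged output, versus the 19 actually merged) follows from simple bin-size arithmetic; a minimal sketch, using the approximate figures given in the thread:

```python
# Back-of-the-envelope check of the MergeRecord bin sizing described above.
# Assumed values are taken from the thread: input files of ~21.5 MB each
# (the stated 21-22 MB range) and a 256 MB Maximum Bin Size.
input_file_mb = 21.5
max_bin_mb = 256

# How many whole input files fit under the 256 MB maximum bin size:
expected_files = int(max_bin_mb // input_file_mb)
print(expected_files)  # -> 11, i.e. close to the ~12 files expected in the thread

# What 19 merged files add up to, consistent with the observed 415-417 MB output:
print(19 * input_file_mb)  # -> 408.5
```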
