Hi Bryan,

I'm planning to add these generated Parquet files to an Impala S3 table.
I noticed that Parquet files written by Impala contain only one row group.
That's why I'm trying to write one row group per file.
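For context, the "one row group per file" layout amounts to chunking the record set so that each chunk becomes exactly one output file. A minimal sketch of that chunking step, assuming records are available as an in-memory list (the pyarrow call mentioned in the comment is one common way to write each chunk, and is an assumption here, not something NiFi's ParquetRecordSetWriter exposes):

```python
import math

def chunk_records(records, max_rows):
    """Split a record list into chunks of at most max_rows records.

    Each chunk would then be written to its own Parquet file, e.g. with
    pyarrow: pq.write_table(table, path, row_group_size=max_rows), so
    each file holds exactly one row group.
    """
    n_chunks = math.ceil(len(records) / max_rows)
    return [records[i * max_rows:(i + 1) * max_rows] for i in range(n_chunks)]

# Example: 2.5 million records with a 1-million-row row group limit
chunks = chunk_records(list(range(2_500_000)), 1_000_000)
print([len(c) for c in chunks])  # three files: 1,000,000 + 1,000,000 + 500,000
```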

However, I first tried to create small Parquet files (Snappy compressed)
and then use a MergeRecord processor with a ParquetRecordSetWriter whose
row group size is set to 256 MB, so as to generate Parquet files with a
single row group. These are the configurations I used:

   1. Merge Strategy: Bin-Packing Algorithm
   2. Minimum Number of Records: 1
   3. Maximum Number of Records: 2,500,000 (2.5 million)
   4. Minimum Bin Size: 230 MB
   5. Maximum Bin Size: 256 MB
   6. Max Bin Age: 20 minutes

Note that the small Parquet files mentioned above usually contain about
200,000 records each and are around 21-22 MB in size. Hence, roughly 12
files should be merged to generate one output file.

But when I run the processor, it always merges 19 files and generates
files of 415-417 MB.
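For what it's worth, the arithmetic behind the expectation and the observation can be checked quickly (the ~21-22 MB file size is the approximate value from above, so the counts are estimates):

```python
import math

avg_file_mb = 21.5                  # observed size of each small Parquet file (21-22 MB)
min_bin_mb, max_bin_mb = 230, 256   # Minimum / Maximum Bin Size settings

# Files needed to cross the 230 MB minimum: 11-12 files (~236-258 MB),
# consistent with the "about 12 files" expectation above.
expected_files = math.ceil(min_bin_mb / avg_file_mb)
print(expected_files)                # 11
print(expected_files * avg_file_mb)  # 236.5 MB, within the 256 MB maximum

# The observed merges instead match 19 files of ~22 MB,
# well past the configured maximum:
print(19 * 22)  # 418 MB, in line with the reported 415-417 MB output
```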

I'm using NiFi 1.13.1. Could you please let me know how to resolve this
issue?

Thanks & Regards

*Vibhath Ileperuma*





On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <[email protected]> wrote:

> Hello,
>
> What would the reason be to need only one row group per file? Parquet
> files by design can have many row groups.
>
> The ParquetRecordSetWriter won't be able to do this, since it is just
> given an output stream to write all the records to, which happens to
> be the output stream for one flow file.
>
> -Bryan
>
> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
> <[email protected]> wrote:
> >
> > Hi all,
> >
> > I'm developing a NiFi flow to convert a set of CSV data to Parquet
> > format and upload it to an S3 bucket. I use a 'ConvertRecord' processor
> > with a CSV reader and a Parquet record set writer to convert the data,
> > and a 'PutS3Object' processor to send it to the S3 bucket.
> >
> > When converting, I need to make sure the Parquet row group size is
> > 256 MB and that each Parquet file contains only one row group. Even
> > though it is possible to set the row group size in the
> > ParquetRecordSetWriter, I couldn't find a way to ensure each Parquet
> > file contains only one row group (if a CSV file contains more data
> > than required for a 256 MB row group, multiple Parquet files should
> > be generated).
> >
> > I would be grateful if you could suggest a way to do this.
> >
> > Thanks & Regards
> >
> > Vibhath Ileperuma
> >
> >
> >
>
