Re: Solution to read/write multiple parquet files
> Then what is the best practice for cutting these rows into
> parquet files?

This depends a bit on what you are going to do with them afterwards. Typically, RowGroups should be sized such that you can load them in bulk into memory if you do batch processing on them. If you only plan to query them, then depending on the query engine, smaller or larger RowGroups will make a performance difference. In general, it is best to check what happens with these files and then profile.

> Another question is: should we keep the same RowGroup size for one
> parquet file?

You can vary the RowGroup size inside a Parquet file if that gives you better performance. It is probably best to keep them even, so that the size of the materialized data in memory is the same for all RowGroups.

Uwe

On Tue, Apr 3, 2018, at 11:37 AM, Lizhou Gao wrote:
> Thanks for your quick reply!
> Given the below scenario: there are 200k rows of SQL data; 0-100k
> contains more nulls while 100k-200k contains more non-null values.
> If we convert the two parts into parquet files, we may get
> 0-100k.parquet (500M) and 100k-200k.parquet (1.3G). Then what is the
> best practice for cutting these rows into parquet files?
> Another question is: should we keep the same RowGroup size for one
> parquet file?
>
> Thanks,
> Lizhou
Re: Solution to read/write multiple parquet files
Thanks for your quick reply!
Given the below scenario: there are 200k rows of SQL data; 0-100k contains more nulls while 100k-200k contains more non-null values. If we convert the two parts into parquet files, we may get 0-100k.parquet (500M) and 100k-200k.parquet (1.3G). Then what is the best practice for cutting these rows into parquet files?
Another question is: should we keep the same RowGroup size for one parquet file?

Thanks,
Lizhou

-- Original --
From: "Uwe L. Korn" <uw...@xhochy.com>
Date: Tue, Apr 3, 2018 04:21 PM
To: "dev" <dev@parquet.apache.org>
Subject: Re: Solution to read/write multiple parquet files
Re: Solution to read/write multiple parquet files
Hello Lizhou,

on the Python side there is http://dask.pydata.org/en/latest/, which can read large, distributed Parquet datasets. When using `engine=pyarrow`, it also uses parquet-cpp under the hood.

On the pure C++ side, I know that https://github.com/thrill/thrill has experimental parquet support. But this is an experimental feature in an experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte sizes, and the RowGroups inside these files should be 128 MiB or less. You will be able to write tools that can deal with other sizes, but that will break the portability aspect of Parquet files a bit.

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB
> level). Currently we store it as a single parquet file. After some
> searching, it seems the data should be cut into multiple parquet files
> for further reading/writing/managing. But I don't know whether there is
> already an opensource solution to read/write/manage multiple parquet
> files. Our programming language is cpp.
> Any comments/suggestions are welcome. Thanks!
>
> Regards,
> Lizhou