Re: Solution to read/write multiple parquet files

2018-04-04 Thread Uwe L. Korn
> Then what is the best practice for cutting these rows into
> Parquet files?
This depends a bit on what you are going to do with them afterwards.
Typically, RowGroups should be sized such that you can load them in bulk
into memory if you do batch processing on them. If you only plan to
query them, smaller or larger RowGroups will make a performance
difference depending on the query engine. In general, it is best to
check what happens with these files and then profile.
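
For illustration, a minimal sketch of that kind of inspection in Python with
pyarrow (the file name is hypothetical): it prints how a file is split into
RowGroups and then loads a single RowGroup in bulk.

    import pyarrow.parquet as pq

    # Open lazily; only the footer metadata is read at this point.
    pf = pq.ParquetFile("data.parquet")  # hypothetical file name

    # Inspect how the file is split into RowGroups.
    md = pf.metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f"RowGroup {i}: {rg.num_rows} rows, "
              f"{rg.total_byte_size / 2**20:.1f} MiB uncompressed")

    # Load one RowGroup in bulk, as batch processing would.
    table = pf.read_row_group(0)
    print(table.num_rows)
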
> Another question is: should we keep the same RowGroup size for one
> Parquet file?
You can vary the RowGroup size inside a Parquet file if that gives you
better performance. It is probably best to keep them uniform, though, so
that the size of the materialized data in memory is the same for all
RowGroups.
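
As a sketch of how that can look with pyarrow (file name and batch contents
are made up): each `write_table` call on a `ParquetWriter` takes its own
`row_group_size`, so a single file can contain RowGroups of different sizes.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Two hypothetical batches sharing the same schema.
    batch_a = pa.table({"id": list(range(100_000)), "value": [1.0] * 100_000})
    batch_b = pa.table({"id": list(range(100_000)), "value": [2.0] * 100_000})

    writer = pq.ParquetWriter("mixed_rowgroups.parquet", batch_a.schema)
    # Each write_table call is chunked with its own row_group_size,
    # so one file ends up with RowGroups of different row counts.
    writer.write_table(batch_a, row_group_size=50_000)   # two RowGroups of 50k rows
    writer.write_table(batch_b, row_group_size=100_000)  # one RowGroup of 100k rows
    writer.close()
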
Uwe

On Tue, Apr 3, 2018, at 11:37 AM, Lizhou Gao wrote:
> Thanks for your quick reply!
> Given the scenario below: there are 200k rows of SQL data, rows 0-100k
> contain more nulls while rows 100k-200k contain more non-null values.
> If we convert the two parts into Parquet files, we may get
> 0-100k.parquet (500M) and 100k-200k.parquet (1.3G). Then what is the
> best practice for cutting these rows into Parquet files?
> Another question is: should we keep the same RowGroup size for one
> Parquet file?
> 
> 
> Thanks,
> Lizhou
> -- Original --
> From: "Uwe L. Korn" <uw...@xhochy.com>
> Date: Tue, Apr 3, 2018 04:21 PM
> To: "dev" <dev@parquet.apache.org>
>
> Subject: Re: Solution to read/write multiple parquet files
>  
> Hello Lizhou,
> 
> On the Python side, there is http://dask.pydata.org/en/latest/ that can
> read large, distributed Parquet datasets. When using `engine=pyarrow`,
> it also uses parquet-cpp under the hood.
>
> On the pure C++ side, I know that https://github.com/thrill/thrill has
> experimental parquet support. But this is an experimental feature in
> an experimental framework, so be careful about relying on it.
>
> In general, Parquet files should not exceed single-digit gigabyte size
> and the RowGroups inside these files should be 128 MiB or less. You
> will be able to write tools that can deal with other sizes, but that
> will somewhat break the portability aspect of Parquet files.
>
> Uwe
> 
> On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> > Hi experts,
> > We have a storage engine that needs to manage a large set of data
> > (PB level). Currently we store it as a single Parquet file. After
> > some searching, it seems the data should be cut into multiple
> > Parquet files for further reading/writing/managing. But I don't
> > know whether there is already an open-source solution to
> > read/write/manage multiple Parquet files. Our programming language
> > is C++.
> > Any comments/suggestions are welcome. Thanks!
> > 
> > 
> > Regards,
> > Lizhou



Re: Solution to read/write multiple parquet files

2018-04-03 Thread Lizhou Gao
Thanks for your quick reply!
Given the scenario below: there are 200k rows of SQL data, rows 0-100k contain
more nulls while rows 100k-200k contain more non-null values.
If we convert the two parts into Parquet files, we may get 0-100k.parquet (500M)
and 100k-200k.parquet (1.3G). Then what is the best practice for cutting these
rows into Parquet files?
Another question is: should we keep the same RowGroup size for one Parquet file?

Thanks,
Lizhou
-- Original --
From: "Uwe L. Korn" <uw...@xhochy.com>
Date: Tue, Apr 3, 2018 04:21 PM
To: "dev" <dev@parquet.apache.org>

Subject:  Re: Solution to read/write multiple parquet files

 
Hello Lizhou,

On the Python side, there is http://dask.pydata.org/en/latest/ that can read
large, distributed Parquet datasets. When using `engine=pyarrow`, it also uses
parquet-cpp under the hood.

On the pure C++ side, I know that https://github.com/thrill/thrill has 
experimental parquet support. But this is an experimental feature in an 
experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte size and
the RowGroups inside these files should be 128 MiB or less. You will be
able to write tools that can deal with other sizes, but that will somewhat
break the portability aspect of Parquet files.

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB
> level). Currently we store it as a single Parquet file. After some
> searching, it seems the data should be cut into multiple Parquet files
> for further reading/writing/managing. But I don't know whether there is
> already an open-source solution to read/write/manage multiple Parquet
> files. Our programming language is C++.
> Any comments/suggestions are welcome. Thanks!
> 
> 
> Regards,
> Lizhou

Re: Solution to read/write multiple parquet files

2018-04-03 Thread Uwe L. Korn
Hello Lizhou,

On the Python side, there is http://dask.pydata.org/en/latest/ that can read
large, distributed Parquet datasets. When using `engine=pyarrow`, it also uses
parquet-cpp under the hood.
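
A minimal usage sketch (the directory layout and column name are hypothetical);
dask builds the dataframe lazily and computes partition-wise across the files:

    import dask.dataframe as dd

    # Read a whole directory of Parquet files as one logical dataframe.
    df = dd.read_parquet("data/events/*.parquet", engine="pyarrow")

    # Computation is lazy and runs partition-wise across all files.
    n_non_null = df["user_id"].count().compute()
    print(n_non_null)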

On the pure C++ side, I know that https://github.com/thrill/thrill has 
experimental parquet support. But this is an experimental feature in an 
experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte size and
the RowGroups inside these files should be 128 MiB or less. You will be
able to write tools that can deal with other sizes, but that will somewhat
break the portability aspect of Parquet files.
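
As a rough sketch of staying inside these limits (shown in Python with
pyarrow for brevity; parquet-cpp offers a similar chunk-size control, and
the row counts below are placeholders that depend on schema and compression),
one can slice a large table into several files with bounded RowGroups:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_in_parts(table: pa.Table, rows_per_file: int, prefix: str) -> None:
        """Split one large table into several Parquet files with bounded RowGroups."""
        for part, offset in enumerate(range(0, table.num_rows, rows_per_file)):
            chunk = table.slice(offset, rows_per_file)  # zero-copy slice
            pq.write_table(
                chunk,
                f"{prefix}-part-{part:05d}.parquet",
                # Tune so that one RowGroup stays at or below ~128 MiB;
                # the right row count depends on the schema and compression.
                row_group_size=1_000_000,
            )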

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB
> level). Currently we store it as a single Parquet file. After some
> searching, it seems the data should be cut into multiple Parquet files
> for further reading/writing/managing. But I don't know whether there is
> already an open-source solution to read/write/manage multiple Parquet
> files. Our programming language is C++.
> Any comments/suggestions are welcome. Thanks!
> 
> 
> Regards,
> Lizhou