If you're storing this in S3... you might want to read the files selectively
as well (i.e. pull only the columns or row groups you need).
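
For example, here is a minimal sketch of a selective read using pyarrow and
s3fs; the bucket path and column names are hypothetical placeholders:

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment

    # Read only the columns you need instead of the whole file.
    table = pq.read_table(
        "my-bucket/converted/part-0.parquet",  # hypothetical path
        columns=["col_a", "col_b"],            # hypothetical column names
        filesystem=fs,
    )
    print(table.num_rows)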


I'm only speculating, but if you want to download the data, downloading a
queue of smaller files might be more reliable than downloading one massive
file. Similarly, within AWS, it *might* be faster to have an EC2 instance
access a couple of large Parquet files rather than one massive Parquet file.
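
As a rough sketch of the multi-file download (this assumes boto3 with
credentials already configured; the bucket and key names are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")
    keys = ["converted/part-0.parquet", "converted/part-1.parquet"]

    def fetch(key):
        # Each file downloads independently, so a failure only forces a
        # retry of that one file, not of the whole dataset.
        s3.download_file("my-bucket", key, key.split("/")[-1])

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fetch, keys))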


Remember that when you configure a large block size, Drill tries to write
everything into a single row group per file. So there is no chance of
parallelizing the read (i.e. reading parts of the file in parallel). The
defaults should work well for S3 as well, and with compression (e.g. Snappy),
you should get a reasonably smaller file size.
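
You can confirm the row group layout from the Parquet footer; here's a
minimal sketch with pyarrow (the file name is a placeholder for a local copy
of one of your files):

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("part-0.parquet").metadata
    print("row groups:", meta.num_row_groups)  # 1 means no parallel reads
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        # Compression is recorded per column chunk, e.g. SNAPPY.
        print(i, rg.num_rows, rg.total_byte_size, rg.column(0).compression)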


With the current default settings... have you seen what Parquet file sizes you 
get with Drill when converting your 10GB CSV source files?
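
If you want a quick way to check, something like this (the bucket name and
output prefix are hypothetical) lists the sizes of the files Drill wrote:

    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="converted/")
    for obj in resp.get("Contents", []):
        print(obj["Key"], round(obj["Size"] / 2**20, 1), "MiB")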


________________________________
From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
Sent: Friday, June 9, 2017 10:50:06 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Thanks, Kunal, for your insight.
I am actually converting some .csv files and storing them in parquet format
in s3, not in HDFS.
The individual .csv source files can be quite large (around 10GB).
So, is there a way to overcome this and create one parquet file, or do I
have to go ahead with multiple parquet files?

On 09-Jun-2017 11:04 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:

> Shuporno
>
>
> There are some interesting problems when using Parquet files > 2GB on HDFS.
>
>
> If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly
> enough) return an int value. A large Parquet block size also means you'll
> end up having the file span multiple HDFS blocks, and that would make
> reading of row groups inefficient.
>
>
> Is there a reason you want to create such a large parquet file?
>
>
> ~ Kunal
>
> ________________________________
> From: Vitalii Diravka <vitalii.dira...@gmail.com>
> Sent: Friday, June 9, 2017 4:49:02 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Khurram,
>
> DRILL-2478 is a good placeholder for the LongValidator issue; it really
> does behave incorrectly.
>
> But the other issue is connected to the impossibility of using long values
> for the parquet block-size.
> This issue can be an independent task or a sub-task of updating the Drill
> project to the latest parquet library.
>
> Kind regards
> Vitalii
>
> On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz <kfar...@mapr.com> wrote:
>
> >   1.  DRILL-2478<https://issues.apache.org/jira/browse/DRILL-2478> is
> > Open for this issue.
> >   2.  I have added more details in the comments.
> >
> > Thanks,
> > Khurram
> >
> > ________________________________
> > From: Shuporno Choudhury <shuporno.choudh...@manthan.com>
> > Sent: Friday, June 9, 2017 12:48:41 PM
> > To: user@drill.apache.org
> > Subject: Increasing store.parquet.block-size
> >
> > The max value that can be assigned to *store.parquet.block-size* is
> > *2147483647*, as the value kind of this configuration parameter is LONG.
> > This basically translates to a 2GB block size.
> > How do I increase it to 3/4/5 GB?
> > Trying to set this parameter to a higher value using the following
> > command actually succeeds:
> >     ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> > But when I try to run a query that uses this config, it throws the
> > following error:
> >    Error: SYSTEM ERROR: NumberFormatException: For input string:
> > "4294967296"
> > So, is it possible to assign a higher value to this parameter?
> > --
> > Regards,
> > Shuporno Choudhury
> >
>
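
(Side note on the NumberFormatException quoted above: 2147483647 is 2^31 - 1,
the maximum signed 32-bit integer, so 4294967296 fails to parse as an int even
though the option is nominally a LONG. A quick sanity check:)

    # 2147483647 == 2**31 - 1, the signed 32-bit int ceiling.
    INT_MAX = 2**31 - 1
    print(INT_MAX)                 # 2147483647
    print(4294967296 > INT_MAX)    # True -- hence the parse failure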
