some answers inline:

On Fri, Jul 1, 2016 at 10:56 AM, John Omernik <[email protected]> wrote:
> I looked at that, and both the meta and schema options didn't provide me
> block size.
>
> I may be looking at parquet block size wrong, so let me toss out some
> observations and inferences I am making, and then others who know the
> spec/format can confirm or correct.
>
> 1. The block size in parquet is NOT file size. A Parquet file can have
> multiple blocks in a single file? (Question: when this occurs, do the
> blocks then line up with DFS block size/chunk size as recommended, or do
> we get weird issues?) In practice, do writes aim for 1 block per file?

Drill always writes one row group per file. (If you want to verify the row
group count and sizes of an existing file yourself, see the footer-inspection
sketch at the end of this message.)

> 2. The block size, when writing, is computed prior to compression. This
> is an inference based on the parquet-mr library. A job that has a parquet
> block size of 384 MB seems to average files of around 256 MB in size.
> Thus, my theory is that the amount of data in a parquet block is computed
> prior to write, and compression is then applied as the file is written,
> ensuring that the block size (and file size, if 1 is not true, or if you
> are just writing a single file) will be under the dfs.blocksize if you
> make both settings the same.
>
> 3. Because of 2, setting dfs.blocksize = parquet block size is a good
> rule, because the files will always be under the DFS block size with
> compression, ensuring you don't have cross-block reads happening. (You
> don't have to, for example, set the parquet block size to be less than
> the DFS block size to avoid any weird issues.)
>
> 4. Also because of 2, with compression enabled, you don't need any slack
> space for file headers or footers to ensure the files don't cross DFS
> blocks.
>
> 5. In general, larger dfs/parquet block sizes will be good for reader
> performance; however, as block sizes get larger, write memory demands
> increase. True/False? In general, does a larger block size also put
> pressure on reader memory?

We already know the writer will use more heap if you have larger block
sizes. I believe the current implementation of the reader won't necessarily
use more memory, as it will always try to read a specific number of rows at
a time (not sure though).

> 6. Any other thoughts/challenges on block size? When talking about
> hundreds/thousands of GB of data, little changes in performance like
> block size can make a difference. I am really interested in tips/stories
> to help me understand better.
>
> John
>
>
> On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra <[email protected]> wrote:
>
> > parquet-tools perhaps?
> >
> > https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
> >
> >
> > On Fri, Jul 1, 2016 at 5:39 AM, John Omernik <[email protected]> wrote:
> >
> > > Is there any way, with Drill or with other tools, given a Parquet
> > > file, to detect the block size it was written with? I am copying data
> > > from one cluster to another, and trying to determine the block size.
> > >
> > > While I was able to get the size by asking the devs, I was wondering,
> > > is there any way to reliably detect it?
> > >
> > > John

--
Abdelhakim Deneche
Software Engineer
http://www.mapr.com/
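For anyone who wants to dig in directly, here is a rough sketch of the footer
inspection mentioned above, written against the parquet-mr metadata APIs (the
same library parquet-tools wraps). It is only a sketch: it assumes
parquet-hadoop and hadoop-common are on the classpath and the 1.8.x-era method
names; the class name and the command-line argument are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectRowGroups {
  public static void main(String[] args) throws Exception {
    // Path to the Parquet file you want to inspect, e.g. one of Drill's 0_0_0.parquet files.
    Path file = new Path(args[0]);

    // Read only the footer; this does not scan the data pages.
    ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
    System.out.println("created by: " + footer.getFileMetaData().getCreatedBy());

    int i = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      // getTotalByteSize() is the *uncompressed* size of the row group, which is the
      // number the writer's block-size setting is compared against; the compressed
      // size (summed from the column chunks) is what actually lands on disk.
      long compressed = 0;
      for (ColumnChunkMetaData column : block.getColumns()) {
        compressed += column.getTotalSize();
      }
      System.out.printf("row group %d: rows=%d uncompressed=%d compressed=%d%n",
          i++, block.getRowCount(), block.getTotalByteSize(), compressed);
    }
  }
}

If Drill does indeed write one row group per file, you should see a single row
group whose uncompressed size stays at or under the block size the file was
written with, so this is one way to recover (or at least bound) the original
setting.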
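For completeness, here is a minimal sketch of how the block-size knob looks on
the write path in plain parquet-mr, which is where the "measured before
compression" behaviour comes from. This is not Drill's writer (Drill exposes
the setting through its store.parquet.block-size option); the schema, output
path, row count, and the 384 MB figure are only illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int64 id; required binary name; }");

    // The "parquet block size": the writer flushes a row group once the *buffered,
    // uncompressed* data reaches roughly this many bytes. Bigger values mean more
    // heap held by the writer before each flush.
    int blockSize = 384 * 1024 * 1024;
    int pageSize = 1024 * 1024;

    ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .withRowGroupSize(blockSize)
        .withPageSize(pageSize)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();

    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    for (long i = 0; i < 1000; i++) {
      Group row = factory.newGroup();
      row.add("id", i);
      row.add("name", "row-" + i);
      writer.write(row);
    }
    // Compression (Snappy here) is applied as row groups are flushed, so the file
    // on disk comes out smaller than blockSize -- which is why matching
    // dfs.blocksize to the parquet block size keeps each row group inside a single
    // DFS block.
    writer.close();
  }
}

Since the threshold is checked against buffered, uncompressed data and the
codec only runs on flush, files land on disk smaller than the configured block
size, which lines up with the 384 MB -> ~256 MB observation in point 2.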
