I am looking forward to the MapR 1.7 dev preview because of the metadata
user impersonation JIRA fix. "Drill always writes one row group per file."
So is this one Parquet block? "row group" is a new term in this email
thread :)
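
On the detection question further down the thread: as far as I can tell,
parquet-mr's configured "block size" is the row group size, and a file's
footer records the actual size of each row group (not the writer's
configured target). Here is a rough, untested sketch of reading that
footer with the parquet-mr Java API; the org.apache.parquet package names
are from current releases and the file path is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectRowGroups {
  public static void main(String[] args) throws Exception {
    // Placeholder path to the Parquet file you want to inspect.
    Path file = new Path("maprfs:///data/mytable/0_0_0.parquet");
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), file);
    // Each entry in the footer's block list is one row group.
    System.out.println("row groups in file: " + footer.getBlocks().size());
    for (BlockMetaData block : footer.getBlocks()) {
      System.out.printf("rows=%d compressed=%d bytes uncompressed=%d bytes%n",
          block.getRowCount(), block.getCompressedSize(),
          block.getTotalByteSize());
    }
  }
}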

On Fri, Jul 1, 2016 at 2:09 PM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Just make sure you enable Parquet metadata caching; otherwise, the more
> files you have, the more time Drill will spend reading the metadata from
> every single file.
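
In Drill, I believe that means running REFRESH TABLE METADATA against the
table directory so the cache files get built. A minimal sketch of issuing
it over JDBC, assuming the Drill JDBC driver is on the classpath and with
made-up connection and table paths:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefreshParquetMetadata {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");
    // Connection URL and table path below are placeholders.
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      // Builds/refreshes the Parquet metadata cache for the table directory.
      stmt.execute("REFRESH TABLE METADATA dfs.`/data/mytable`");
    }
  }
}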
>
> On Fri, Jul 1, 2016 at 11:17 AM, John Omernik <j...@omernik.com> wrote:
>
> > In addition:
> > 7. Generally speaking, keeping the number of files low will help in
> > multiple phases of planning/execution. True/False?
> >
> >
> >
> > On Fri, Jul 1, 2016 at 12:56 PM, John Omernik <j...@omernik.com> wrote:
> >
> > > I looked at that, and neither the meta nor the schema option showed
> > > me the block size.
> > >
> > > I may be looking at parquet block size wrong, so let me toss out some
> > > observations, and inferences I am making, and then others who know the
> > > spec/format can confirm or correct.
> > >
> > > 1. The block size in Parquet is NOT the file size. A Parquet file can
> > > contain multiple blocks? (Question: when this occurs, do the blocks
> > > then line up with the DFS block/chunk size as recommended, or do we
> > > get weird issues?) In practice, do writers aim for one block per file?
> > > 2. The block size, when writing, is computed prior to compression.
> > > This is an inference based on the parquet-mr library: a job with a
> > > Parquet block size of 384 MB seems to average files of around 256 MB.
> > > Thus, my theory is that the amount of data counted toward the Parquet
> > > block size is measured before writing, and compression is then applied
> > > as the file is written, ensuring that the block size (and the file
> > > size, if point 1 is not true or if you are just writing a single file)
> > > will be under the dfs.blocksize if you make both settings the same.
> > > 3. Because of 2, setting dfs.blocksize = parquet blocksize is a good
> > > rule: with compression, the files will always come in under the DFS
> > > block size, ensuring you don't have cross-block reads happening. (You
> > > don't have to, for example, set the Parquet block size to be less than
> > > the DFS block size to avoid any weird issues.) See the writer sketch
> > > after this list.
> > > 4. Also because of 2, with compression enabled, you don't need any
> > > slack space for file headers or footers to ensure the files don't
> > > cross DFS blocks.
> > > 5. In general, larger DFS/Parquet block sizes are good for reader
> > > performance; however, as they get larger, write-side memory demands
> > > increase. True/False? Does a larger block size also put pressure on
> > > reader memory?
> > > 6. Any other thoughts/challenges on block size? When you are dealing
> > > with hundreds or thousands of GB of data, small changes like block
> > > size can make a real performance difference. I am really interested
> > > in tips/stories to help me understand this better.
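
To make point 3 concrete, here is a rough sketch (untested, and only one
way to do it) of lining up the Parquet block size and the DFS block size
when writing with the parquet-mr Avro bindings. The 384 MB figure is just
the example from point 2, the output path and schema are made up, and the
class/method names are from the current org.apache.parquet packages, so
older releases may differ:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteWithMatchedBlockSizes {
  public static void main(String[] args) throws Exception {
    // Parquet row-group ("block") target, measured before compression.
    long blockSize = 384L * 1024 * 1024;

    Configuration conf = new Configuration();
    // Request the same block size for the new file on the file system side.
    conf.setLong("dfs.blocksize", blockSize);

    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");

    // Output path is a placeholder.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(
                     new Path("maprfs:///data/out/part-0.parquet"))
                 .withSchema(schema)
                 .withConf(conf)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .withRowGroupSize((int) blockSize) // the Parquet "block size"
                 .build()) {
      // ... write GenericRecord instances here ...
    }
  }
}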
> > >
> > > John
> > >
> > >
> > >
> > > On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra <pchan...@maprtech.com>
> > > wrote:
> > >
> > >> parquet-tools perhaps?
> > >>
> > >> https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
> > >>
> > >>
> > >>
> > >> On Fri, Jul 1, 2016 at 5:39 AM, John Omernik <j...@omernik.com> wrote:
> > >>
> > >> > Is there any way, with Drill or with other tools, given a Parquet
> > >> > file, to detect the block size it was written with? I am copying
> > >> > data from one cluster to another, and trying to determine the block
> > >> > size.
> > >> >
> > >> > While I was able to get the size by asking the devs, I was
> > >> > wondering, is there any way to reliably detect it?
> > >> >
> > >> > John
> > >> >
> > >>
> > >
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
