some answers inline:

On Fri, Jul 1, 2016 at 10:56 AM, John Omernik <[email protected]> wrote:
> I looked at that, and both the meta and schema options didn't provide me
> block size.
>
> I may be looking at parquet block size wrong, so let me toss out some
> observations and inferences I am making, and then others who know the
> spec/format can confirm or correct.
>
> 1. The block size in parquet is NOT file size. A Parquet file can have
> multiple blocks in a single file? (Question: when this occurs, do the
> blocks then line up with DFS block size/chunk size as recommended, or do
> we get weird issues?) In practice, do writes aim for 1 block per file?

Drill always writes one row group per file. (If you want to verify the row
group count and sizes of an existing file yourself, see the footer-inspection
sketch at the end of this message.)

> 2. The block size, when writing, is computed prior to compression. This
> is an inference based on the parquet-mr library. A job that has a parquet
> block size of 384 MB seems to average files of around 256 MB in size.
> Thus, my theory is that the amount of data in a parquet block is computed
> prior to write, and compression is then applied as the file is written,
> ensuring that the block size (and file size, if 1 is not true, or if you
> are just writing a single file) will be under the dfs.blocksize if you
> make both settings the same.
>
> 3. Because of 2, setting dfs.blocksize = parquet block size is a good
> rule, because the files will always be under the DFS block size with
> compression, ensuring you don't have cross-block reads happening. (You
> don't have to, for example, set the parquet block size to be less than
> the DFS block size to avoid any weird issues.)
>
> 4. Also because of 2, with compression enabled, you don't need any slack
> space for file headers or footers to ensure the files don't cross DFS
> blocks.
>
> 5. In general, larger dfs/parquet block sizes will be good for reader
> performance; however, as block sizes get larger, write memory demands
> increase. True/False? In general, does a larger block size also put
> pressure on reader memory?

We already know the writer will use more heap if you have larger block
sizes. I believe the current implementation of the reader won't necessarily
use more memory, as it will always try to read a specific number of rows at
a time (not sure though).

> 6. Any other thoughts/challenges on block size? When talking about
> hundreds/thousands of GB of data, little changes in performance like
> block size can make a difference. I am really interested in tips/stories
> to help me understand better.
>
> John
>
>
> On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra <[email protected]> wrote:
>
> > parquet-tools perhaps?
> >
> > https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
> >
> >
> > On Fri, Jul 1, 2016 at 5:39 AM, John Omernik <[email protected]> wrote:
> >
> > > Is there any way, with Drill or with other tools, given a Parquet
> > > file, to detect the block size it was written with? I am copying data
> > > from one cluster to another, and trying to determine the block size.
> > >
> > > While I was able to get the size by asking the devs, I was wondering,
> > > is there any way to reliably detect it?
> > >
> > > John

--
Abdelhakim Deneche
Software Engineer
http://www.mapr.com/
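For anyone who wants to dig in directly, here is a rough sketch of the footer
inspection mentioned above, written against the parquet-mr metadata APIs (the
same library parquet-tools wraps). It is only a sketch: it assumes
parquet-hadoop and hadoop-common are on the classpath and the 1.8.x-era method
names; the class name and the command-line argument are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectRowGroups {
  public static void main(String[] args) throws Exception {
    // Path to the Parquet file you want to inspect, e.g. one of Drill's 0_0_0.parquet files.
    Path file = new Path(args[0]);

    // Read only the footer; this does not scan the data pages.
    ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
    System.out.println("created by: " + footer.getFileMetaData().getCreatedBy());

    int i = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      // getTotalByteSize() is the *uncompressed* size of the row group, which is the
      // number the writer's block-size setting is compared against; the compressed
      // size (summed from the column chunks) is what actually lands on disk.
      long compressed = 0;
      for (ColumnChunkMetaData column : block.getColumns()) {
        compressed += column.getTotalSize();
      }
      System.out.printf("row group %d: rows=%d uncompressed=%d compressed=%d%n",
          i++, block.getRowCount(), block.getTotalByteSize(), compressed);
    }
  }
}

If Drill does indeed write one row group per file, you should see a single row
group whose uncompressed size stays at or under the block size the file was
written with, so this is one way to recover (or at least bound) the original
setting.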
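For completeness, here is a minimal sketch of how the block-size knob looks on
the write path in plain parquet-mr, which is where the "measured before
compression" behaviour comes from. This is not Drill's writer (Drill exposes
the setting through its store.parquet.block-size option); the schema, output
path, row count, and the 384 MB figure are only illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int64 id; required binary name; }");

    // The "parquet block size": the writer flushes a row group once the *buffered,
    // uncompressed* data reaches roughly this many bytes. Bigger values mean more
    // heap held by the writer before each flush.
    int blockSize = 384 * 1024 * 1024;
    int pageSize = 1024 * 1024;

    ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .withRowGroupSize(blockSize)
        .withPageSize(pageSize)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();

    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    for (long i = 0; i < 1000; i++) {
      Group row = factory.newGroup();
      row.add("id", i);
      row.add("name", "row-" + i);
      writer.write(row);
    }
    // Compression (Snappy here) is applied as row groups are flushed, so the file
    // on disk comes out smaller than blockSize -- which is why matching
    // dfs.blocksize to the parquet block size keeps each row group inside a single
    // DFS block.
    writer.close();
  }
}

Since the threshold is checked against buffered, uncompressed data and the
codec only runs on flush, files land on disk smaller than the configured block
size, which lines up with the 384 MB -> ~256 MB observation in point 2.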
