I looked at that, and neither the meta nor the schema option gave me the block size.
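For what it's worth, here is a rough, untested sketch (my own, not something from the thread) that pulls the per-row-group ("block") sizes straight from a file's footer with the parquet-mr APIs. As far as I can tell the writer's configured parquet.block.size isn't recorded in the file itself, so the actual row group sizes are the closest proxy:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class ShowRowGroupSizes {
      public static void main(String[] args) throws Exception {
        // Read only the footer; the row groups ("blocks") and their sizes live there.
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        int i = 0;
        for (BlockMetaData block : footer.getBlocks()) {
          System.out.println("row group " + (i++)
              + ": rows=" + block.getRowCount()
              + ", uncompressedBytes=" + block.getTotalByteSize()
              + ", compressedBytes=" + block.getCompressedSize());
        }
      }
    }

(Class name and output format are just mine; I haven't run this against our files.)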
I may be looking at Parquet block size wrong, so let me toss out some observations and inferences I am making, and then others who know the spec/format can confirm or correct them.

1. The block size in Parquet is NOT the file size. A Parquet file can have multiple blocks in a single file, correct? (Question: when this occurs, do the blocks then line up with the DFS block size/chunk size as recommended, or do we get weird issues?) In practice, do writers aim for one block per file?

2. The block size, when writing, is computed prior to compression. This is an inference based on the parquet-mr library: a job with a Parquet block size of 384 MB seems to average files of around 256 MB. My theory, then, is that the amount of data counted against the Parquet block size is measured before the write, and compression is applied as the file is written, so the block size (and the file size, if point 1 is not true or if you are just writing a single file) will come in under dfs.blocksize if you make both settings the same.

3. Because of 2, setting dfs.blocksize = Parquet block size is a good rule: with compression the files will always be under the DFS block size, ensuring you don't have cross-block reads happening. (You don't have to, for example, set the Parquet block size to be less than the DFS block size to avoid weird issues.) A small configuration sketch is appended at the bottom of this message.

4. Also because of 2, with compression enabled you don't need any slack space for file headers or footers to keep the files from crossing DFS blocks.

5. In general, larger DFS/Parquet block sizes are good for reader performance; however, as they get larger, write-side memory demands increase. True/false? Does a larger block size also put pressure on reader memory?

6. Any other thoughts/challenges on block size? When talking about hundreds or thousands of GB of data, small changes in performance like block size can make a difference. I am really interested in tips/stories to help me understand better.

John

On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra <[email protected]> wrote:

> parquet-tools perhaps?
>
> https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
>
>
>
> On Fri, Jul 1, 2016 at 5:39 AM, John Omernik <[email protected]> wrote:
>
> > Is there any way, with Drill or with other tools, given a Parquet file,
> > to detect the block size it was written with? I am copying data from one
> > cluster to another, and trying to determine the block size.
> >
> > While I was able to get the size by asking the devs, I was wondering, is
> > there any way to reliably detect it?
> >
> > John
> >
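Footnote to point 3 above: the write-side configuration I have in mind is just the two settings below. This is only a sketch; I'm assuming the standard property names (parquet.block.size read by parquet-mr, dfs.blocksize read by HDFS), and the 384 MB value is the example from point 2:

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeConfig {
      public static Configuration blockSizeConf() {
        Configuration conf = new Configuration();
        long blockSize = 384L * 1024 * 1024;  // 384 MB, the example from point 2
        // HDFS block size for files created with this conf.
        conf.setLong("dfs.blocksize", blockSize);
        // Parquet row group ("block") target; per my inference in point 2 this
        // appears to be measured pre-compression by parquet-mr.
        conf.setLong("parquet.block.size", blockSize);
        return conf;
      }
    }

If the inference in point 2 holds, keeping both keys on the same value should land each compressed file within a single DFS block.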
