For metadata, you can use 'parquet-tools dump' and pipe the output to
more/less.
The dump command will print the block (aka row group) and page-level
metadata. It will then dump all the data, so be prepared to cancel when
that happens.
Setting dfs.blocksize == parquet.blocksize is a very good idea, and
I am looking forward to the MapR 1.7 dev preview because of the metadata
user-impersonation JIRA fix. "Drill always writes one row group per
file." So is this one Parquet block? "Row group" is a new term in this
email thread :)
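To make the dfs.blocksize == parquet.blocksize point concrete, here is a small back-of-the-envelope sketch. The sizes below are hypothetical, chosen only for illustration; they are not values from this thread:

```python
import math

# Hypothetical sizes, for illustration only. Drill writes one row group
# per file, so a file's single row group is at most parquet.block.size.
dfs_blocksize = 128 * 1024 * 1024       # filesystem (HDFS/MapR-FS) block size
parquet_blocksize = 512 * 1024 * 1024   # parquet.block.size (row-group target)

# If the row group is larger than the filesystem block, a single reader
# must pull its row group from several filesystem blocks, some of which
# may live on remote nodes.
fs_blocks_spanned = math.ceil(parquet_blocksize / dfs_blocksize)
print(fs_blocks_spanned)  # -> 4

# Setting the two sizes equal keeps each row group inside one filesystem
# block, so a scan fragment can read its row group locally.
```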
On Fri, Jul 1, 2016 at 2:09 PM, Abdel Hakim Deneche wrote:
Just make sure you enable Parquet metadata caching; otherwise, the more
files you have, the more time Drill will spend reading the metadata from
every single file.
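For reference, Drill populates its Parquet metadata cache per table directory with a SQL command; a minimal sketch (the table path here is hypothetical):

```sql
-- Build/refresh the Parquet metadata cache for a table directory, so
-- planning reads one cache file instead of every Parquet file's footer.
REFRESH TABLE METADATA dfs.`/data/my_parquet_table`;
```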
On Fri, Jul 1, 2016 at 11:17 AM, John Omernik wrote:
some answers inline:
On Fri, Jul 1, 2016 at 10:56 AM, John Omernik wrote:
In addition:
7. Generally speaking, keeping the number of files low will help in
multiple phases of planning/execution. True/False?
On Fri, Jul 1, 2016 at 12:56 PM, John Omernik wrote:
I looked at that, and neither the meta nor the schema option gave me the
block size.
I may be looking at Parquet block size wrong, so let me toss out some
observations and inferences I am making, and then others who know the
spec/format can confirm or correct.
1. The block size in parquet is
parquet-tools perhaps?
https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
On Fri, Jul 1, 2016 at 5:39 AM, John Omernik wrote:
Is there any way, with Drill or with other tools, given a Parquet file, to
detect the block size it was written with? I am copying data from one
cluster to another, and trying to determine the block size.
While I was able to get the size by asking the devs, I was wondering, is
there any way to
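One low-level way to poke at a file without parquet-tools: per the Parquet format spec, a file starts and ends with the 4-byte magic "PAR1", and the 4 bytes just before the trailing magic hold the little-endian length of the Thrift-encoded FileMetaData footer, which is where the row-group (block) boundaries are recorded. A minimal stdlib-only sketch that locates the footer follows; the demo file it builds is synthetic, and decoding actual row-group sizes would still need a Thrift-aware reader such as parquet-tools:

```python
import struct

def parquet_footer_info(path):
    """Return (footer_length, file_size) for a Parquet file.

    Parquet layout (per the format spec): the file begins and ends with
    the 4-byte magic b"PAR1"; the 4 bytes before the trailing magic are
    the little-endian length of the Thrift-encoded FileMetaData footer.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)          # seek to end to learn the file size
        size = f.tell()
        f.seek(size - 8)      # footer length (4 bytes) + magic (4 bytes)
        tail = f.read(8)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file (missing trailing magic)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len, size

# Synthetic demo file: magic + payload + fake footer + footer length + magic.
# (A real footer is Thrift-encoded FileMetaData; these are just zero bytes.)
fake_footer = b"\x00" * 32
with open("demo.parquet", "wb") as f:
    f.write(b"PAR1" + b"data" + fake_footer
            + struct.pack("<I", len(fake_footer)) + b"PAR1")

print(parquet_footer_info("demo.parquet"))  # -> (32, 48)
```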