Comments below

> On Nov 30, 2016, at 4:29 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riwe...@akamai.com 
> <mailto:riwe...@akamai.com>> wrote:
> Hi All,
> 
> I'm trying to figure out the right/best/easiest way to find out how much 
> space a given table is taking up on the various tablet servers.  I'm really 
> looking at finding:
> * Physical space taken on all disks
> * Logical space taken on all disks
> * Sizing of Indices/Bloom Filters, etc.
> * Sizing with and without replication.
> 
> I'm trying to run an apples-to-apples comparison of how big the data is when 
> stored in Kudu compared to storing it in its native format (gzipped CSV) as 
> well as in Parquet format on HDFS.  Ultimately, I'd like to be able to do 
> reporting on the different tables to say that table X is taking up Y TB, 
> where Y consists of A physical size, B index, C bloom filters, etc.
> 
> Looking through the web UI, I don't really see any good summary of how much 
> space the entire table is taking.  It seems like I'd need to walk through 
> each tablet server, connect to its metrics page, and generate the summary 
> information myself.
> 
> 
> Yeah, unfortunately we do not expose much of this information in a useful way 
> at the moment. The metrics page is the best source of info for the various 
> sizes, and even those are often estimates rather than exact figures.

Ok

> 
> In terms of cross-server metrics aggregation, it's been our philosophy so far 
> that we should try to avoid doing a poor job of things that other systems are 
> likely to do better -- metrics aggregation being one such thing. It's likely 
> we'll add simple aggregation of table sizes, since that info is very useful 
> for SQL engines to do JOIN ordering, but I don't think we'd start adding the 
> more granular breakdowns like indexes, blooms, etc.

I definitely understand that.  Index sizes (and sizes of other related data) 
are mainly of interest to me just to see what the performance improvements of 
Kudu vs. Parquet vs. CSV "cost" in terms of storage.

> 
> If your use case is a one-time experiment to understand the data volumes, it 
> would be pretty straightforward to write a tool to do this kind of summary 
> against the on-disk metadata of a tablet server. For example, you can load 
> the tablet metadata, group the blocks by type/column, and then aggregate as 
> you prefer. Unfortunately this would give you only the physical size and not 
> the logical, since you'd have to scan the actual data to know its 
> uncompressed sizes.
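
Understood on the logical size -- if I end up needing uncompressed numbers, I'll 
probably just scan the table back through the Python client and sum an 
approximate per-row size.  A crude sketch of that, assuming the kudu-python 
client is available; the master host and table name below are just placeholders:

    # Crude sketch of the "scan the data" route for a logical-size estimate:
    # pull the rows back through the kudu-python client and sum an approximate
    # uncompressed size per row.  Only practical for smallish tables, since
    # read_all_tuples() buffers everything in memory.
    import kudu

    client = kudu.connect(host="kudu-master.example.com", port=7051)
    table = client.table("my_table")

    logical_bytes = 0
    for row in table.scanner().open().read_all_tuples():
        # Very rough: length of the textual representation of each cell.
        logical_bytes += sum(len(str(v)) for v in row)

    print("approx logical size: %d bytes" % logical_bytes)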

I'm really looking for the sizings for two purposes:
1) As mentioned above, to help assess the "costs" of Kudu vs. other systems we 
already have in place, especially in terms of storage.
2) To perform longer-term monitoring of the different table sizes: how they're 
growing, how many resources they're using, and so on.

For one particular use case we have, our data comes in as Protobuf and is 
imported as ORC data into a Hive table.  Comparing Parquet vs. ORC, the data 
sizes are about 3x larger.  Kudu seems like it will give us a much more 
performant and natural fit for our dataset, but if it's 2x larger than Parquet 
again, that really increases the storage costs.

So on that note, I'm not looking for an exact number for the size.  If it's off 
by, say, +-5%, that's certainly close enough to be in the ballpark.

> 
> If you have any interest in helping to build such a tool I'd be happy to 
> point you in the right direction. Otherwise let's file a JIRA to add this as 
> a new feature in a future release.

Let me poke and ponder a bit on that first and see what I can get via hack & 
kludge -- something along the lines of the sketch below.  We need to publish 
our metrics in CSV format for the monitoring bit, so I don't know how useful 
our solution would necessarily be to the larger community.
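
Roughly, the hack would be to hit the /metrics endpoint on each tablet server 
and roll the per-tablet numbers up per table.  A minimal sketch, assuming the 
JSON layout of /metrics (a list of entities with "type", "attributes", and 
"metrics") and a per-tablet size metric named "on_disk_size" -- the hostnames 
and the metric name are placeholders and may differ by version:

    # Sketch: scrape /metrics from each tablet server and emit per-table totals
    # as CSV.  The tserver hostnames and the "on_disk_size" metric name are
    # assumptions and may need adjusting for your deployment/version.
    import csv
    import json
    import sys
    import urllib.request
    from collections import defaultdict

    TSERVERS = ["tserver1.example.com:8050", "tserver2.example.com:8050"]
    SIZE_METRIC = "on_disk_size"

    totals = defaultdict(int)
    for ts in TSERVERS:
        with urllib.request.urlopen("http://%s/metrics" % ts) as resp:
            entities = json.load(resp)
        for entity in entities:
            if entity.get("type") != "tablet":
                continue
            table = entity.get("attributes", {}).get("table_name", "unknown")
            for metric in entity.get("metrics", []):
                if metric.get("name") == SIZE_METRIC:
                    totals[table] += metric.get("value", 0)

    # Summing across all tablet servers counts every replica, so this is the
    # size *with* replication; divide by the replication factor for one copy.
    writer = csv.writer(sys.stdout)
    writer.writerow(["table", "bytes_with_replication"])
    for table, size in sorted(totals.items()):
        writer.writerow([table, size])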

Thanks

--Rick

> 
> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
