Sorry, wound up setting this part of my project aside to complete the remainder 
of my evaluation of Kudu.  I do hope I'll be able to swing back around, as I'd 
like to compare the sizing impact of the different knobs we've twisted in Kudu, 
as well as against HDFS-based file formats.

 

I'll definitely post an update/script once I get something together.

 

-- Rick Weber

 

 

From: Todd Lipcon <t...@cloudera.com>
Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
Date: Monday, December 12, 2016 at 1:22 AM
To: "user@kudu.apache.org" <user@kudu.apache.org>
Subject: Re: Good way to find "Real" size of the tables

 

Hey Rick, 

 

Just wanted to check and see if you were able to make any progress on the 
experiments you were running. It would be great if you could share your 
findings or any issues you encountered.

 

-Todd

 

On Thu, Dec 1, 2016 at 10:49 PM, Weber, Richard <riwe...@akamai.com> wrote:

Comments below 

 

 

On Nov 30, 2016, at 4:29 PM, Todd Lipcon <t...@cloudera.com> wrote:

 

On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riwe...@akamai.com> wrote:

Hi All, 

 

I'm trying to figure out the right/best/easiest way to find out how much space 
a given table is taking up on the various tablet servers.  I'm really looking 
to find:

* Physical space taken on all disks

* Logical space taken on all disks

* Sizing of Indices/Bloom Filters, etc.

* Sizing with and without replication.

 

I'm trying to run an apples-to-apples comparison of how big data is when stored 
in Kudu compared to storing it in its native format (gzipped CSV) as well as 
in Parquet format on HDFS.  Ultimately, I'd like to be able to do reporting on 
the different tables to say Table X is taking up Y TB, where Y consists of A 
physical size, B index, C bloom, etc.

 

Looking through the web UI I don't really see any good summary of how much 
space the entire table is taking.  It seems like I'd need to walk through each 
tablet server, connect to the metrics page, and generate the summary 
information myself.
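
[A minimal sketch of that scrape-and-aggregate approach, assuming each tablet 
server's /metrics endpoint returns its usual JSON list of entities; the host 
list and the per-tablet metric name "on_disk_size" below are assumptions to 
adapt to whatever your Kudu version actually exposes:]

#!/usr/bin/env python3
# Hedged sketch, not a real Kudu tool: walk each tablet server's
# /metrics page and sum a per-tablet size metric by table. The host
# list and the metric name "on_disk_size" are assumptions -- check
# the names your Kudu version actually exposes at /metrics.
import json
from collections import defaultdict
from urllib.request import urlopen

TSERVERS = ["tserver1.example.com:8050", "tserver2.example.com:8050"]
METRIC = "on_disk_size"  # bytes per tablet replica (assumed name)

totals = defaultdict(int)  # table name -> bytes, replication included
for host in TSERVERS:
    with urlopen("http://%s/metrics" % host) as resp:
        entities = json.loads(resp.read().decode("utf-8"))
    for entity in entities:
        if entity.get("type") != "tablet":
            continue
        table = entity.get("attributes", {}).get("table_name", "<unknown>")
        for m in entity.get("metrics", []):
            if m.get("name") == METRIC:
                totals[table] += m["value"]

for table, size in sorted(totals.items()):
    print("%-40s %10.1f MB" % (table, size / 1024.0 / 1024.0))

[Since every tablet server reports its own replicas, these totals include 
replication; dividing by a table's replication factor gives a rough 
unreplicated number.]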

 

 

Yeah, unfortunately we do not expose much of this information in a useful way 
at the moment. The metrics page is the best source of info for the various 
sizes, and even those are often estimates rather than fully accurate numbers.

 

Ok



 

In terms of cross-server metrics aggregation, it's been our philosophy so far 
that we should try to avoid doing a poor job of things that other systems are 
likely to do better. It's likely we'll add simple aggregation of table sizes, 
since that info is very useful for SQL engines doing JOIN ordering, but I 
don't think we'd start adding the more granular breakdowns like indexes, 
blooms, etc.

 

Definitely understand on that.  Index sizes (and sizes of other related data) 
are mainly of interest to me just to compare what the performance improvements 
of Kudu vs Parquet vs CSV "cost" in terms of storage.



 

If your use case is a one-time experiment to understand the data volumes, it 
would be pretty straightforward to write a tool to do this kind of summary 
against the on-disk metadata of a tablet server. For example, you can load the 
tablet metadata, group the blocks by type/column, and then aggregate as you 
prefer. Unfortunately this would give you only the physical size and not the 
logical, since you'd have to scan the actual data to know its uncompressed 
sizes.
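
[As a rough illustration of that approach (not a real tool): the sketch below 
shells out to "kudu pbc dump" on a tablet's superblock file and groups block 
IDs by the metadata field they appear under. The field names (columns, 
redo_deltas, undo_deltas, bloom_block, adhoc_index_block) and the line-oriented 
parsing are assumptions about the text dump format; a real tool would parse 
the protobuf properly:]

#!/usr/bin/env python3
# Hedged sketch of the summary tool Todd describes: read a tablet
# superblock via "kudu pbc dump" and count blocks per category.
# Field names below are assumptions based on Kudu's tablet metadata
# protobuf; the crude parser assumes the multi-line text format
# that "kudu pbc dump" prints.
import re
import subprocess
import sys
from collections import Counter

CATEGORIES = {"columns", "redo_deltas", "undo_deltas",
              "bloom_block", "adhoc_index_block"}

def count_blocks(meta_path):
    text = subprocess.check_output(
        ["kudu", "pbc", "dump", meta_path]).decode("utf-8", "replace")
    counts = Counter()
    stack = []  # nesting of protobuf message field names
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"(\w+)\s*\{", line)
        if m:
            stack.append(m.group(1))
        elif line.startswith("}"):
            if stack:
                stack.pop()
        elif line.startswith("id:") and stack:
            # A block ID: attribute it to the nearest enclosing category.
            if stack[-1] in ("block", "bloom_block", "adhoc_index_block"):
                for field in reversed(stack):
                    if field in CATEGORIES:
                        counts[field] += 1
                        break
    return counts

if __name__ == "__main__":
    # Usage: python3 tablet_blocks.py <fs_data_dir>/tablet-meta/<tablet_id>
    for category, n in sorted(count_blocks(sys.argv[1]).items()):
        print("%-20s %6d block(s)" % (category, n))

[With the file block manager, each block ID maps to its own file on disk, so 
the same walk could stat the block files to turn these counts into physical 
bytes; with the log block manager that mapping isn't one-to-one.]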

 

I'm really looking for the sizings for two purposes:

1) As mentioned above, to help assess the "costs" of Kudu vs other systems we 
already have in place, especially in terms of storage

2) To perform longer-term monitoring of different tables' sizes, how they're 
growing, how many resources they're using, and so on.

 

For one particular use case we have, our data comes in as Protobuf and is 
imported as ORC data into a Hive table.  Comparing Parquet vs ORC, the data 
sizes are about 3x larger.  Kudu seems like it will give us a much more 
performant and natural fit for our dataset, but if it's 2x larger than Parquet 
again, that really increases the cost of storage.

 

So on that note, I'm not looking for an exact number on the size.  If it's off 
by, say, ±5%, that's certainly close enough.

 

 

If you have any interest in helping to build such a tool I'd be happy to point 
you in the right direction. Otherwise let's file a JIRA to add this as a new 
feature in a future release.

 

Let me poke and ponder a bit on that first and see what I can get via hack & 
kludge.  We need to publish our metrics in a CSV format for the monitoring 
bit, so I don't know how useful our solution would necessarily be to the 
larger community.
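
[For the CSV-publishing bit, a trivial hedged sketch of the shape such output 
might take; the table names and byte counts are placeholders, and in practice 
the totals would come from a metrics scrape like the earlier sketch:]

#!/usr/bin/env python3
# Hedged sketch: emit per-table size totals as timestamped CSV rows
# for a CSV-based monitoring pipeline. The table names and byte
# counts are placeholders; in practice `totals` would come from a
# metrics scrape like the earlier sketch.
import csv
import sys
import time

totals = {"table_x": 1234567890, "table_y": 987654321}  # table -> bytes

writer = csv.writer(sys.stdout)
writer.writerow(["timestamp", "table", "bytes_on_disk"])
now = int(time.time())
for table, size in sorted(totals.items()):
    writer.writerow([now, table, size])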

 

Thanks

 

--Rick



 

-Todd

-- 

Todd Lipcon
Software Engineer, Cloudera

 



 

-- 

Todd Lipcon
Software Engineer, Cloudera
