Adding some guard rails to Kudu

2016-11-30 Thread Todd Lipcon
Hey folks,

I've started working on a few patches to add "guard rails" to various
user-specified dimensions in Kudu. In particular, I'm planning to add
limits to the following:

- max number of columns in a table (proposal: 300)
- max replication factor (proposal: 7)
- max table name or column name length (proposal: 256)
- max size of a binary/string column cell value (proposal: 64 KB)

The reasoning is that, even though in some cases we don't know a specific
issue that will happen outside these limits, we've done very little testing
(and have no automated testing) outside of these ranges. In some cases, we
do know that there is a certain threshold that will cause a big problem (e.g.
large cell sizes can cause tablet servers to crash). In other cases, it's
just "unknown territory".

In all cases, I'm planning on making the limits overridable via an "unsafe"
configuration flag. That means that a user can run with
"--unlock_unsafe_flags --max_identifier_length=1000" if they want to, but
they're explicitly accepting some risk that they're entering untested
territory.
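
To make that concrete, here's a minimal, purely illustrative Python sketch
of the kind of pre-flight check a client application could run against the
proposed defaults. The limit values just mirror the proposals above, and the
plain-list schema representation is made up for the example -- it is not the
Kudu client API.

PROPOSED_LIMITS = {
    "max_columns": 300,
    "max_replication_factor": 7,
    "max_identifier_length": 256,      # table and column names
    "max_cell_size_bytes": 64 * 1024,  # binary/string values, checked per write
}

def check_table_definition(table_name, column_names, replication_factor):
    """Return a list of guard-rail violations for a proposed table."""
    problems = []
    if len(column_names) > PROPOSED_LIMITS["max_columns"]:
        problems.append("too many columns: %d > %d"
                        % (len(column_names), PROPOSED_LIMITS["max_columns"]))
    if replication_factor > PROPOSED_LIMITS["max_replication_factor"]:
        problems.append("replication factor %d > %d"
                        % (replication_factor,
                           PROPOSED_LIMITS["max_replication_factor"]))
    for name in [table_name] + list(column_names):
        if len(name) > PROPOSED_LIMITS["max_identifier_length"]:
            problems.append("identifier longer than %d chars: %r"
                            % (PROPOSED_LIMITS["max_identifier_length"], name))
    return problems

if __name__ == "__main__":
    wide_schema = ["c%04d" % i for i in range(350)]
    for problem in check_table_definition("metrics_wide", wide_schema, 3):
        print(problem)   # e.g. "too many columns: 350 > 300"

The real enforcement would of course live server-side in the flag validation
on the masters and tablet servers; the point here is just to show the shape
of the limits and the defaults being proposed.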

Of course, in all cases, if we hear that there are people who are bumping
the maxes higher than the defaults and having good results, we can consider
raising the maximum, but I think it's smarter to start conservatively low
and raise later as we increase test coverage. Also, I'm sure down the road
we'll add features such as BLOB support or sparse column support, and at
that time we can remove the corresponding guard rails.

I'm sending this note to both user@ and dev@ to solicit feedback. Are there
any other dimensions people can think of where we should probably add
guard rails? Is anyone out there already outside of the above ranges who
can make a case that we're being too conservative?

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Good way to find "Real" size of the tables

2016-11-30 Thread Todd Lipcon
On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard  wrote:

> Hi All,
>
> I'm trying to figure out the right/best/easiest way to find out how much
> space a given table is taking up on the various tablet servers.  I'm
> looking really at finding:
> * Physical space taken on all disks
> * Logical space taken on all disks
> * Sizing of Indices/Bloom Filters, etc.
> * Sizing with and without replication.
>
> I'm trying to run an apples-to-apples comparison of how big the data is when
> stored in Kudu compared to storing it in its native format (Gzipped CSV)
> as well as in Parquet format on HDFS.  Ultimately, I'd like to be able to
> do reporting on the different tables to say Table X is taking up Y TB,
> where Y consists of A physical size, B Index, C Bloom, etc.
>
> Looking through the Web UI I don't really see any good summary of how much
> space the entire table is taking.  It seems like I'd need to walk through
> each Tablet server, connect to the metrics page and generate the summary
> information myself.
>
>
Yeah, unfortunately we don't expose much of this information in a useful
way at the moment. The metrics page is the best source of info for the
various sizes, though even those numbers are often estimates rather than
exact figures.
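
If you want a quick do-it-yourself aggregation in the meantime, here's a
rough Python sketch that pulls the JSON /metrics page from each tablet
server and sums per-tablet sizes by table. The "tablet" entity type, the
"table_name" attribute, and the "on_disk_size" metric name reflect my
understanding of what the metrics page exposes, so treat them as
assumptions and adjust to whatever your version actually reports; the
host names and web UI port are placeholders.

import json
from collections import defaultdict
from urllib.request import urlopen

# Hypothetical tablet server web UI addresses (default web port is 8050).
TSERVERS = ["tserver1.example.com:8050", "tserver2.example.com:8050"]

def table_sizes(tservers):
    """Sum the on-disk size of every tablet replica, grouped by table."""
    sizes = defaultdict(int)
    for ts in tservers:
        with urlopen("http://%s/metrics" % ts) as resp:
            entities = json.load(resp)
        for entity in entities:
            if entity.get("type") != "tablet":
                continue
            table = entity.get("attributes", {}).get("table_name", "<unknown>")
            for metric in entity.get("metrics", []):
                if metric.get("name") == "on_disk_size":
                    sizes[table] += metric.get("value", 0)
    return sizes

if __name__ == "__main__":
    for table, nbytes in sorted(table_sizes(TSERVERS).items()):
        print("%-40s %15d bytes" % (table, nbytes))

Note that summing across every tablet server counts each replica, so the
totals include replication, and since the on-disk size reflects compressed
data it's a physical rather than logical measure.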

In terms of cross-server metrics aggregation, it's been our philosophy so
far that we should try to avoid doing a poor job of things that other
systems are likely to do better -- metrics aggregation being one such
thing. It's likely we'll add simple aggregation of table sizes, since that
info is very useful for SQL engines to do JOIN ordering, but I don't think
we'd start adding the more granular breakdowns like indexes, blooms, etc.

If your use case is a one-time experiment to understand the data volumes,
it would be pretty straightforward to write a tool to do this kind of
summary against the on-disk metadata of a tablet server. For example, you
can load the tablet metadata, group the blocks by type/column, and then
aggregate as you prefer. Unfortunately this would give you only the
physical size and not the logical one, since knowing the uncompressed size
would require scanning the actual data.

If you have any interest in helping to build such a tool I'd be happy to
point you in the right direction. Otherwise let's file a JIRA to add this
as a new feature in a future release.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Good way to find "Real" size of the tables

2016-11-30 Thread Weber, Richard
Hi All,

I'm trying to figure out the right/best/easiest way to find out how much space
a given table is taking up on the various tablet servers.  I'm looking
really at finding:
* Physical space taken on all disks
* Logical space taken on all disks
* Sizing of Indices/Bloom Filters, etc.
* Sizing with and without replication.

I'm trying to run an apples-to-apples comparison of how big the data is when
stored in Kudu compared to storing it in its native format (Gzipped CSV) as
well as
in Parquet format on HDFS.  Ultimately, I'd like to be able to do reporting on 
the different tables to say Table X is taking up Y TB, where Y consists of A
physical size, B Index, C Bloom, etc.

Looking through the Web UI I don't really see any good summary of how much 
space the entire table is taking.  It seems like I'd need to walk through each 
Tablet server, connect to the metrics page and generate the summary information 
myself.

Am I overlooking something?

--Rick Weber
riwe...@akamai.com





