Hey Ben,

Yeah, we currently don't do great with very wide tables. For example, on
flushes, we'll separately write and fsync each of the underlying columns,
so if you have hundreds, it can get very expensive. Another factor is that
currently every 'Write' RPC actually contains the full schema information
for all columns, regardless of whether you've set them for a particular row.

I'm sure we'll make improvements in these areas in the coming months/years,
but for now, the recommendation is to stick with a schema that looks more
like an RDBMS schema than an HBase one.

However, I wouldn't go _crazy_ on normalization. For example, I wouldn't
bother normalizing out a 'date' column into a 'date_id' and separate
'dates' table, as one might have done in a fully normalized RDBMS table in
days of yore. Kudu's columnar layout, in conjunction with encodings like
dictionary encoding, makes that kind of normalization ineffective or even
counter-productive, since it just introduces extra joins and query-time
complexity.
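
To make that concrete, here's a rough sketch (using the Kudu client API from
Scala; the table name, column names, and master address are just made up for
illustration) of keeping the date directly in the fact table with dictionary
encoding, rather than splitting it out into a separate 'dates' table:

  import org.apache.kudu.{ColumnSchema, Schema, Type}
  import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder
  import org.apache.kudu.client.{CreateTableOptions, KuduClient}
  import scala.collection.JavaConverters._

  // Hypothetical "events" table: the date value lives directly in the fact
  // table and relies on dictionary encoding to collapse repeated values,
  // instead of being normalized out into a separate dimension table.
  val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

  val columns = List(
    new ColumnSchemaBuilder("event_id", Type.INT64).key(true).build(),
    new ColumnSchemaBuilder("event_date", Type.STRING)
      .encoding(ColumnSchema.Encoding.DICT_ENCODING)
      .build(),
    new ColumnSchemaBuilder("amount", Type.DOUBLE).build()
  ).asJava

  val schema = new Schema(columns)
  val options = new CreateTableOptions()
    .addHashPartitions(List("event_id").asJava, 4)

  client.createTable("events", schema, options)
  client.shutdown()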

One other item to note is that more normalized schemas require more of your
query engine's planning capabilities. If you aren't doing
joins, a very dumb query planner is fine. If you're doing complex joins
across 10+ tables, then the quality of plans makes an enormous difference
in query performance. To speak in concrete terms, I would guess that with
more heavily normalized schemas, Impala's query planner would do a much
better job than Spark's, given that we don't currently expose information
on table sizes to Spark and thus it's likely to do a poor job of join
ordering.
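
As a rough illustration of one workaround when the planner is missing size
information: if you already know one side of a join is small, you can hint
it explicitly in Spark (the DataFrame names below are hypothetical):

  import org.apache.spark.sql.functions.broadcast

  // `facts` and `dims` are hypothetical DataFrames read from Kudu tables.
  // Without size statistics Spark may pick a poor join order or shuffle
  // both sides; the broadcast hint tells it the dimension table is small
  // enough to ship to every executor.
  val joined = facts.join(broadcast(dims), Seq("dim_id"))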

Hope that helps

-Todd


On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> I would like to know if normalization techniques should or should not be
> necessary when modeling table schemas in Kudu. I read that a table with
> around 50 columns is ideal. This would mean a very wide table should be
> avoided.
>
> Thanks,
> Ben
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
