Re: Schema Normalization

Todd Lipcon Mon, 10 Oct 2016 16:46:49 -0700

On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:


> Todd,
>
> We are not going crazy with normalization. Actually, we are only
> normalizing where necessary. For example, we have a table for profiles and
> behaviors. They are joined together by a behavior status table. Each one of
> these tables is de-normalized when it comes to basic attributes. That’s
> the extent of it. From the sound of it, it looks like we are good for now.
>

Yea, sounds good.

One thing to keep an eye on is
https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala - this
should help a lot with joins where one side of the join has selective
predicates on a large table.

-Todd


>
> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hey Ben,
>
> Yea, we currently don't do great with very wide tables. For example, on
> flushes, we'll separately write and fsync each of the underlying columns,
> so if you have hundreds, it can get very expensive. Another factor is that
> currently every 'Write' RPC actually contains the full schema information
> for all columns, regardless of whether you've set them for a particular row.
>
> I'm sure we'll make improvements in these areas in the coming
> months/years, but for now, the recommendation is to stick with a schema
> that looks more like an RDBMS schema than an HBase one.
>
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't
> bother normalizing out a 'date' column into a 'date_id' and separate
> 'dates' table, as one might have done in a fully normalized RDBMS table in
> days of yore. Kudu's columnar layout, in conjunction with encodings like
> dictionary encoding, makes that kind of normalization ineffective or even
> counter-productive, as it introduces extra joins and query-time complexity.
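
To illustrate the dictionary-encoding point, here is a minimal sketch in plain Python. It is not Kudu's actual implementation; the function names and the sample data are made up for illustration. It only shows that a repetitive 'date' column already collapses into a small dictionary plus one integer code per row, which is roughly what a hand-built 'date_id' column and 'dates' table would give you, minus the join.

```python
from datetime import date, timedelta


def dictionary_encode(values):
    """Return (dictionary, codes).

    dictionary: distinct values in first-seen order.
    codes: one small integer per row, indexing into the dictionary.
    """
    index = {}
    codes = []
    for v in values:
        # A new value gets the next free code; a repeated value reuses its code.
        codes.append(index.setdefault(v, len(index)))
    return list(index), codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


# Hypothetical data: 10,000 rows spread over 30 distinct dates.
start = date(2016, 10, 1)
column = [(start + timedelta(days=i % 30)).isoformat() for i in range(10_000)]

dictionary, codes = dictionary_encode(column)
print(len(column), "rows,", len(dictionary), "distinct dates in the dictionary")
```

The storage layer is effectively doing the 'dates' lookup table for you, per column and without any schema changes.
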
>
> One other item to note is that more normalized schemas demand more of
> your query engine's planning capabilities. If you aren't doing
> joins, a very dumb query planner is fine. If you're doing complex joins
> across 10+ tables, then the quality of plans makes an enormous difference
> in query performance. To speak in concrete terms, I would guess that with
> more heavily normalized schemas, Impala's query planner would do a much
> better job than Spark's, given that we don't currently expose information
> on table sizes to Spark and thus it's likely to do a poor job of join
> ordering.
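
As a rough picture of why table sizes matter for join planning, here is a toy model in Python. It is not how Impala or Spark actually plan joins; the function, the fallback rule, and the row counts are all assumptions for illustration. The idea is only that a hash join wants its hash table built on the smaller input, and a planner without size information has to guess.

```python
def pick_build_side(left, right, row_counts=None):
    """Return the name of the table to build the hash table on.

    row_counts: optional dict of table name -> estimated rows.
    Assumption for this sketch: with no size information, the planner
    simply builds on the right-hand table as the query was written.
    """
    if row_counts is None:
        return right
    return min((left, right), key=lambda t: row_counts[t])


# Hypothetical sizes for the tables mentioned in this thread.
row_counts = {"profiles": 50_000_000, "behavior_status": 1_000}

# Query written as: behavior_status JOIN profiles
blind_choice = pick_build_side("behavior_status", "profiles")
informed_choice = pick_build_side("behavior_status", "profiles", row_counts)

print("without sizes, build on:", blind_choice)
print("with sizes, build on:   ", informed_choice)
```

With a chain of 10+ joins, each such guess compounds, which is where plan quality starts to dominate query time.
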
>
> Hope that helps
>
> -Todd
>
>
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I would like to know if normalization techniques should or should not be
>> necessary when modeling table schemas in Kudu. I read that a table with
>> around 50 columns is ideal. This would mean a very wide table should be
>> avoided.
>>
>> Thanks,
>> Ben
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
