On Mon, Oct 10, 2016 at 4:51 PM, Benjamin Kim <[email protected]> wrote:
> Todd,
>
> Our usage is very basic right now, but if we do expand to doing more
> in the area of analytics, then we will consider using Impala too.
> Right now, we want to prove the power of Kudu to the coders, who
> despise SQL, and then give the analysts a go at it. They will need a
> JDBC interface, which Impala would provide.

Got it, makes sense. That's one of Kudu's selling points -- those who
hate SQL can use APIs, and those who hate APIs can use SQL :)

Just wanted to point out that, if you do try SQL and find bad
performance on certain types of joins, you can expect improvements
down the line.

-Todd

> On Oct 10, 2016, at 4:46 PM, Todd Lipcon <[email protected]> wrote:
>
> On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <[email protected]> wrote:
>
>> Todd,
>>
>> We are not going crazy with normalization. Actually, we are only
>> normalizing where necessary. For example, we have a table for
>> profiles and one for behaviors. They are joined together by a
>> behavior status table. Each of these tables is de-normalized when it
>> comes to basic attributes. That's the extent of it. From the sound of
>> it, it looks like we are good for now.
>
> Yea, sounds good.
>
> One thing to keep an eye on is
> https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala --
> this should help a lot with joins where one side of the join has
> selective predicates on a large table.
>
> -Todd
>
>> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <[email protected]> wrote:
>>
>> Hey Ben,
>>
>> Yea, we currently don't do great with very wide tables. For example,
>> on flushes, we'll separately write and fsync each of the underlying
>> columns, so if you have hundreds, it can get very expensive. Another
>> factor is that currently every 'Write' RPC actually contains the full
>> schema information for all columns, regardless of whether you've set
>> them for a particular row.
>>
>> I'm sure we'll make improvements in these areas in the coming
>> months/years, but for now, the recommendation is to stick with a
>> schema that looks more like an RDBMS schema than an HBase one.
>>
>> However, I wouldn't go _crazy_ on normalization. For example, I
>> wouldn't bother normalizing out a 'date' column into a 'date_id' and
>> a separate 'dates' table, as one might have done in a fully
>> normalized RDBMS schema in days of yore. Kudu's columnar layout, in
>> conjunction with encodings like dictionary encoding, makes that kind
>> of normalization ineffective or even counter-productive, as it
>> introduces extra joins and query-time complexity.
>>
>> One other item to note is that more normalized schemas require more
>> of your query engine's planning capabilities. If you aren't doing
>> joins, a very dumb query planner is fine. If you're doing complex
>> joins across 10+ tables, then the quality of plans makes an enormous
>> difference in query performance. To speak in concrete terms, I would
>> guess that with more heavily normalized schemas, Impala's query
>> planner would do a much better job than Spark's, given that we don't
>> currently expose information on table sizes to Spark, and thus it's
>> likely to do a poor job of join ordering.
>>
>> Hope that helps.
>>
>> -Todd
>>
>> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <[email protected]> wrote:
>>
>>> I would like to know whether normalization techniques are necessary
>>> when modeling table schemas in Kudu. I read that a table with around
>>> 50 columns is ideal, which would mean a very wide table should be
>>> avoided.
>>>
>>> Thanks,
>>> Ben
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
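
To make the schema advice above concrete, here is a minimal sketch
using the Kudu Java client. Everything in it (master address, table
name, columns, partition count) is hypothetical and invented for
illustration: a moderately wide, de-normalized table that keeps a
repetitive 'event_date' string inline with dictionary encoding, rather
than normalizing it out into a 'date_id' plus a separate 'dates' table.

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  import org.apache.kudu.ColumnSchema;
  import org.apache.kudu.Schema;
  import org.apache.kudu.Type;
  import org.apache.kudu.client.CreateTableOptions;
  import org.apache.kudu.client.KuduClient;
  import org.apache.kudu.client.KuduException;

  public class CreateBehaviorsTable {
    public static void main(String[] args) throws KuduException {
      // Hypothetical master address; substitute your own.
      KuduClient client =
          new KuduClient.KuduClientBuilder("kudu-master:7051").build();
      try {
        List<ColumnSchema> columns = new ArrayList<>();
        // Primary key columns come first in a Kudu schema.
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("profile_id", Type.INT64)
                .key(true).build());
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("event_date", Type.STRING)
                .key(true)
                // Keep the repetitive date string inline; dictionary
                // encoding stores repeated values compactly, so no
                // separate 'dates' lookup table (and join) is needed.
                .encoding(ColumnSchema.Encoding.DICT_ENCODING)
                .build());
        // A handful of de-normalized basic attributes -- dozens of
        // columns are fine, hundreds are not.
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("behavior", Type.STRING)
                .nullable(true).build());
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("status", Type.STRING)
                .nullable(true).build());

        CreateTableOptions options = new CreateTableOptions()
            .addHashPartitions(Arrays.asList("profile_id"), 16);
        client.createTable("behaviors", new Schema(columns), options);
      } finally {
        client.close();
      }
    }
  }

Because the dictionary-encoded column stays cheap on disk, queries read
it directly and avoid the extra join and query-time complexity that the
normalized alternative would introduce.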

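Ben's point that the analysts will need a JDBC interface can be
sketched as well. Impala speaks the HiveServer2 protocol (port 21050 by
default), so a plain JDBC client with the Hive JDBC driver on the
classpath can query the same Kudu-backed table. The host name, database,
and query below are hypothetical:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class ImpalaJdbcExample {
    public static void main(String[] args) throws Exception {
      // Needed for older driver jars that don't self-register.
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      // 'impala-host' is a placeholder for a real impalad address;
      // auth=noSasl is for an unsecured cluster.
      String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
      try (Connection conn = DriverManager.getConnection(url);
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(
               "SELECT status, COUNT(*) FROM behaviors GROUP BY status")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }

This is the split the thread describes: the coders write against the
Kudu client API directly, while the analysts reach the same tables
through Impala over standard JDBC.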