Todd,

Our usage is very basic right now, but if we expand into analytics, we will 
consider using Impala too. For now, we want to prove the power of Kudu to the 
coders, who despise SQL, and then give the analysts a go at it. They will need 
a JDBC interface, which is where Impala would help.

Thanks,
Ben


> On Oct 10, 2016, at 4:46 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Todd,
> 
> We are not going crazy with normalization. Actually, we are only normalizing 
> where necessary. For example, we have tables for profiles and behaviors, 
> joined together by a behavior status table. Each of these tables is 
> de-normalized when it comes to basic attributes. That’s the extent of it. 
> From the sound of it, it looks like we are good for now.
> 
> Yea, sounds good.
> 
> One thing to keep an eye on, if you use Impala, is 
> https://issues.cloudera.org/browse/IMPALA-4252 - this should help a lot with 
> joins where one side of the join has selective predicates on a large table.
> 
> -Todd
>  
> 
>> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
>> 
>> Hey Ben,
>> 
>> Yea, we currently don't do great with very wide tables. For example, on 
>> flushes, we'll separately write and fsync each of the underlying columns, so 
>> if you have hundreds, it can get very expensive. Another factor is that 
>> currently every 'Write' RPC contains the full schema information for all 
>> columns, regardless of whether you've set values for them in a particular row.
>> 
>> I'm sure we'll make improvements in these areas in the coming months/years, 
>> but for now, the recommendation is to stick with a schema that looks more 
>> like an RDBMS schema than an HBase one.
>> 
>> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't 
>> bother normalizing out a 'date' column into a 'date_id' and a separate 'dates' 
>> table, as one might have done in a fully normalized RDBMS schema in days of 
>> yore. Kudu's columnar layout, in conjunction with encodings like dictionary 
>> encoding, makes that kind of normalization ineffective or even 
>> counter-productive, as it introduces extra joins and query-time complexity.
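>> 
>> To make that concrete, here's a rough sketch with the Kudu Java client (the 
>> table name, column names, and partitioning are made up for illustration): 
>> instead of a 'date_id' foreign key into a 'dates' table, just declare the 
>> date column inline with dictionary encoding and let Kudu deduplicate the 
>> repeated values on disk.
>> 
>>     import java.util.Arrays;
>> 
>>     import org.apache.kudu.ColumnSchema;
>>     import org.apache.kudu.Schema;
>>     import org.apache.kudu.Type;
>>     import org.apache.kudu.client.CreateTableOptions;
>>     import org.apache.kudu.client.KuduClient;
>> 
>>     public class InlineDateColumn {
>>         public static void main(String[] args) throws Exception {
>>             KuduClient client =
>>                 new KuduClient.KuduClientBuilder("kudu-master.example.com").build();
>>             try {
>>                 Schema schema = new Schema(Arrays.asList(
>>                     new ColumnSchema.ColumnSchemaBuilder("event_id", Type.INT64)
>>                         .key(true).build(),
>>                     // Keep the date inline: dictionary encoding stores each
>>                     // distinct value once per block, so repetition is cheap.
>>                     new ColumnSchema.ColumnSchemaBuilder("event_date", Type.STRING)
>>                         .encoding(ColumnSchema.Encoding.DICT_ENCODING)
>>                         .build()));
>>                 client.createTable("events", schema,
>>                     new CreateTableOptions()
>>                         .addHashPartitions(Arrays.asList("event_id"), 4));
>>             } finally {
>>                 client.close();
>>             }
>>         }
>>     }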
>> 
>> One other item to note is that more normalized schemas demand more of your 
>> query engine's planning capabilities. If you aren't doing joins, a very dumb 
>> query planner is fine. If you're doing complex joins across 10+ tables, then 
>> the quality of plans makes an enormous difference in query performance. To 
>> speak in concrete terms, I would guess that with more heavily normalized 
>> schemas, Impala's query planner would do a much better job than Spark's, 
>> given that we don't currently expose table size information to Spark, so 
>> it's likely to do a poor job of join ordering.
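>> 
>> If you do end up doing those joins from Spark, one workaround is to hint the 
>> small side of the join yourself. A rough Java sketch below - the table names 
>> come from your earlier description, but the join key 'profile_id' is 
>> hypothetical, and the Kudu-Spark data source format string may differ by 
>> version:
>> 
>>     import org.apache.spark.sql.Dataset;
>>     import org.apache.spark.sql.Row;
>>     import org.apache.spark.sql.SparkSession;
>>     import static org.apache.spark.sql.functions.broadcast;
>> 
>>     public class HintedKuduJoin {
>>         public static void main(String[] args) {
>>             SparkSession spark = SparkSession.builder()
>>                 .appName("kudu-hinted-join").getOrCreate();
>> 
>>             Dataset<Row> profiles = spark.read()
>>                 .format("org.apache.kudu.spark.kudu")  // Kudu data source
>>                 .option("kudu.master", "kudu-master.example.com")
>>                 .option("kudu.table", "profiles")
>>                 .load();
>>             Dataset<Row> statuses = spark.read()
>>                 .format("org.apache.kudu.spark.kudu")
>>                 .option("kudu.master", "kudu-master.example.com")
>>                 .option("kudu.table", "behavior_status")
>>                 .load();
>> 
>>             // Spark can't see the Kudu tables' sizes, so explicitly mark the
>>             // small side for broadcast rather than trusting the planner.
>>             profiles.join(broadcast(statuses), "profile_id").show();
>>         }
>>     }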
>> 
>> Hope that helps
>> 
>> -Todd
>> 
>> 
>> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> I would like to know whether normalization techniques are necessary when 
>> modeling table schemas in Kudu. I read that a table with around 50 columns 
>> is ideal, which would mean very wide tables should be avoided.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
