On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Todd,
>
> We are not going crazy with normalization. Actually, we are only
> normalizing where necessary. For example, we have a table for profiles and
> behaviors. They are joined together by a behavior status table. Each one of
> these tables is de-normalized when it comes to basic attributes. That’s
> the extent of it. From the sound of it, it looks like we are good for now.
>

Yea, sounds good. One thing to keep an eye on is
https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala. This
should help a lot with joins where one side of the join has selective
predicates on a large table.

-Todd

> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hey Ben,
>
> Yea, we currently don't do great with very wide tables. For example, on
> flushes, we'll separately write and fsync each of the underlying columns,
> so if you have hundreds, it can get very expensive. Another factor is that
> currently every 'Write' RPC actually contains the full schema information
> for all columns, regardless of whether you've set them for a particular row.
>
> I'm sure we'll make improvements in these areas in the coming
> months/years, but for now, the recommendation is to stick with a schema
> that looks more like an RDBMS schema than an HBase one.
>
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't
> bother normalizing out a 'date' column into a 'date_id' and separate
> 'dates' table, as one might have done in a fully normalized RDBMS table in
> days of yore. Kudu's columnar layout, in conjunction with encodings like
> dictionary encoding, makes that kind of normalization ineffective or even
> counter-productive, as it introduces extra joins and query-time complexity.
>
> One other item to note is that more normalized schemas require more of
> your query engine's planning capabilities. If you aren't doing joins, a
> very dumb query planner is fine. If you're doing complex joins across 10+
> tables, then the quality of plans makes an enormous difference in query
> performance. To speak in concrete terms, I would guess that with more
> heavily normalized schemas, Impala's query planner would do a much better
> job than Spark's, given that we don't currently expose information on
> table sizes to Spark and thus it's likely to do a poor job of join
> ordering.
>
> Hope that helps
>
> -Todd
>
>
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I would like to know if normalization techniques should or should not be
>> necessary when modeling table schemas in Kudu. I read that a table with
>> around 50 columns is ideal. This would mean a very wide table should be
>> avoided.
>>
>> Thanks,
>> Ben
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

--
Todd Lipcon
Software Engineer, Cloudera
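
To make the point about dictionary encoding in the quoted reply concrete, here is a minimal Python sketch. It is not Kudu's implementation; the function names (`dictionary_encode`, `dictionary_decode`) and the sample data are made up for illustration. It only shows the general idea: a dictionary-encoded column already stores each distinct value once and keeps a small integer code per row, which is roughly what a hand-built 'date_id' column plus a 'dates' table would give you, but without a join at query time.

```python
import random
from datetime import date, timedelta


def dictionary_encode(values):
    """Toy dictionary encoding: store each distinct value once and
    replace every row's value with a small integer code."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []       # one integer code per row
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical sample data: 10,000 rows drawn from 30 distinct dates.
random.seed(0)
start = date(2016, 1, 1)
days = [(start + timedelta(days=i)).isoformat() for i in range(30)]
column = [random.choice(days) for _ in range(10_000)]

dictionary, codes = dictionary_encode(column)

# The encoding is lossless, and the dictionary is tiny next to the row count.
assert dictionary_decode(dictionary, codes) == column
print(len(dictionary), "distinct values for", len(codes), "rows")
```

In other words, the storage layer is already doing the "normalize repeated values into an id" step inside the column, so splitting the date out into its own table mainly adds a join.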