Todd,

We are not going crazy with normalization. Actually, we are only normalizing where necessary. For example, we have one table for profiles and one for behaviors. They are joined together by a behavior status table. Each of these tables is denormalized when it comes to basic attributes. That’s the extent of it. From the sound of it, we are good for now.

Thanks,
Ben

> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hey Ben,
>
> Yea, we currently don't do great with very wide tables. For example, on
> flushes, we'll separately write and fsync each of the underlying columns, so
> if you have hundreds, it can get very expensive. Another factor is that
> currently every 'Write' RPC actually contains the full schema information for
> all columns, regardless of whether you've set them for a particular row.
>
> I'm sure we'll make improvements in these areas in the coming months/years,
> but for now, the recommendation is to stick with a schema that looks more
> like an RDBMS schema than an HBase one.
>
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't
> bother normalizing out a 'date' column into a 'date_id' and separate 'dates'
> table, as one might have done in a fully normalized RDBMS table in days of
> yore. Kudu's columnar layout, in conjunction with encodings like dictionary
> encoding, makes that kind of normalization ineffective or even
> counter-productive, as it introduces extra joins and query-time complexity.
>
> One other item to note is that more normalized schemas require more of your
> query engine's planning capabilities. If you aren't doing joins, a
> very dumb query planner is fine. If you're doing complex joins across 10+
> tables, then the quality of plans makes an enormous difference in query
> performance. To speak in concrete terms, I would guess that with more heavily
> normalized schemas, Impala's query planner would do a much better job than
> Spark's, given that we don't currently expose information on table sizes to
> Spark and thus it's likely to do a poor job of join ordering.
>
> Hope that helps
>
> -Todd
>
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <bbuil...@gmail.com
> <mailto:bbuil...@gmail.com>> wrote:
> I would like to know if normalization techniques should or should not be
> necessary when modeling table schemas in Kudu. I read that a table with
> around 50 columns is ideal. This would mean a very wide table should be
> avoided.
>
> Thanks,
> Ben
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
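
For anyone reading this thread later, Todd's point about dictionary encoding can be illustrated with a small sketch. This is plain Python, not Kudu's actual implementation, and the function names and sample data are made up for illustration. It shows how a repetitive 'date' column is stored as a short list of distinct values plus small integer codes, which is roughly what a hand-built 'date_id' and 'dates' table would give you, but without a join at query time.

```python
from datetime import date, timedelta


def dictionary_encode(values):
    """Map each distinct value to a small integer code, in first-seen order.

    Illustrative only: a real columnar store does this per block of a
    column, with its own on-disk layout.
    """
    dictionary = []        # code -> value
    codes_by_value = {}    # value -> code
    encoded = []           # one small integer per row
    for value in values:
        code = codes_by_value.get(value)
        if code is None:
            code = len(dictionary)
            codes_by_value[value] = code
            dictionary.append(value)
        encoded.append(code)
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in encoded]


# Hypothetical data: 10,000 rows spread over 30 distinct dates.
start = date(2016, 10, 1)
date_column = [(start + timedelta(days=i % 30)).isoformat() for i in range(10_000)]

dictionary, encoded = dictionary_encode(date_column)

print(f"rows: {len(encoded)}, distinct dates stored once: {len(dictionary)}")
```

The encoded column here plays the role of the 'date_id' foreign key and the dictionary plays the role of the 'dates' table, except that the storage layer maintains both and the query still sees a single 'date' column.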