On Mon, Oct 10, 2016 at 4:51 PM, Benjamin Kim <[email protected]> wrote:
> Todd,
>
> Our usage is very basic right now, but if we do expand to doing more
> in the area of analytics, then we will consider using Impala too.
> Right now, we want to prove the power of Kudu to the coders, who
> despise SQL, and then give the analysts a go at it. They will need a
> JDBC interface, which Impala would provide.

Got it, makes sense. That's one of Kudu's selling points -- those who
hate SQL can use APIs, and those who hate APIs can use SQL :)

Just wanted to point out that, if you do try SQL and find bad
performance on certain types of joins, you can expect improvements
down the line.

-Todd

> On Oct 10, 2016, at 4:46 PM, Todd Lipcon <[email protected]> wrote:
>
> On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim <[email protected]> wrote:
>
>> Todd,
>>
>> We are not going crazy with normalization. Actually, we are only
>> normalizing where necessary. For example, we have a table for
>> profiles and one for behaviors. They are joined together by a
>> behavior status table. Each of these tables is de-normalized when it
>> comes to basic attributes. That's the extent of it. From the sound of
>> it, it looks like we are good for now.
>
> Yea, sounds good.
>
> One thing to keep an eye on is
> https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala --
> this should help a lot with joins where one side of the join has
> selective predicates on a large table.
>
> -Todd
>
>> On Oct 10, 2016, at 4:15 PM, Todd Lipcon <[email protected]> wrote:
>>
>> Hey Ben,
>>
>> Yea, we currently don't do great with very wide tables. For example,
>> on flushes, we'll separately write and fsync each of the underlying
>> columns, so if you have hundreds, it can get very expensive. Another
>> factor is that currently every 'Write' RPC actually contains the full
>> schema information for all columns, regardless of whether you've set
>> them for a particular row.
>>
>> I'm sure we'll make improvements in these areas in the coming
>> months/years, but for now, the recommendation is to stick with a
>> schema that looks more like an RDBMS schema than an HBase one.
>>
>> However, I wouldn't go _crazy_ on normalization. For example, I
>> wouldn't bother normalizing out a 'date' column into a 'date_id' and
>> a separate 'dates' table, as one might have done in a fully
>> normalized RDBMS schema in days of yore. Kudu's columnar layout, in
>> conjunction with encodings like dictionary encoding, makes that kind
>> of normalization ineffective or even counter-productive, as it
>> introduces extra joins and query-time complexity.
>>
>> One other item to note is that more normalized schemas require more
>> of your query engine's planning capabilities. If you aren't doing
>> joins, a very dumb query planner is fine. If you're doing complex
>> joins across 10+ tables, then the quality of plans makes an enormous
>> difference in query performance. To speak in concrete terms, I would
>> guess that with more heavily normalized schemas, Impala's query
>> planner would do a much better job than Spark's, given that we don't
>> currently expose information on table sizes to Spark, and thus it's
>> likely to do a poor job of join ordering.
>>
>> Hope that helps.
>>
>> -Todd
>>
>> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim <[email protected]> wrote:
>>
>>> I would like to know whether normalization techniques are necessary
>>> when modeling table schemas in Kudu. I read that a table with around
>>> 50 columns is ideal, which would mean a very wide table should be
>>> avoided.
>>>
>>> Thanks,
>>> Ben
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
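
To make the schema advice above concrete, here is a minimal sketch
using the Kudu Java client. Everything in it (master address, table
name, columns, partition count) is hypothetical and invented for
illustration: a moderately wide, de-normalized table that keeps a
repetitive 'event_date' string inline with dictionary encoding, rather
than normalizing it out into a 'date_id' plus a separate 'dates' table.

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  import org.apache.kudu.ColumnSchema;
  import org.apache.kudu.Schema;
  import org.apache.kudu.Type;
  import org.apache.kudu.client.CreateTableOptions;
  import org.apache.kudu.client.KuduClient;
  import org.apache.kudu.client.KuduException;

  public class CreateBehaviorsTable {
    public static void main(String[] args) throws KuduException {
      // Hypothetical master address; substitute your own.
      KuduClient client =
          new KuduClient.KuduClientBuilder("kudu-master:7051").build();
      try {
        List<ColumnSchema> columns = new ArrayList<>();
        // Primary key columns come first in a Kudu schema.
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("profile_id", Type.INT64)
                .key(true).build());
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("event_date", Type.STRING)
                .key(true)
                // Keep the repetitive date string inline; dictionary
                // encoding stores repeated values compactly, so no
                // separate 'dates' lookup table (and join) is needed.
                .encoding(ColumnSchema.Encoding.DICT_ENCODING)
                .build());
        // A handful of de-normalized basic attributes -- dozens of
        // columns are fine, hundreds are not.
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("behavior", Type.STRING)
                .nullable(true).build());
        columns.add(
            new ColumnSchema.ColumnSchemaBuilder("status", Type.STRING)
                .nullable(true).build());

        CreateTableOptions options = new CreateTableOptions()
            .addHashPartitions(Arrays.asList("profile_id"), 16);
        client.createTable("behaviors", new Schema(columns), options);
      } finally {
        client.close();
      }
    }
  }

Because the dictionary-encoded column stays cheap on disk, queries read
it directly and avoid the extra join and query-time complexity that the
normalized alternative would introduce.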

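Ben's point that the analysts will need a JDBC interface can be
sketched as well. Impala speaks the HiveServer2 protocol (port 21050 by
default), so a plain JDBC client with the Hive JDBC driver on the
classpath can query the same Kudu-backed table. The host name, database,
and query below are hypothetical:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class ImpalaJdbcExample {
    public static void main(String[] args) throws Exception {
      // Needed for older driver jars that don't self-register.
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      // 'impala-host' is a placeholder for a real impalad address;
      // auth=noSasl is for an unsecured cluster.
      String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
      try (Connection conn = DriverManager.getConnection(url);
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(
               "SELECT status, COUNT(*) FROM behaviors GROUP BY status")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }

This is the split the thread describes: the coders write against the
Kudu client API directly, while the analysts reach the same tables
through Impala over standard JDBC.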