Thanks, your notes about Hierarchy are very good and important 2016-12-12 2:49 GMT+01:00 ShaoFeng Shi <[email protected]>:
> " define "year" - "IDDate" as a hierarchy", which is the benefit ? > --> The combination "year" + "IDDate" has the same line number as the > combination "IDDate"; so aggregate from the former to the latter will not > aggregate much; Then we can prune the later with the "hierarchy" to reduce > the cube size; > > Nowadays, in derived columns, the Host column, is always the PK of table ? > --> Yes > > 2016-12-10 20:25 GMT+08:00 Alberto Ramón <[email protected]>: > >> thanks for you clear explanation !! >> >> >> The only point that I can't understand is >> " define "year" - "IDDate" as a hierarchy", which is the benefit ? >> [image: Imágenes integradas 1] >> Where: >> >> - >> >> IDData is PK of Dim table, Unique & Identity >> - Year is a Normal Dim --> I will have precalculated by years >> >> >> Nowadays, in derived columns, the Host column, is always the PK of table ? >> >> >> >> >> 2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <[email protected]>: >> >>> Hi Albert, I think you're raising a good question; Many users face such >>> questions when using Kylin in their cases. Let me try to share some my >>> cents. >>> >>> "Derived" or "Joint" ? >>> These are two independent means in Kylin (they're not conflict). Using >>> which depends on how these dimensions being used I think; >>> >>> Take the "IDDate" case you mentioned as an example; If most of you >>> queries are aggregated at the PK/FK level (which is date), and user just >>> want to in passing other fields like its "MonthTxt", "DayWeekTxt", defining >>> them as "derived" will be very good. >>> >>> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt" >>> level, defining them as "Derived" might not be good; Because Kylin need >>> translates the condition of "MonthTxt" into a set of PK values ("IDDate"), >>> and then query from Cube with these values, because the cube only >>> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your >>> dataset is small it still be acceptable) >>> >>> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not >>> suggested, because they are not a hierarchy relationship, but in parallel >>> here; ("March" is not a child of "2016", it appears in every year) >>> >>> "Joint" can be used in two typical cases: >>> 1) combine multiple ultra low cardinality dimensions >>> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or >>> close to 1:1 relationship (like "USER_ID", "USER_EMAIL") >>> >>> For case 1, I might design the cube in this way (assume you have the >>> need to group by year, month, dayweek): >>> 1) define all them as normal dimension >>> 2) define "year" - "IDDate" as a hierarchy >>> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1 >>> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1 >>> >>> >>> For case 2, I have the same suggestion as above. >>> >>> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <[email protected]>: >>> >>>> Typical case 1: >>>> >>>> *IDDate* >>>> >>>> *Month_ID* >>>> >>>> *Month_Txt* >>>> >>>> *DayWeek_ID* >>>> >>>> *DayWeek_Txt* >>>> >>>> *Year* >>>> >>>> 2016-03-01 >>>> >>>> 3 >>>> >>>> March >>>> >>>> 2 >>>> >>>> Wendesday >>>> >>>> 2016 >>>> >>>> 2016-03-02 >>>> >>>> 3 >>>> >>>> March >>>> >>>> 3 >>>> >>>> Thursday >>>> >>>> 2016 >>>> >>>> 2016-03-02 >>>> >>>> 3 >>>> >>>> March >>>> >>>> 4 >>>> >>>> Friday >>>> >>>> 2016 >>>> >>>> IDDate is PK of Dim table and Unique >>>> >>>> >>>> SOL 1: Uses Hierarchy and Derived from non PK column >>>> >>>> >>>> *Month_ID* >>>> >>>> Hierarchy 2 >>>> >>>> Normal 1 >>>> >>>> *Month_Txt* >>>> >>>> >>>> Derived 1 >>>> >>>> *DayWeek_ID* >>>> >>>> Hierarchy 3 >>>> >>>> Normal 2 >>>> >>>> *DayWeek_Txt* >>>> >>>> >>>> Derived 2 >>>> >>>> *Year* >>>> >>>> Hierarchy 1 >>>> >>>> Normal 3 >>>> >>>> Year > Month > Day >>>> >>>> Text are derived from ID (in month and Week) >>>> >>>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444> >>>> >>>> PB2: I don't know how create Derived column from non PK with actual UI >>>> (Kylin >>>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin >>>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3) >>>> >>>> >>>> >>>> SOL 2: >>>> >>>> *Month_ID* >>>> >>>> Hierarchy 2 >>>> >>>> Join 1 >>>> >>>> *Month_Txt* >>>> >>>> >>>> Join 1 >>>> >>>> *DayWeek_ID* >>>> >>>> Hierarchy 3 >>>> >>>> Join 2 >>>> >>>> *DayWeek_Txt* >>>> >>>> >>>> Join 2 >>>> >>>> *Year* >>>> >>>> Hierarchy 1 >>>> >>>> Normal 3 >>>> >>>> >>>> SOL 2 is this the best solution ?? >>>> >>>> >>>> >>>> Typical case 2: >>>> >>>> I see the same scenario a lot of times (derived columns with 1:1 >>>> Relation) >>>> >>>> Product_ID *(PK)* >>>> >>>> Product_TXT >>>> >>>> TypeProduct_ID >>>> >>>> TypeProduct_TXT >>>> >>>> Country_TXT >>>> >>>> Country_ID >>>> >>>> Optimize queries by product / category / country, are mandatory >>>> >>>> Perhaps, >>>> >>>> Country (lower cardinality) its a good candidate to Join >>>> >>>> I don't want put Product_TXT as Join, because is a longgggg text, and >>>> can be affect Row_Key of HBase, but I need Queries like ... where >>>> product_TXT = ""iRobot Roomba 650 Robotic Vacuum Cleaner >>>> >>>> suggestions ? >>>> >>> >>> >>> >>> -- >>> Best regards, >>> >>> Shaofeng Shi 史少锋 >>> >>> >> > > > -- > Best regards, > > Shaofeng Shi 史少锋 > >
