Re: Use derived or Joint

Alberto Ramón Mon, 12 Dec 2016 00:28:21 -0800

Thanks, your notes about Hierarchy are very good and important

2016-12-12 2:49 GMT+01:00 ShaoFeng Shi <[email protected]>:


> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
> --> The combination "year" + "IDDate" has the same line number as the
> combination "IDDate"; so aggregate from the former to the latter will not
> aggregate much; Then we can prune the later with the "hierarchy" to reduce
> the cube size;
>
> Nowadays, in derived columns, the Host column, is always the PK of table ?
> --> Yes
>
> 2016-12-10 20:25 GMT+08:00 Alberto Ramón <[email protected]>:
>
>> thanks for you clear explanation !!
>>
>>
>> The only point that I can't understand is
>> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
>>       [image: Imágenes integradas 1]
>> Where:
>>
>>    -
>>
>>    IDData is PK of Dim table, Unique & Identity
>>    - Year is a Normal Dim --> I will have precalculated by years
>>
>>
>> Nowadays, in derived columns, the Host column, is always the PK of table ?
>>
>>
>>
>>
>> 2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <[email protected]>:
>>
>>> Hi Albert, I think you're raising a good question; Many users face such
>>> questions when using Kylin in their cases. Let me try to share some my
>>> cents.
>>>
>>> "Derived" or "Joint" ?
>>> These are two independent means in Kylin (they're not conflict). Using
>>> which depends on how these dimensions being used I think;
>>>
>>> Take the "IDDate" case you mentioned as an example; If most of you
>>> queries are aggregated at the PK/FK level (which is date), and user just
>>> want to in passing other fields like its "MonthTxt", "DayWeekTxt", defining
>>> them as "derived" will be very good.
>>>
>>> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt"
>>> level, defining them as "Derived" might not be good; Because Kylin need
>>> translates the condition of "MonthTxt" into a set of PK values ("IDDate"),
>>> and then query from Cube with these values, because the cube only
>>> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your
>>> dataset is small it still be acceptable)
>>>
>>> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
>>> suggested, because they are not a hierarchy relationship, but in parallel
>>> here;  ("March" is not a child of "2016", it appears in every year)
>>>
>>> "Joint" can be used in two typical cases:
>>> 1) combine multiple ultra low cardinality dimensions
>>> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or
>>> close to 1:1 relationship (like "USER_ID", "USER_EMAIL")
>>>
>>> For case 1, I might design the cube in this way (assume you have the
>>> need to group by year, month, dayweek):
>>> 1) define all them as normal dimension
>>> 2) define "year" - "IDDate" as a hierarchy
>>> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
>>> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1
>>>
>>>
>>> For case 2, I have the same suggestion as above.
>>>
>>> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <[email protected]>:
>>>
>>>> Typical case 1:
>>>>
>>>> *IDDate*
>>>>
>>>> *Month_ID*
>>>>
>>>> *Month_Txt*
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>> *Year*
>>>>
>>>> 2016-03-01
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 2
>>>>
>>>> Wendesday
>>>>
>>>> 2016
>>>>
>>>> 2016-03-02
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 3
>>>>
>>>> Thursday
>>>>
>>>> 2016
>>>>
>>>> 2016-03-02
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 4
>>>>
>>>> Friday
>>>>
>>>> 2016
>>>>
>>>> IDDate is PK of Dim table and Unique
>>>>
>>>>
>>>> SOL 1: Uses Hierarchy and Derived from non PK column
>>>>
>>>>
>>>> *Month_ID*
>>>>
>>>> Hierarchy 2
>>>>
>>>> Normal 1
>>>>
>>>> *Month_Txt*
>>>>
>>>>
>>>> Derived 1
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> Hierarchy 3
>>>>
>>>> Normal 2
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>>
>>>> Derived 2
>>>>
>>>> *Year*
>>>>
>>>> Hierarchy 1
>>>>
>>>> Normal 3
>>>>
>>>> Year > Month > Day
>>>>
>>>> Text are derived from ID (in month and Week)
>>>>
>>>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>>>>
>>>> PB2: I don't know how create Derived column from non PK with actual UI 
>>>> (Kylin
>>>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
>>>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>>>>
>>>>
>>>>
>>>> SOL 2:
>>>>
>>>> *Month_ID*
>>>>
>>>> Hierarchy 2
>>>>
>>>> Join 1
>>>>
>>>> *Month_Txt*
>>>>
>>>>
>>>> Join 1
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> Hierarchy 3
>>>>
>>>> Join 2
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>>
>>>> Join 2
>>>>
>>>> *Year*
>>>>
>>>> Hierarchy 1
>>>>
>>>> Normal 3
>>>>
>>>>
>>>> SOL 2 is this the best solution ??
>>>>
>>>>
>>>>
>>>> Typical case 2:
>>>>
>>>> I see the same scenario a lot of times (derived columns with 1:1
>>>> Relation)
>>>>
>>>> Product_ID *(PK)*
>>>>
>>>> Product_TXT
>>>>
>>>> TypeProduct_ID
>>>>
>>>> TypeProduct_TXT
>>>>
>>>> Country_TXT
>>>>
>>>> Country_ID
>>>>
>>>> Optimize queries by product / category / country, are mandatory
>>>>
>>>> Perhaps,
>>>>
>>>> Country (lower cardinality) its a good candidate to Join
>>>>
>>>> I don't want put Product_TXT as Join, because is a longgggg text, and
>>>> can be affect Row_Key of HBase, but I need Queries like ... where
>>>> product_TXT = ""iRobot Roomba 650 Robotic Vacuum Cleaner
>>>>
>>>> suggestions ?
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

Re: Use derived or Joint

Reply via email to