Could this be related? KYLIN-2779
<https://issues.apache.org/jira/browse/KYLIN-2779>. This JIRA makes a lot of
sense.

On 24 January 2018 at 13:43, ShaoFeng Shi <shaofeng...@apache.org> wrote:

> Hi Qilong,
>
> If segment A's estimated size is 10 GB but its real size is 5 GB, then
> when merging or building another segment we can adjust the estimated size
> by dividing it by 2. The result should then be closer to the real size.
>
> 2018-01-24 9:49 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>
>> Many thanks, Shaofeng! We'll look further into these parameters to see how
>> to make it better.
>>
>> BTW, what do you mean by the last line? That is, how can I introduce the
>> actual size to help Kylin adjust the estimation? Currently I can only set
>> the max-regions parameter manually, which is not convenient for
>> auto-merging.
>>
>> Qilong
>>
>> From: ShaoFeng Shi <shaofeng...@apache.org>
>> Reply-To: "user@kylin.apache.org" <user@kylin.apache.org>
>> Date: Tuesday, 23 January 2018, 21:49
>> To: user <user@kylin.apache.org>
>> Cc: 林豪 (linhao), Technology Product Center <lin...@qiyi.com>
>> Subject: Re: segment size estimate when merging
>>
>> Hi Qilong,
>>
>> Does your cube have a count-distinct or Top-N measure?
>>
>> If you observe that there are too many HBase regions, or that the regions
>> are too small, you can adjust some parameters:
>>
>> kylin.cube.size-estimate-ratio=0.25
>> kylin.cube.size-estimate-countdistinct-ratio=0.05
>>
>> The default ratio for the common case is 0.25; you can set it to a smaller
>> value if the estimated size is bigger than the actual size. These two
>> parameters can be set at the cube level.
>>
>> A better way is, when doing a merge, to use the actual size of the
>> existing segments to adjust the estimated size and so get a closer result.
>>
>> 2018-01-23 14:47 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>>
>>> Hi shaofeng,
>>>
>>> Yes, it's usually smaller than the sum of the segments, but the
>>> difference is usually small compared with the total size.
>>>
>>> The statistics-based estimate, however, is usually N times larger than
>>> the actual size, which wastes a large number of HBase regions.
>>>
>>>
>>>    1. Do you have any data on how far the two approaches deviate from
>>>    the actual size? That is, which one is generally closer?
>>>    2. Is there any plan to improve this on the roadmap? Or any
>>>    consideration of giving users more options to select their own
>>>    estimation algorithm?
>>>
>>>
>>> Thanks
>>>
>>> Qilong
>>>
>>> From: ShaoFeng Shi <shaofeng...@apache.org>
>>> Reply-To: "user@kylin.apache.org" <user@kylin.apache.org>
>>> Date: Tuesday, 23 January 2018, 09:43
>>> To: user <user@kylin.apache.org>
>>> Cc: 林豪 (linhao), Technology Product Center <lin...@qiyi.com>
>>> Subject: Re: segment size estimate when merging
>>>
>>> Hi Qilong,
>>>
>>> When merging segments, the dimension-measure values (k-v) are
>>> reorganized and entries with the same key are merged, so the merged size
>>> is not simply the sum of the segments; usually, it is smaller.
>>>
>>> Always using the statistics to estimate the size is for consistency. Of
>>> course, there is room to improve the estimation accuracy.
>>>
>>>
>>>
>>> 2018-01-22 16:54 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>>>
>>>>
>>>> Hi,
>>>>
>>>> We have some questions about the segment size estimate when merging
>>>> multiple segments.
>>>>
>>>> We find that the segment merge job still uses
>>>> CubeStatsReader::getCuboidSizeMap to estimate the total size of the
>>>> merged segment. From our understanding, estimating the total size this
>>>> way is fine when building a new segment, since there is no other
>>>> information to turn to. But when merging, we could sum the table sizes of
>>>> the segments to be merged, which should be more accurate.
>>>>
>>>> So what is the reason for this design?
>>>>
>>>>
>>>>
>>>> Su Qilong
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
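
As a rough illustration of the adjustment described in the quoted reply above (scaling a new estimate by the actual-to-estimated ratio observed on existing segments), here is a minimal Python sketch. It is not Kylin code: the function names and the idea of passing segment sizes in as plain lists are assumptions made purely for illustration.

```python
GB = 1024 ** 3


def correction_factor(estimated_bytes, actual_bytes):
    """Return actual/estimated size over already-built segments.

    Both arguments are hypothetical lists of per-segment sizes in bytes.
    Falls back to 1.0 (no correction) when there is nothing to compare.
    """
    total_estimated = sum(estimated_bytes)
    total_actual = sum(actual_bytes)
    if total_estimated <= 0:
        return 1.0
    return total_actual / total_estimated


def adjusted_estimate(raw_estimate, estimated_bytes, actual_bytes):
    """Scale a statistics-based estimate by the observed correction factor."""
    return raw_estimate * correction_factor(estimated_bytes, actual_bytes)


if __name__ == "__main__":
    # Segment A from the thread: estimated 10 GB, real size 5 GB.
    factor = correction_factor([10 * GB], [5 * GB])
    print(factor)
    # A raw 20 GB estimate for the next build or merge, after adjustment.
    print(adjusted_estimate(20 * GB, [10 * GB], [5 * GB]) / GB)
```

With the 10 GB estimated / 5 GB actual example from the thread, the factor is 0.5, so a raw 20 GB estimate would be adjusted to 10 GB. If one assumes the estimate scales linearly with kylin.cube.size-estimate-ratio (an assumption, not confirmed in this thread), the same factor could also serve as a starting point for tuning that ratio at the cube level.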
