Many thanks, Shaofeng! We’ll look further into these parameters to see how to 
improve things.

BTW, what do you mean by the last line? That is, how can I feed in the actual 
size to help Kylin adjust the estimate? Currently I can only set the 
max-regions parameter manually, which is not convenient for auto-merging.

Qilong

From: ShaoFeng Shi <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, January 23, 2018, 21:49
To: user <[email protected]>
Cc: 林豪 (linhao), Technology Product Center <[email protected]>
Subject: Re: segment size estimate when merging

Hi Qilong,

Does your cube have a count-distinct or Top-N measure?

If you observe that the HBase regions are too many or too small, you can 
adjust these parameters:

kylin.cube.size-estimate-ratio=0.25
kylin.cube.size-estimate-countdistinct-ratio=0.05

The default ratio for the common case is 0.25; set it to a smaller value if 
the estimated size is larger than the actual size. Both parameters can be set 
at the cube level.
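
As a rough illustration of how the two ratios act on the estimate, here is a 
simplified model. It is not Kylin's actual estimation code: the function 
names, the per-row byte counts, and the region cut size are all assumptions 
made up for the example. It only shows that halving both ratios halves the 
estimate, which in turn lowers the number of regions derived from it.

```python
import math


def estimate_size_mb(rows, normal_bytes_per_row, countdistinct_bytes_per_row,
                     ratio=0.25, countdistinct_ratio=0.05):
    """Simplified model: raw byte counts scaled by the two ratios, in MB.

    `ratio` stands in for kylin.cube.size-estimate-ratio and
    `countdistinct_ratio` for kylin.cube.size-estimate-countdistinct-ratio.
    """
    scaled_bytes = rows * (normal_bytes_per_row * ratio
                           + countdistinct_bytes_per_row * countdistinct_ratio)
    return scaled_bytes / (1024 * 1024)


def region_count(size_mb, region_cut_mb):
    """Hypothetical region planning: one region per `region_cut_mb` of data."""
    return max(1, math.ceil(size_mb / region_cut_mb))


# Made-up cube: 100M rows, 40 bytes of normal columns and 400 bytes of
# count-distinct columns per row.
default_mb = estimate_size_mb(100_000_000, 40, 400)
tuned_mb = estimate_size_mb(100_000_000, 40, 400,
                            ratio=0.125, countdistinct_ratio=0.025)

print(region_count(default_mb, 1024), region_count(tuned_mb, 1024))
```

If the estimate is consistently N times the actual size, dividing both ratios 
by roughly N is a reasonable starting point in this model.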

A better way, when merging, is to use the actual size of the existing segments 
to adjust the estimated size and get a closer result.
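
A minimal sketch of that idea, assuming each segment to be merged has both its 
original statistics-based estimate and its actual stored size available. The 
function name and the input shape are hypothetical; Kylin does not expose such 
an API as far as this thread shows.

```python
def calibrated_merge_estimate(segments, merged_estimate_mb):
    """Scale the statistics-based estimate for the merged segment.

    segments: list of (estimated_mb, actual_mb) pairs, one per existing
              segment that takes part in the merge (hypothetical input).
    merged_estimate_mb: statistics-based estimate for the merged segment.
    """
    total_estimated = sum(est for est, _ in segments)
    total_actual = sum(act for _, act in segments)
    if total_estimated <= 0:
        # Nothing to calibrate against: fall back to the raw estimate.
        return merged_estimate_mb
    # How far off the estimates were for the segments we can measure.
    correction = total_actual / total_estimated
    return merged_estimate_mb * correction


# Two segments whose estimates were 4x their actual size.
history = [(400.0, 100.0), (600.0, 150.0)]
print(calibrated_merge_estimate(history, 900.0))
```

The correction factor is computed from totals rather than per-segment ratios, 
so large segments weigh more than small ones.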

2018-01-23 14:47 GMT+08:00 苏启龙 <[email protected]>:
Hi shaofeng,

Yes, it’s usually smaller than the sum of the segments, but usually only by a 
small amount compared with the total size.

The statistics-based estimate, however, usually comes out N times larger than 
the actual size, which wastes a huge number of HBase regions.


  1.  Do you have any data on the deviation of the two approaches? I mean, 
which one is generally closer?
  2.  Is there any plan on the roadmap to improve this? Or any consideration 
of giving users more options to select their own estimation algorithm?

Thanks

Qilong

From: ShaoFeng Shi <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, January 23, 2018, 09:43
To: user <[email protected]>
Cc: 林豪 (linhao), Technology Product Center <[email protected]>
Subject: Re: segment size estimate when merging

Hi Qilong,

When merging segments, the dimension-measure values (k-v) are reorganized and 
rows with the same key are merged, so the merged size is not simply the sum of 
the segments; usually, it is smaller.
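
A toy illustration of why this happens, with each segment modelled as a plain 
dict from a dimension key to a SUM measure. This is a made-up model, not 
Kylin's storage format: rows that share a key collapse into one row, so the 
merged row count can be lower than the sum of the inputs.

```python
def merge_segments(*segments):
    """Merge toy segments: rows with the same key are aggregated into one."""
    merged = {}
    for segment in segments:
        for key, value in segment.items():
            merged[key] = merged.get(key, 0) + value
    return merged


# Keys are (country, device) dimension values; values are a SUM measure.
seg_a = {("CN", "mobile"): 10, ("US", "mobile"): 4}
seg_b = {("CN", "mobile"): 7, ("CN", "desktop"): 3}

merged = merge_segments(seg_a, seg_b)
rows_before = len(seg_a) + len(seg_b)
rows_after = len(merged)
print(rows_before, rows_after)
```

How much is saved depends on how many keys the segments share, which is why 
the reduction can be small when the segments mostly cover different keys.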

Always using the statistics to estimate the size is for consistency. Of course, 
there is room to improve the estimation accuracy.



2018-01-22 16:54 GMT+08:00 苏启龙 <[email protected]>:

Hi,

We have some questions about the segment size estimate when merging multiple 
segments.

We find that the segment merge job still uses CubeStatsReader::getCuboidSizeMap 
to estimate the total size of the merged segment. From our understanding, 
estimating the total size this way is fine when building a new segment, since 
there is no other information to turn to. But when merging, we could sum the 
table sizes of the segments to be merged, which should be more accurate.

So what is the reason for this design?



Su Qilong



--
Best regards,

Shaofeng Shi 史少锋




