Could this be related? KYLIN-2779 <https://issues.apache.org/jira/browse/KYLIN-2779>. That JIRA makes a lot of sense.

On 24 January 2018 at 13:43, ShaoFeng Shi <shaofeng...@apache.org> wrote:

> Hi Qilong,
>
> If segment A's estimated size is 10 GB but its real size is 5 GB, then
> when merging or building another segment we can adjust the estimated
> size by dividing it by 2. The result should then be closer to the real
> size.
>
> 2018-01-24 9:49 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>
>> Many thanks, Shaofeng! We'll check these parameters further to see how
>> to improve things.
>>
>> BTW, what do you mean by the last line? That is, how can I feed the
>> actual size to Kylin so it can adjust the estimation? Currently I can
>> only set the max-regions parameter manually, which is not convenient
>> for auto-merging.
>>
>> Qilong
>>
>> From: ShaoFeng Shi <shaofeng...@apache.org>
>> Reply-To: "user@kylin.apache.org" <user@kylin.apache.org>
>> Date: Tuesday, 23 January 2018, 21:49
>> To: user <user@kylin.apache.org>
>> Cc: 林豪 (linhao), Technology Product Center <lin...@qiyi.com>
>> Subject: Re: segment size estimate when merging
>>
>> Hi Qilong,
>>
>> Does your cube have a count-distinct or Top-N measure?
>>
>> If you observe that there are too many HBase regions, or that they are
>> too small, you can adjust these parameters:
>>
>> kylin.cube.size-estimate-ratio=0.25
>> kylin.cube.size-estimate-countdistinct-ratio=0.05
>>
>> The default ratio for the common case is 0.25; you can set it smaller
>> if the estimated size is bigger than the actual size. Both parameters
>> can be set at the cube level.
>>
>> A better way would be, when merging, to use the actual size of the
>> existing segments to adjust the estimated size and get a closer result.
>>
>> 2018-01-23 14:47 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>>
>>> Hi Shaofeng,
>>>
>>> Yes, it is usually smaller than the sum of the segments, but only by a
>>> small amount compared with the total size.
>>>
>>> The statistics-based estimate, however, usually comes out N times
>>> larger than the actual size, which wastes a huge number of HBase
>>> regions.
>>>
>>> 1. Do you have any data on the deviation of the two approaches? I
>>>    mean, in general, which one is closer?
>>> 2. Is there any plan on the roadmap to improve this? Or any
>>>    consideration of giving users more options to select their own
>>>    estimation algorithm?
>>>
>>> Thanks
>>>
>>> Qilong
>>>
>>> From: ShaoFeng Shi <shaofeng...@apache.org>
>>> Reply-To: "user@kylin.apache.org" <user@kylin.apache.org>
>>> Date: Tuesday, 23 January 2018, 09:43
>>> To: user <user@kylin.apache.org>
>>> Cc: 林豪 (linhao), Technology Product Center <lin...@qiyi.com>
>>> Subject: Re: segment size estimate when merging
>>>
>>> Hi Qilong,
>>>
>>> When merging segments, the dimension-measure values (k-v) are
>>> reorganized and identical keys are merged, so the merged size is not
>>> simply the sum of the segments; usually it is smaller.
>>>
>>> The statistics are always used to estimate the size for consistency.
>>> Of course, there is room to improve the estimation accuracy.
>>>
>>> 2018-01-22 16:54 GMT+08:00 苏启龙 <suqil...@qiyi.com>:
>>>
>>>> Hi,
>>>>
>>>> We are unclear on some points about the segment size estimate when
>>>> merging multiple segments.
>>>>
>>>> We find that the segment merge job still uses
>>>> CubeStatsReader::getCuboidSizeMap to estimate the total size of the
>>>> merged segment. As we understand it, using this method when building
>>>> a new segment is fine, since there is no other information to turn
>>>> to. But when merging, we could sum the table sizes of the segments
>>>> being merged, which should be more accurate.
>>>>
>>>> So what is the reason for this choice?
>>>>
>>>> Su Qilong
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
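
For what it's worth, the adjustment ShaoFeng describes in the quoted thread (a 10 GB estimate against a 5 GB real size, so later estimates get divided by 2) could be sketched roughly as below. This is only an illustration of the arithmetic, not Kylin code: `SegmentSize`, `correction_ratio` and `adjusted_estimate` are hypothetical names, and falling back to a ratio of 1.0 when there is no history is my own assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SegmentSize:
    """Sizes of one already-built segment (hypothetical record, not a Kylin class)."""
    estimated_gb: float  # size predicted from the cuboid statistics
    actual_gb: float     # size actually observed in storage after the build


def correction_ratio(history: Iterable[SegmentSize]) -> float:
    """Return total actual / total estimated over the given segments.

    Falls back to 1.0 (no correction) when there is no usable history;
    that fallback is an assumption of this sketch.
    """
    segments = list(history)
    total_estimated = sum(s.estimated_gb for s in segments)
    total_actual = sum(s.actual_gb for s in segments)
    if total_estimated <= 0:
        return 1.0
    return total_actual / total_estimated


def adjusted_estimate(raw_estimate_gb: float, history: Iterable[SegmentSize]) -> float:
    """Scale a new statistics-based estimate by the observed correction ratio."""
    return raw_estimate_gb * correction_ratio(history)


if __name__ == "__main__":
    # Segment A from the thread: estimated 10 GB, real size 5 GB.
    history = [SegmentSize(estimated_gb=10.0, actual_gb=5.0)]
    # A new 20 GB raw estimate is halved by the observed ratio.
    print(adjusted_estimate(20.0, history))
```

With several existing segments, the ratio is taken over the summed sizes so that large segments weigh more than small ones; a per-segment average would be another reasonable choice.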