Hi Qilong,

If segment A's estimated size is 10 GB but its real size is 5 GB, then when merging or building another segment we can adjust the estimated size by dividing it by 2. The result should then be closer to the real size.
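To make the idea concrete, here is a minimal sketch in Python. It is only an illustration of the arithmetic, not Kylin code: the function names and the `history` list of (estimated, actual) pairs are hypothetical, and it assumes the actual sizes of already-built segments are available.

```python
GB = 1024 ** 3


def correction_factor(estimated_bytes, actual_bytes):
    """Ratio of actual to estimated size observed on existing segments."""
    if estimated_bytes <= 0:
        # No usable history: leave the estimate unchanged.
        return 1.0
    return actual_bytes / estimated_bytes


def adjusted_estimate(new_estimate_bytes, history):
    """Scale a new statistics-based estimate by the observed ratio.

    history is a list of (estimated_bytes, actual_bytes) pairs for
    segments that were already built (hypothetical input).
    """
    total_estimated = sum(est for est, _ in history)
    total_actual = sum(act for _, act in history)
    return new_estimate_bytes * correction_factor(total_estimated, total_actual)


# Segment A was estimated at 10 GB but is really 5 GB, so a new
# 20 GB estimate is scaled by 0.5.
history = [(10 * GB, 5 * GB)]
print(adjusted_estimate(20 * GB, history) / GB)  # 10.0
```

Summing over all existing segments, rather than using a single one, is just one possible choice; it weights larger segments more heavily.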
2018-01-24 9:49 GMT+08:00 苏启龙 <[email protected]>:

> Many thanks Shaofeng! We'll check more on these parameters to see how to
> make it better.
>
> BTW, what do you mean by the last line? I mean, in which way can I
> introduce the actual size to help Kylin adjust the estimation? Currently
> I can only set the max-regions parameter manually, but this is not
> convenient for auto-merging.
>
> Qilong
>
> From: ShaoFeng Shi <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, January 23, 2018, 21:49
> To: user <[email protected]>
> Cc: 林豪 (linhao), Technology Product Center <[email protected]>
> Subject: Re: segment size estimate when merging
>
> Hi Qilong,
>
> Does your cube have a count-distinct or Top-N measure?
>
> If you observe that there are too many HBase regions, or that they are
> too small, you can adjust some parameters:
>
> kylin.cube.size-estimate-ratio=0.25
> kylin.cube.size-estimate-countdistinct-ratio=0.05
>
> The default ratio for the common case is 0.25; you can set it smaller if
> the estimated size is bigger than the actual size. These two parameters
> can be set at the Cube level.
>
> A better way is, when doing a merge, to use the actual size of the
> existing segments to adjust the estimated size and so get a closer result.
>
> 2018-01-23 14:47 GMT+08:00 苏启龙 <[email protected]>:
>
>> Hi Shaofeng,
>>
>> Yes, it's usually smaller than the sum of the segments, but usually only
>> by a small amount compared with the total size.
>>
>> But the statistics-based estimate is usually N times larger than the
>> actual size, which results in a huge waste of HBase regions.
>>
>> 1. Do you have any statistics on the deviation of the two approaches?
>>    I mean, which one is generally closer?
>> 2. Is there any plan to improve this on the roadmap? Or any
>>    consideration of giving users more options to select their own
>>    estimation algorithm?
>>
>> Thanks
>>
>> Qilong
>>
>> From: ShaoFeng Shi <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 23, 2018, 09:43
>> To: user <[email protected]>
>> Cc: 林豪 (linhao), Technology Product Center <[email protected]>
>> Subject: Re: segment size estimate when merging
>>
>> Hi Qilong,
>>
>> When merging segments, the dimension-measure values (k-v) are
>> reorganized and identical keys are merged, so the merged size is not
>> simply the sum of the segments; usually, it is smaller than before.
>>
>> Always using the statistics to estimate the size is for consistency. Of
>> course, there is room to improve the estimation accuracy.
>>
>> 2018-01-22 16:54 GMT+08:00 苏启龙 <[email protected]>:
>>
>>> Hi,
>>>
>>> We have some unclear points about the segment size estimate when
>>> merging multiple segments.
>>>
>>> We find that the segment merge job still uses
>>> CubeStatsReader::getCuboidSizeMap to estimate the total size of the
>>> merged segment. As we understand it, using this approach when building
>>> a new segment is fine, since there is no other information to turn to.
>>> But when merging, we could sum the table sizes of the segments to be
>>> merged, which should be more accurate.
>>>
>>> So what is the reason for this design?
>>>
>>> Su Qilong
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋

--
Best regards,

Shaofeng Shi 史少锋
