Hi dear Kylin users and development team,

There are a few things I would like to discuss with the community.

As a representative MOLAP engine, Kylin uses pre-aggregation to provide high-concurrency analysis with second-level response times, but it gives up some flexibility in exchange. One limitation is that existing segments must be purged before an additional measure can be added, which causes a lot of duplicate computation and unnecessary disk I/O. Such waste should be avoided, especially in a MOLAP engine.

For example, suppose there is a cube, cubeA, with one measure m1 and segments covering time range 1 (tr1). Now a user adds a measure m2 but does not want to clear the segments over tr1. The value of m2 will exist in tr2, the segments built afterwards. Of course, tr1 does not contain values of m2, which even users who know little about MOLAP can understand. A query over tr1 and tr2 is valid for both m1 and m2, but the result of m2 over tr1 will be null. It would be better to remind the user that the measure is missing. Moreover, refreshing the segments over tr1 will fill in m2 for them.

Currently, Kylin's storage engine is HBase. Measures are values aggregated over combinations of dimension members, and they are stored in a column of a column family in HBase. For the same cube, adding a new measure will add a column to the HBase table (mapping) and will take effect in the next build. For the existing HTables (segments), the new column is allowed to be missing. Refreshing an old segment will add a new column to its HTable to store the new measure. The value of the new measure is aggregated according to the combination of dimension members in the rowkey, without recalculating the existing measures.

For additional measures, and even additional dimensions, Kylin's current solution is Hybrid, but we found the following shortcomings while using it:

1. Management cost: similar cubes have to be maintained repeatedly, and most of them share many dimensions and measures.
If you want to apply optimizations such as pruning, you have to configure every one of these cubes.
2. A large number of cubes: in the early stage the business analysis is not stable, and analysts often need to add measures. Cubes keep being added to the Hybrid group, which produces a lot of cubes.
3. Repeated computation: if you want to drop the old cubes in the Hybrid group, you need to build the latest cube over the historical data so that it covers the old ones.

All of this results in a lot of waste.

In addition, while using Kylin I felt that the metadata about measures is incomplete:

1. Measures are one of the most important concerns of analysts. If the measures of the analysis system could be decoupled from the materialized view (cube) and have their own management system, it might be more flexible.
2. Once the dimensions have been chosen during cube design, the cuboids are determined regardless of the number of measures. It can be confusing to maintain cubes that have different measures but the same cuboids. Cubes with different cuboids should be considered different cubes, which is the definition of a cube, isn't it?

These are just some thoughts about MOLAP from my use of Kylin. What do you think? I look forward to your reply.

There may be mistakes or misunderstandings here; please feel free to correct me or discuss further if you find any.

Best regards,
yuzhang
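
P.S. To make the tr1/tr2 example concrete, here is a minimal Python sketch of the proposed behaviour. It is only a toy model built on in-memory dictionaries, not Kylin or HBase code: Segment, query_sum and refresh_with_measure are hypothetical names used for illustration. It also assumes that a query over several segments returns the partial aggregate together with the list of segments that lack the measure, which is just one possible way to remind the user.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

RowKey = Tuple[str, ...]  # one combination of dimension members


@dataclass
class Segment:
    """Toy stand-in for one segment (HTable); hypothetical, not a Kylin class."""
    name: str
    measures: List[str]  # measures materialized in this segment
    rows: Dict[RowKey, Dict[str, float]] = field(default_factory=dict)


def query_sum(segments: List[Segment], rowkey: RowKey,
              measure: str) -> Tuple[Optional[float], List[str]]:
    """Sum `measure` for `rowkey` and report the segments that lack it."""
    total: Optional[float] = None
    missing: List[str] = []
    for seg in segments:
        if measure not in seg.measures:
            missing.append(seg.name)  # the column is allowed to be absent
            continue
        value = seg.rows.get(rowkey, {}).get(measure)
        if value is not None:
            total = value if total is None else total + value
    return total, missing


def refresh_with_measure(seg: Segment, measure: str,
                         source: Dict[RowKey, float]) -> None:
    """Backfill only the new measure; existing measure values stay untouched."""
    for rowkey, value in source.items():
        seg.rows.setdefault(rowkey, {})[measure] = value
    if measure not in seg.measures:
        seg.measures.append(measure)


# tr1 was built before m2 existed; tr2 was built after m2 was added.
tr1 = Segment("tr1", ["m1"], {("cn",): {"m1": 10.0}})
tr2 = Segment("tr2", ["m1", "m2"], {("cn",): {"m1": 5.0, "m2": 2.0}})

m1_all = query_sum([tr1, tr2], ("cn",), "m1")    # m1 is complete everywhere
m2_all = query_sum([tr1, tr2], ("cn",), "m2")    # partial sum, tr1 reported
m2_tr1 = query_sum([tr1], ("cn",), "m2")         # null over tr1 alone

refresh_with_measure(tr1, "m2", {("cn",): 3.0})  # refresh supplies m2 to tr1
m2_after = query_sum([tr1, tr2], ("cn",), "m2")
```

In this sketch m2 over tr1 alone comes back as None with tr1 listed as missing, and after the refresh the same query covers both segments while m1 in tr1 is never recomputed.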
