1. The user_id is a unique string ID, but currently we can't get the user_id set from Kylin, right?
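The multi-site scenario described in the quoted messages below depends on merging sets of user IDs, not counts: per-site distinct counts cannot simply be added, because a user who appears at two sites would be counted twice. The following is a minimal sketch of that point only. It assumes each site can somehow export its distinct user IDs (which, per this thread, Kylin does not currently return); the function name `merge_distinct_users` and the sample IDs are hypothetical.

```python
# Sketch: why per-site distinct counts cannot be summed, while per-site
# ID sets can be merged. All data below is made up for illustration.

def merge_distinct_users(site_user_sets):
    """Union the per-site user ID sets and return the merged set."""
    merged = set()
    for users in site_user_sets.values():
        merged |= set(users)
    return merged


# Hypothetical exports, one set of string user IDs per data node.
site_user_sets = {
    "us": {"u1", "u2", "u3"},
    "shenzhen-china": {"u2", "u4"},
    "hongkong-china": {"u4", "u5"},
    "eu": {"u1", "u6"},
}

# Adding the per-site counts double-counts users seen at more than one site.
sum_of_counts = sum(len(users) for users in site_user_sets.values())

# Merging the sets first removes the duplicates.
merged = merge_distinct_users(site_user_sets)

print("sum of per-site counts:", sum_of_counts)
print("distinct users after merge:", len(merged))
```

With string IDs the merged structure has to hold the IDs themselves (or values from a dictionary shared by all sites), which is why the result set size matters for this approach.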
On 2017-12-07 09:57:31, ShaoFeng Shi <[email protected]> wrote:

Hi Miao,

For 1, Kylin focuses on OLAP scenarios, so most queries are aggregate queries rather than detail queries. But your scenario is a case that a bitmap can fit; if the result set isn't big, it is doable. It would only need to decode the bitmap values (if the user ID is of an integer type, there is no need to decode with the dictionary). This is something like the TopN measure.

For 2, yes, the global dictionary will grow as the number of users grows.

For 3, if you use Kylin 2.1, the cube data, as well as the metadata, will all be on the HBase cluster. Before Kylin 2.1, there was an issue that caused some metadata files to be left on the Hive cluster. Whatever the deployment topology, we suggest you back up the metadata periodically to minimize the possibility of data loss.

2017-12-06 9:45 GMT+08:00 崔苗 <[email protected]>:

1. We have four data nodes: us, shenzhen-china, hongkong-china and eu. Every data node has one MySQL database. We want to deploy four Kylin clusters to analyse the data and merge the results to get the final result, so we need the distinct user set from every data node and must merge them to get rid of duplicated users. It seems this is not a good scenario for Kylin.

2. If we want to get the count distinct on a string column such as user ID, which is a high-cardinality column, how do we estimate the memory that the global dict needs? Will Kylin expand the global dict and the bitmap of users if the users increase every day?

3. If we deploy Kylin with a standalone HBase cluster, will all the result data, such as the dict and bitmap, be stored in the HBase cluster? If so, we don't need to set HA mode on the other Hadoop cluster (Hive + Spark), because data loss in that cluster will not damage the result; we just need to ensure high availability on the HBase cluster?

On 2017-12-06 08:41:13, ShaoFeng Shi <[email protected]> wrote:

Hi Miao,

1. Currently, Kylin only returns the count in the bitmap, not the IDs in it; it should be possible to extend this.
Could you please describe your scenario?

2. Yes, the Cube API will return each segment of the cube, and each segment has a start date and an end date. Please check Kylin's REST API documentation.

2017-12-05 18:31 GMT+08:00 崔苗 <[email protected]>:

1. If there is a bitmap stored in HBase, can we get the distinct user set if we need to know all the distinct users?

2. Is there any RESTful API that could get the cube's date_time, date_range_start and date_range_end?

On 2017-11-30 18:30:27, ShaoFeng Shi <[email protected]> wrote:

Hi Miao,

Kylin uses HyperLogLog or Bitmap to persist the distinct values. You can get some info from this blog: https://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/

2017-11-30 9:25 GMT+08:00 崔苗 <[email protected]>:

Hi,

We want to get count(distinct user) grouped by hour/day/week/month/year. Now we have a problem: what is the content of count(distinct user) that Kylin keeps, the distinct user set or just a count? If we want to count (distinct user) by year, do we need to keep data for a year in Hive?

--
Best regards,
Shaofeng Shi 史少锋
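For the segment question in this thread, here is a minimal sketch of reading each segment's start and end date out of a cube description. It deliberately makes no HTTP call. The field names `date_range_start` and `date_range_end` are taken from the question above; the overall payload shape (a `segments` list with epoch-millisecond values), the cube name `user_cube`, and the helper `segment_ranges` are assumptions for illustration, so check Kylin's REST API documentation for the actual response format.

```python
import json
from datetime import datetime, timezone


def segment_ranges(cube_json):
    """Return (segment name, start, end) tuples from a cube JSON string.

    Assumes each segment carries date_range_start / date_range_end as
    epoch milliseconds; this shape is an assumption, not a confirmed schema.
    """
    cube = json.loads(cube_json)
    ranges = []
    for seg in cube.get("segments", []):
        start = datetime.fromtimestamp(seg["date_range_start"] / 1000, tz=timezone.utc)
        end = datetime.fromtimestamp(seg["date_range_end"] / 1000, tz=timezone.utc)
        ranges.append((seg.get("name"), start, end))
    return ranges


# Hypothetical payload with a single one-day segment.
sample = json.dumps({
    "name": "user_cube",
    "segments": [
        {
            "name": "seg_a",
            "date_range_start": 1512086400000,
            "date_range_end": 1512172800000,
        },
    ],
})

ranges = segment_ranges(sample)
for name, start, end in ranges:
    print(name, start.isoformat(), end.isoformat())
```

In practice the JSON string would come from whichever Cube API endpoint the REST documentation specifies, fetched with any HTTP client.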
