A -I send you a bug: (sorry) Try to use *JOINT* DIM with very low cardinality columns, perhaps: TypeID, NetworkID, LanguajeID, IsMovileDevice.
B- you put "DATESTATS" as mandatory, then I Imagine you put 1º position of RowKey: Be careful with this. Because you are creating Hot RegionsServer and Cold RS 2017-01-17 9:26 GMT+01:00 Alberto Ramón <[email protected]>: > Did you compressed the output cube? This is very important (see last link) > > About Order BY > - Check if TopN can solve your problem: > http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/ > - Try to reorder RowKey to put OrderBY in first possitions > - Try AGG : Make a "sub-cube" with less Dim > http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/ > > 2017-01-17 7:50 GMT+01:00 Phong Pham <[email protected]>: > >> Hi Alberto, >> After try to apply your suggestion, our queríe is improved so much. >> Thanks a lot. >> However, we have problem with ORDER BY function. When we use ORDER BY >> with a large data set (for example: with long date-range filter), >> performance is very slow. >> Result: >> *User: ADMIN* >> *Success: true* >> *Duration: 23.311* >> *Project: metrixa_global_database_new* >> *Realization Names: [account_global_convtrack_summary_daily_by_location]* >> *Cuboid Ids: [135]* >> *Total scan count: 2595584* >> *Result row count: 250* >> *Accept Partial: true* >> *Is Partial Result: false* >> *Hit Exception Cache: false* >> *Storage cache used: false* >> *Message: null* >> >> ORDER BY performance goes down when Total Scan Count is big. So how can i >> improve this problem? >> Thanks >> >> >> 2017-01-16 18:45 GMT+07:00 Alberto Ramón <[email protected]>: >> >>> Hi Phon, I'm not expert but I have some suggestions: >>> >>> - All Dim en are using Dict: you can change a lot to Integer (Fix >>> length) >>> - Re-Order row key its a good idea. I always try to first fields of key >>> have Fix Length. Put mandatory the First its a good Idea >>> - See hierarchy optimizations, will be very interesting for you: >>> Country, Region, City, site . Perhaps Company and Account also can be >>> included (I don't know your data) >>> - If you use Left join, the first step of building cube (flat table) >>> will be more slow >>> - Check if your ORC input table is compressed >>> - Try to use derived DIm with very low cardinality columns, perhaps: >>> TypeID, NetworkID, LanguajeID, IsMovileDevice. >>> I understand that Affiliated, Account, Company, ... will growth in >>> the future, because you are working with test data ? >>> >>> Check this references: >>> http://kylin.apache.org/docs/howto/howto_optimize_cubes.html >>> http://mail-archives.apache.org/mod_mbox/kylin-user/201611.mbox >>> /%3Ctencent_F5A1E061EFFB778CC5BF9909%40qq.com%3E >>> http://mail-archives.apache.org/mod_mbox/kylin-user/201607.mbox >>> /%3C004201d1d4ef%240151b7e0%2403f527a0%24%40fishbowl.com%3E >>> http://mail-archives.apache.org/mod_mbox/kylin-user/201612.mbox >>> /%3CCAEcyM171RGhk0QoXJUjjZJeSxXwgUGu0vO%2B_T71KXMU1k00L%2Bg% >>> 40mail.gmail.com%3E >>> Check this tunning example: https://github.com/albertoRamon/Kylin >>> /tree/master/KylinPerformance >>> >>> BR, Alb >>> >>> >>> 2017-01-16 3:47 GMT+01:00 Phong Pham <[email protected]>: >>> >>>> Hi all, >>>> Hi all, >>>> * We still meet problems with query performance. Here is the cube >>>> info of one cube*: >>>> { >>>> "uuid": "6b2f4643-72a3-4a51-b9f2-47aa8e1322a5", >>>> "last_modified": 1484533219336, >>>> "version": "1.6.0", >>>> "name": "account_global_convtrack_summary_daily_test", >>>> "owner": "ADMIN", >>>> "descriptor": "account_global_convtrack_summary_daily_test", >>>> "cost": 50, >>>> "status": "READY", >>>> "segments": [ >>>> { >>>> "uuid": "85fa970e-6808-47c8-ae35-45d1975bb3bc", >>>> "name": "20160101000000_20161226000000", >>>> "storage_location_identifier": "KYLIN_7E4KIJ3YGX", >>>> "date_range_start": 1451606400000, >>>> "date_range_end": 1482710400000, >>>> "source_offset_start": 0, >>>> "source_offset_end": 0, >>>> "status": "READY", >>>> "size_kb": 9758001, >>>> "input_records": 8109122, >>>> "input_records_size": 102078756, >>>> "last_build_time": 1484533219335, >>>> "last_build_job_id": "a4f67403-17cb-4474-84d1-21ad64ed17a8", >>>> "create_time_utc": 1484527504660, >>>> "cuboid_shard_nums": {}, >>>> "total_shards": 4, >>>> "blackout_cuboids": [], >>>> "binary_signature": null, >>>> "dictionaries": { >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/CITYID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/CITYID/0015e15c-9336-4040-b8ad-b7afba71d51c.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/TYPE": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/TYPE/56cc3576-3c19-40fb-8704-29dba88e3511.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/NETWORKID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/NETWORKID/edc1b900-8b8a-4834-a8ab-4d23e0087d61.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/WEEKGROUP": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/WEEKGROUP/3c3ae7e2-05a0-49a3-b396-ded7b1faaebd.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/DATESTATSBIGINT": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/DATESTATSBIGINT/b2003335-f10c-48b5-ac98-6d2ddd >>>> 25854b.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/COUNTRYID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/COUNTRYID/233a3b35-9e0f-46e3-bb01-3330c907ab33.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/ACCOUNTID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/ACCOUNTID/612d8a57-8ed8-4fdd-bf99-c64fb2a583fe.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/DEVICEID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/DEVICEID/8813544c-aac3-4f26-849b-3e3d1b71d9e2.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/LANGUAGEID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/LANGUAGEID/02dea027-86cf-44e6-9bcf-9dbd4c33e54b.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/COMPANYID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/COMPANYID/75a5566e-b419-4fc8-9184-757b207a35d2.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/REGIONID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/REGIONID/81d5b463-8639-4633-83b9-9ac9e43e32cb.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/AFFILIATEID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/AFFILIATEID/0a35d5ce-dabb-4e32-ad5f-b87ef4c18ee3.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/SITEID": >>>> "/dict/MTX_SYSTEM.TBL_CONVTRACK_SITES_ORC/SITEID/07e4f091-f6 >>>> aa-4520-9069-416ee4c904de.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/MONTHGROUP": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/MONTHGROUP/e3bf45aa-3ff3-477b-aafd-d2c38a70caea.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/DATESTATS": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/DATESTATS/5a3d3dc6-90eb-493b-84d0-b1b8ca8b70ec.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/ISMOBILEDEVICE": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/ISMOBILEDEVICE/eba9f8db-c5f0-4283-8a77-5f72d75 >>>> c5867.dict", >>>> "METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMMARY_DAILY_ORC/SOURCEURLID": >>>> "/dict/METRIXA_GLOBAL_DATABASE.ACCOUNT_GLOBAL_CONVTRACK_SUMM >>>> ARY_DAILY_ORC/SOURCEURLID/3f90d0de-6d04-4bc6-af20-0030a91326f0.dict" >>>> }, >>>> "snapshots": { >>>> "MTX_SYSTEM.TBL_MCM_COUNTRY_CITY_ORC": "/table_snapshot/MTX_SYSTEM.TB >>>> L_MCM_COUNTRY_CITY_ORC/f32ec683-f83f-423a-820e-1bfd4b65426f.snapshot", >>>> "METRIXA_GLOBAL_DATABASE.GLOBAL_SOURCEURL_ORC": >>>> "/table_snapshot/METRIXA_GLOBAL_DATABASE.GLOBAL_SOURCEURL_OR >>>> C/32e8df3f-7188-4646-9eff-6c96792897f4.snapshot", >>>> "MTX_SYSTEM.TBL_MCM_COUNTRY_REGION_ORC": "/table_snapshot/MTX_SYSTEM.TB >>>> L_MCM_COUNTRY_REGION_ORC/e4378b9c-ff08-4207-92fa-3f0cf37f00d >>>> 5.snapshot", >>>> "MTX_SYSTEM.TBL_MCM_COUNTRY_ORC": "/table_snapshot/MTX_SYSTEM.TB >>>> L_MCM_COUNTRY_ORC/2f2ffb19-d675-43a2-bb08-66a83801f875.snapshot", >>>> "MTX_SYSTEM.GLOBAL_ACCOUNT_SEARCH_ENGINE_ORC": "/table_snapshot/ >>>> MTX_SYSTEM.GLOBAL_ACCOUNT_SEARCH_ENGINE_ORC >>>> /53ef6022-7249-4ef8-8518-b7d84c65fdfa.snapshot", >>>> "MTX_SYSTEM.TBL_CONVTRACK_SITES_ORC": "/table_snapshot/MTX_SYSTEM.TB >>>> L_CONVTRACK_SITES_ORC/0cbb0323-d434-44de-8891-85b024589743.snapshot", >>>> "MTX_SYSTEM.TBL_MCM_LANGUAGE_ORC": "/table_snapshot/MTX_SYSTEM.TB >>>> L_MCM_LANGUAGE_ORC/957e6a54-c618-4e5c-bc8d-c89952cafe1e.snapshot", >>>> "MTX_SYSTEM.TBL_CONVTRACK_AFFILIATES_ORC": >>>> "/table_snapshot/MTX_SYSTEM.TBL_CONVTRACK_AFFILIATES_ORC/f79 >>>> 4bce2-dcb1-41b0-b9bf-fe3c9e1ad661.snapshot" >>>> }, >>>> "index_path": "/kylin/kylin_metadata/kylin-a >>>> 4f67403-17cb-4474-84d1-21ad64ed17a8/account_global_convtrack >>>> _summary_daily_clone/secondary_index/", >>>> "rowkey_stats": [ >>>> [ >>>> "DATESTATS", >>>> 360, >>>> 2 >>>> ], >>>> [ >>>> "CITYID", >>>> 60804, >>>> 2 >>>> ], >>>> [ >>>> "SOURCEURLID", >>>> 38212, >>>> 2 >>>> ], >>>> [ >>>> "REGIONID", >>>> 2822, >>>> 2 >>>> ], >>>> [ >>>> "COUNTRYID", >>>> 238, >>>> 1 >>>> ], >>>> [ >>>> "LANGUAGEID", >>>> 173, >>>> 1 >>>> ], >>>> [ >>>> "AFFILIATEID", >>>> 36, >>>> 1 >>>> ], >>>> [ >>>> "ACCOUNTID", >>>> 62, >>>> 1 >>>> ], >>>> [ >>>> "COMPANYID", >>>> 19, >>>> 1 >>>> ], >>>> [ >>>> "SITEID", >>>> 103, >>>> 1 >>>> ], >>>> [ >>>> "WEEKGROUP", >>>> 52, >>>> 1 >>>> ], >>>> [ >>>> "MONTHGROUP", >>>> 12, >>>> 1 >>>> ], >>>> [ >>>> "TYPE", >>>> 2, >>>> 1 >>>> ], >>>> [ >>>> "ISMOBILEDEVICE", >>>> 2, >>>> 1 >>>> ], >>>> [ >>>> "DEVICEID", >>>> 338, >>>> 2 >>>> ], >>>> [ >>>> "NETWORKID", >>>> 161, >>>> 1 >>>> ], >>>> [ >>>> "DATESTATSBIGINT", >>>> 360, >>>> 2 >>>> ] >>>> ] >>>> } >>>> ], >>>> "create_time_utc": 1484286587541, >>>> "size_kb": 9758001, >>>> "input_records_count": 8109122, >>>> "input_records_size": 102078756 >>>> } >>>> *+ We have 2 colums that is high cardinality*: [ >>>> "CITYID", >>>> 60804, >>>> 2 >>>> ], >>>> [ >>>> "SOURCEURLID", >>>> 38212, >>>> 2 >>>> ], >>>> *+ We define left-join from model for all relations* >>>> *+ With new aggregation:* >>>> Includes >>>> ["SITEID","COMPANYID","SOURCEURLID","DATESTATS","WEEKGROUP", >>>> "MONTHGROUP","COUNTRYID","REGIONID","TYPE","ISMOBILEDEVICE", >>>> "LANGUAGEID","DEVICEID","NETWORKID","ACCOUNTID","AFFILIATEID","CITYID"] >>>> >>>> Mandatory Dimensions >>>> ["DATESTATS"]: Because we always use datestats as a filter >>>> >>>> Hierarchy Dimensions: None < Maybe wee will put CountryId, RegionId, >>>> and CityId >>>> Joint Dimensions >>>> ["LANGUAGEID","ACCOUNTID","AFFILIATEID","SITEID","CITYID","R >>>> EGIONID","COUNTRYID","SOURCEURLID"]: Please explain to me more about >>>> join dimensions? I don't understand fully about this theory. >>>> *+ Rowkeys:* >>>> We defined all rows is dict, because all of them are not ultra high >>>> cardinality >>>> >>>> The query that is very slow is that: >>>> + We get all dims and metrics, left join all dim tables and group by >>>> all dims >>>> + We set datetstats condition for 1 year >>>> >>>> And query often take a long time to executed: >10s >>>> >>>> So are there problems with our cube design? I would like to hear your >>>> reply soon. >>>> Thanks so much for your help. >>>> >>>> 2017-01-12 21:28 GMT+07:00 ShaoFeng Shi <[email protected]>: >>>> >>>>> Obviously there are too many segments (24*3=72), try to merge them as >>>>> Billy suggested. >>>>> >>>>> Secondly if possible try to review and optimize the cube design >>>>> (especially the rowkey sequence, put high-cardinality filter column to the >>>>> begin position to minimal the scan range), see >>>>> http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin >>>>> >>>>> Thirdly try to give more power to the cluster, e.g use physical >>>>> machines; and also use multiple kylin query nodes to balance the >>>>> concurrent >>>>> work load. >>>>> >>>>> Just some cents, hope it can help. >>>>> >>>>> 2017-01-12 22:16 GMT+08:00 Billy Liu <[email protected]>: >>>>> >>>>>> I have concerns with so many segments. Please try query only one cube >>>>>> with one segment first. >>>>>> >>>>>> 2017-01-12 13:36 GMT+08:00 Phong Pham <[email protected]>: >>>>>> >>>>>>> Hi, >>>>>>> Thank you so much for your help. I really appreciate it. Im really >>>>>>> impressed with your project and trying to apply it to our product. Our >>>>>>> live >>>>>>> product is still working on Mysql and MongoDb, but data is growing fast. >>>>>>> That's why we need your product for the database engine replacement. >>>>>>> About our problem with many queries on same time on Apache Kylin, >>>>>>> I'm trying to monitor some elements on our system and review cubes. So >>>>>>> are >>>>>>> there some tutorials about concurrency of Kylin or HBase? >>>>>>> I will give you more details abour our system: >>>>>>> Hardware: >>>>>>> 2 physical machines -> 7 vitural machines >>>>>>> Each vitural machine: >>>>>>> CPU: 8cores >>>>>>> RAM: 24GB >>>>>>> We are setup hadoop env with hortonwork 2.5 and setup HBase with 5 >>>>>>> RegionServer, 2 Hbase masters >>>>>>> Apahce Kylin we setup on 2 machines: >>>>>>> + Node 1: using for build cubes >>>>>>> + Node 2: using for only queries (this node also contain >>>>>>> RegionServer) >>>>>>> Cube and Queries: >>>>>>> + Size of Cubes: >>>>>>> - Cube 1: 20GB/14M rows - 24 segments (maybe we need to meger them >>>>>>> into 2-3 segments) >>>>>>> - Cube 2: 460MB/3M rows - 24 segments >>>>>>> - Cube 3: 1.3GB/1.4M rows - 24 segments >>>>>>> + We use one query to read data from 3 cubes and union all into 1 >>>>>>> result >>>>>>> Test case: >>>>>>> + On single request: 3s >>>>>>> + On 5 requests on same times: (submit multi-requests from client): >>>>>>> 20s/request >>>>>>> And that is not acceptable when we go live. >>>>>>> So hope you all review our struture and give us some best pratices >>>>>>> with Kylin And Hbase. >>>>>>> Thanks >>>>>>> >>>>>>> 2017-01-12 8:24 GMT+07:00 ShaoFeng Shi <[email protected]>: >>>>>>> >>>>>>>> In this case you need do some profiling to see what's the >>>>>>>> bottleneck: Kylin or HBase or other factors like CPU, memory or >>>>>>>> network; >>>>>>>> maybe it is related with the cube design, try to optimize the cube >>>>>>>> design >>>>>>>> with the executed query is also a way; It is hard to give you good >>>>>>>> answer >>>>>>>> with a couple words. >>>>>>>> >>>>>>>> 2017-01-11 19:50 GMT+08:00 Phong Pham <[email protected]>: >>>>>>>> >>>>>>>>> Heres about detail on our system: >>>>>>>>> >>>>>>>>> Hbase: 5 nodes >>>>>>>>> Data size: 24M rows >>>>>>>>> >>>>>>>>> Query result: >>>>>>>>> *Success: true* >>>>>>>>> *Duration: 20s* >>>>>>>>> *Project: metrixa_global_database* >>>>>>>>> *Realization Names: [xxx, xxx, xxx]* >>>>>>>>> *Cuboid Ids: [45971, 24]* >>>>>>>>> >>>>>>>>> >>>>>>>>> 2017-01-11 18:34 GMT+07:00 Phong Pham <[email protected]>: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> I have a problem with concurrency on Apache Kylin. Execute >>>>>>>>>> single query, it takes about 3s. Howerver,when i run multiple >>>>>>>>>> queries on >>>>>>>>>> the same time, each query take about 13-15s. So how can i solve >>>>>>>>>> problems? >>>>>>>>>> My Kylin Version is 1.6.1 >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Best regards, >>>>>>>> >>>>>>>> Shaofeng Shi 史少锋 >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Best regards, >>>>> >>>>> Shaofeng Shi 史少锋 >>>>> >>>>> >>>> >>> >> >
