Based on your query plan, the skip scan is being used only for your salt bucket, while the rest of the filtering is being done by a server-side filter; that means you're not filtering on the leading part of your primary key. We'll know more once you post your schema, but if NETWORK, KEYWORD_ID, and CUSTOMER_ID formed your primary key constraint, then the skip scan would work well.
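For example, a table along these lines would let the skip scan handle the filtering (a sketch only: the column types, the extra fact column, and the exact key order are assumptions until the actual schema is posted):

```sql
-- Hypothetical Phoenix DDL; adjust types and columns to the real schema.
CREATE TABLE PERF.BIG_OLAP_DOC (
    NETWORK      VARCHAR NOT NULL,
    KEYWORD_ID   BIGINT  NOT NULL,
    CUSTOMER_ID  INTEGER NOT NULL,
    IMPRESSIONS  BIGINT,   -- hypothetical non-key fact column
    CONSTRAINT pk PRIMARY KEY (NETWORK, KEYWORD_ID, CUSTOMER_ID)
) SALT_BUCKETS = 10;
```

With the filtered columns leading the row key like this, predicates such as NETWORK = 'SEARCH', CUSTOMER_ID = 545, and a KEYWORD_ID IN (...) list can be turned into key ranges for the skip scan instead of being applied as a server-side filter over every scanned row.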
Thanks,
James

On Mon, Feb 23, 2015 at 5:24 PM, James Taylor <[email protected]> wrote:

> Hi Gary,
> Would you mind posting your schema and query as well?
> Thanks,
> James
>
> On Mon, Feb 23, 2015 at 5:08 PM, Gary Schulte <[email protected]> wrote:
>
>> I am currently evaluating Phoenix 4.2.2 for an analytic workload.
>> Several of the test cases are simple aggregations over sets of varying
>> cardinality. For example, for one of the test aggregation queries I am
>> seeing response times along the lines of:
>>
>> fact query cardinality    response time using primary index
>> 1k                        0.3
>> 3k                        0.3
>> 5k                        0.4
>> 10k                       0.4
>> 15k                       0.7
>> 20k                       0.7
>> 50k                       2.2
>> 100k                      2.7
>> 200k                      5.3
>> 400k                      13.8
>> 800k                      20.1
>> 1.5 mil                   48.8
>>
>> where the plan looks like:
>>
>> +------------------------------------------+
>> |                   PLAN                   |
>> +------------------------------------------+
>> | CLIENT 81-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER PERF.BIG_OLAP_DOC [0,10724,3600,1] - [9,10724,3685,1] |
>> | SERVER FILTER BY (NETWORK = 'SEARCH' AND true AND CUSTOMER_ID = 545 AND KEYWORD_ID IN (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,61 |
>> | SERVER AGGREGATE INTO SINGLE ROW |
>> +------------------------------------------+
>>
>> The performance is great for low-cardinality sets, but as the cardinality
>> of the set to be aggregated grows, the response time appears to scale
>> essentially linearly with the size of the set being queried/aggregated.
>> This is a roughly 100-million-row, 10-bucket salted table on a 10-node
>> cluster.
>>
>> I had presumed the value lookup might be the major cost factor, but if I
>> remove all columns from the projection and criteria that are not part of
>> the key, I only see about a 20% reduction in response time:
>>
>> +------------------------------------------+
>> |                   PLAN                   |
>> +------------------------------------------+
>> | CLIENT 101-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER PERF.BIG_OLAP_DOC [0,10724,3000,1] - [9,10724,3685,1] |
>> | SERVER FILTER BY FIRST KEY ONLY AND (true AND KEYWORD_ID IN (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,613239663,613239668,61323969 |
>> | SERVER AGGREGATE INTO SINGLE ROW |
>> +------------------------------------------+
>>
>> So it seems the parallel index skip scan is the major factor. Is this
>> expected?
>>
>> Insight into the expected performance and behavior of high-cardinality
>> queries, and tips on optimizing for them, would be greatly appreciated.
>>
>> TIA
