> The stats table would purely be used to drive optimizer decisions in
> Phoenix. The data in the table is only collected during major compaction
> (or when an update stats is run manually), so it's not really meant for
> satisfying queries.
>
> For Kylin integration, we'd rely on Kylin to maintain the cubes and
> Calcite would be the glue that allows both Phoenix and Kylin to cooperate
> at planning time. I'm sure there'd be other runtime pieces required to make
> it work.
>

Understood. I'm not talking about query time. As I understand Kylin's current state, it builds cubes from data in Hive tables conforming to a star schema. My thinking is for an end-to-end Phoenix-driven data store, where Kylin uses data stored in Phoenix as the source for building the cubes. We don't store data in a star schema in Phoenix, so cube-building could be optimized by Phoenix's own stats table instead of by cardinality queries running against Hive. In this deployment scenario, I see no place for Hive at all.

> I have no idea on the feasibility of BlinkDB integration, but conceptually
> BlinkDB could probably be used as a statistics provider for Phoenix.
>

I'm not talking about integration. I'm suggesting Phoenix could support an 'approximate count' operator that generates its result from queries against the stats table: "Roughly how many rows are in this table?" Given the cost of an actual row count, this would be useful functionality to provide.

> On Thu, Apr 14, 2016 at 1:05 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
>> Ah, okay. Thanks for the pointer to PHOENIX-1178. Do you think the stats
>> table is the right place for this kind of info? Seems like the only choice.
>> Is there a plan to make the stats table a stable internal API? For
>> instance, integration with Kylin for building Cubes off of denormalized
>> event tables in Phoenix, or supporting BlinkDB approximation queries could
>> both be facilitated by the stats table.
>>
>> -n
>>
>> On Thu, Apr 14, 2016 at 12:24 PM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> FYI, Lars H. is looking at PHOENIX-258 for improving performance of
>>> DISTINCT. We don't yet keep any cardinality info in our stats
>>> (see PHOENIX-1178).
>>>
>>> Thanks,
>>> James
>>>
>>> On Thu, Apr 14, 2016 at 11:22 AM, Nick Dimiduk <ndimi...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm curious if there are any tricks for estimating the cardinality of
>>>> the values in a phoenix column. Even for leading rowkey column, a select
>>>> distinct query on a large table requires a full scan (PHOENIX-258). Maybe
>>>> one could reach into the stats table and derive some knowledge? How much of
>>>> a "bad thing" would this be?
>>>>
>>>> Thanks,
>>>> Nick
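
To make the 'approximate count' idea concrete, here is a minimal sketch of how such an estimate might be derived from per-guide-post statistics. It assumes each stats record carries a table name, a column family, and a count of rows covered by that guide post. The `GuidePost` record and the `approximate_row_count` function are hypothetical names used only for illustration. They are not Phoenix's actual stats schema or API, and the result is only as fresh as the last stats collection.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GuidePost:
    """Hypothetical stats record (assumed layout, not Phoenix's real schema)."""
    table: str
    column_family: str
    row_count: int  # rows covered by this guide post


def approximate_row_count(guide_posts: Iterable[GuidePost], table: str) -> int:
    """Estimate a table's row count without scanning the table itself."""
    per_family: Dict[str, int] = {}
    for gp in guide_posts:
        if gp.table == table:
            per_family[gp.column_family] = (
                per_family.get(gp.column_family, 0) + gp.row_count
            )
    # Column families describe the same rows, so adding families together
    # would double count. A row may be absent from some families, so use
    # the largest per-family total as the estimate.
    return max(per_family.values(), default=0)


if __name__ == "__main__":
    stats = [
        GuidePost("EVENTS", "A", 100),
        GuidePost("EVENTS", "A", 150),
        GuidePost("EVENTS", "B", 90),
    ]
    print(approximate_row_count(stats, "EVENTS"))
```

The estimate costs one pass over the stats records rather than a full table scan, which is the trade-off being proposed: a stale but cheap answer instead of an exact but expensive one.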