Thanks for the clarifications, Nick. That's a cool idea for cube building - I'm not aware of any JIRAs for that.
FYI, for approximate count, we have PHOENIX-418 which Ravi is working on. I think he was looking at using a HyperLogLog library, but perhaps BlinkDB is an alternative.

On Thu, Apr 14, 2016 at 2:01 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

>> The stats table would purely be used to drive optimizer decisions in
>> Phoenix. The data in the table is only collected during major compaction
>> (or when an update stats is run manually), so it's not really meant for
>> satisfying queries.
>>
>> For Kylin integration, we'd rely on Kylin to maintain the cubes and
>> Calcite would be the glue that allows both Phoenix and Kylin to cooperate
>> at planning time. I'm sure there'd be other runtime pieces required to make
>> it work.
>>
>
> Understood. I'm not talking about query time. As I understand Kylin's
> current state, it builds cubes from data in Hive tables conforming to a
> star schema. My thinking is for an end-to-end Phoenix-driven data store,
> where Kylin uses data stored in Phoenix as the source for building the
> cubes. We don't store data in this schema structure in Phoenix, so
> cube-building could be optimized by Phoenix's own stats table, instead of
> cardinality queries running against Hive. In this deployment scenario, I
> see no place for Hive at all.
>
>> I have no idea on the feasibility of BlinkDB integration, but conceptually
>> BlinkDB could probably be used as a statistics provider for Phoenix.
>>
>
> I'm not talking about integration. I'm suggesting phoenix could support an
> 'approximate count' operator that generated a result based on queries to
> the stats table. "Roughly how many rows are in this table?" Given the cost
> of an actual row count, this would be a useful functionality to provide.
>
>> On Thu, Apr 14, 2016 at 1:05 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> Ah, okay. Thanks for the pointer to PHOENIX-1178. Do you think the
>>> stats table is the right place for this kind of info? Seems like the only
>>> choice. Is there a plan to make the stats table a stable internal API? For
>>> instance, integration with Kylin for building Cubes off of denormalized
>>> event tables in Phoenix, or supporting BlinkDB approximation queries could
>>> both be facilitated by the stats table.
>>>
>>> -n
>>>
>>> On Thu, Apr 14, 2016 at 12:24 PM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> FYI, Lars H. is looking at PHOENIX-258 for improving performance of
>>>> DISTINCT. We don't yet keep any cardinality info in our stats
>>>> (see PHOENIX-1178).
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Thu, Apr 14, 2016 at 11:22 AM, Nick Dimiduk <ndimi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm curious if there are any tricks for estimating the cardinality of
>>>>> the values in a phoenix column. Even for leading rowkey column, a select
>>>>> distinct query on a large table requires a full scan (PHOENIX-258). Maybe
>>>>> one could reach into the stats table and derive some knowledge? How much
>>>>> of a "bad thing" would this be?
>>>>>
>>>>> Thanks,
>>>>> Nick