Re: Column Cardinality and Stats table as an "interface"

Nick Dimiduk Thu, 14 Apr 2016 14:02:01 -0700

>
> The stats table would purely be used to drive optimizer decisions in
> Phoenix. The data in the table is only collected during major compaction
> (or when an update stats is run manually), so it's not really meant for
> satisfying queries.
>
> For Kylin integration, we'd rely on Kylin to maintain the cubes and
> Calcite would be the glue that allows both Phoenix and Kylin to cooperate
> at planning time. I'm sure there'd be other runtime pieces required to make
> it work.
>

Understood. I'm not talking about query time. As I understand Kylin's
current state, it builds cubes from data in Hive tables conforming to a
star schema. My thinking is for an end-to-end Phoenix-driven data store,
where Kylin uses data stored in Phoenix as the source for building the
cubes. We don't store data in this schema structure in Phoenix, so
cube-building could be optimized by Phoenix's own stats table, instead of
cardinality queries running against Hive. In this deployment scenario, I
see no place for Hive at all.

> I have no idea on the feasibility of BlinkDB integration, but conceptually
> BlinkDB could probably be used as a statistics provider for Phoenix.
>

I'm not talking about integration. I'm suggesting Phoenix could support an
'approximate count' operator that generates a result based on queries to
the stats table. "Roughly how many rows are in this table?" Given the cost
of an actual row count, this would be useful functionality to provide.
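
To make the idea concrete, here is a rough sketch of what such an operator
might compute. It assumes the stats table keeps a per-guidepost row estimate,
which I have not verified against the current schema. The record shape
(GuidePostStat) and the function name are made up for illustration and are
not Phoenix APIs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GuidePostStat:
    # Hypothetical record shape; NOT the actual Phoenix stats schema.
    table_name: str
    guide_post_key: bytes
    row_count: int  # assumed: estimated rows covered by this guidepost


def approximate_row_count(stats: Iterable[GuidePostStat], table_name: str) -> int:
    """Sum the per-guidepost estimates for one table instead of scanning it."""
    return sum(s.row_count for s in stats if s.table_name == table_name)


# Made-up sample data, standing in for rows read from the stats table.
sample = [
    GuidePostStat("EVENTS", b"a", 1000),
    GuidePostStat("EVENTS", b"m", 1200),
    GuidePostStat("OTHER", b"a", 50),
]

estimate = approximate_row_count(sample, "EVENTS")
print(f"EVENTS has roughly {estimate} rows")
```

The answer would only be as fresh as the last major compaction or manual
stats update, which seems acceptable for a "roughly how many" question.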

On Thu, Apr 14, 2016 at 1:05 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
>> Ah, okay. Thanks for the pointer to PHOENIX-1178. Do you think the stats
>> table is the right place for this kind of info? Seems like the only choice.
>> Is there a plan to make the stats table a stable internal API? For
>> instance, integration with Kylin for building Cubes off of denormalized
>> event tables in Phoenix, or supporting BlinkDB approximation queries could
>> both be facilitated by the stats table.
>>
>> -n
>>
>> On Thu, Apr 14, 2016 at 12:24 PM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> FYI, Lars H. is looking at PHOENIX-258 for improving performance of
>>> DISTINCT. We don't yet keep any cardinality info in our stats
>>> (see PHOENIX-1178).
>>>
>>> Thanks,
>>> James
>>>
>>> On Thu, Apr 14, 2016 at 11:22 AM, Nick Dimiduk <ndimi...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm curious if there are any tricks for estimating the cardinality of
>>>> the values in a Phoenix column. Even for a leading rowkey column, a select
>>>> distinct query on a large table requires a full scan (PHOENIX-258). Maybe
>>>> one could reach into the stats table and derive some knowledge? How much of
>>>> a "bad thing" would this be?
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>
>>>
>>
>
