Re: Column Cardinality and Stats table as an "interface"

James Taylor Thu, 14 Apr 2016 14:08:24 -0700

Thanks for the clarifications, Nick. That's a cool idea for cube building -
I'm not aware of any JIRAs for that.


FYI, for approximate count, we have PHOENIX-418 which Ravi is working on. I
think he was looking at using a HyperLogLog library, but perhaps BlinkDB is
an alternative.
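
For illustration only (this is not code from PHOENIX-418 or from any Phoenix
release), here is a minimal Python sketch of the HyperLogLog idea: estimating
the number of distinct values with a fixed, small amount of memory. The class
name, the default of p=12 index bits, and the choice of SHA-1 as the hash are
assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch (illustrative only, not Phoenix code)."""

    def __init__(self, p=12):
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic-mean estimate over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"row-{i}")
    print(round(hll.count()))  # approximate distinct count, close to 100000
```

With p=12 the expected relative error is roughly 1.04 / sqrt(4096), or about
1.6%, and adding the same value repeatedly never changes the estimate.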

On Thu, Apr 14, 2016 at 2:01 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

>> The stats table would purely be used to drive optimizer decisions in
>> Phoenix. The data in the table is only collected during major compaction
>> (or when an update stats is run manually), so it's not really meant for
>> satisfying queries.
>>
>> For Kylin integration, we'd rely on Kylin to maintain the cubes and
>> Calcite would be the glue that allows both Phoenix and Kylin to cooperate
>> at planning time. I'm sure there'd be other runtime pieces required to make
>> it work.
>>
>
> Understood. I'm not talking about query time. As I understand Kylin's
> current state, it builds cubes from data in Hive tables conforming to a
> star schema. My thinking is for an end-to-end Phoenix-driven data store,
> where Kylin uses data stored in Phoenix as the source for building the
> cubes. We don't store data in this schema structure in Phoenix, so
> cube-building could be optimized by Phoenix's own stats table, instead of
> cardinality queries running against Hive. In this deployment scenario, I
> see no place for Hive at all.
>
>> I have no idea on the feasibility of BlinkDB integration, but conceptually
>> BlinkDB could probably be used as a statistics provider for Phoenix.
>>
>
> I'm not talking about integration. I'm suggesting Phoenix could support an
> 'approximate count' operator that generates a result based on queries to
> the stats table. "Roughly how many rows are in this table?" Given the cost
> of an actual row count, this would be useful functionality to provide.
>
>> On Thu, Apr 14, 2016 at 1:05 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> Ah, okay. Thanks for the pointer to PHOENIX-1178. Do you think the
>>> stats table is the right place for this kind of info? Seems like the only
>>> choice. Is there a plan to make the stats table a stable internal API? For
>>> instance, integration with Kylin for building Cubes off of denormalized
>>> event tables in Phoenix, or supporting BlinkDB approximation queries could
>>> both be facilitated by the stats table.
>>>
>>> -n
>>>
>>> On Thu, Apr 14, 2016 at 12:24 PM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> FYI, Lars H. is looking at PHOENIX-258 for improving performance of
>>>> DISTINCT. We don't yet keep any cardinality info in our stats
>>>> (see PHOENIX-1178).
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Thu, Apr 14, 2016 at 11:22 AM, Nick Dimiduk <ndimi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm curious if there are any tricks for estimating the cardinality of
>>>>> the values in a Phoenix column. Even for a leading rowkey column, a select
>>>>> distinct query on a large table requires a full scan (PHOENIX-258). Maybe
>>>>> one could reach into the stats table and derive some knowledge? How much of
>>>>> a "bad thing" would this be?
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>>
>>>>
>>>>
>>>
>>
>
