Re: Efficient way to get the row count of a table

James Taylor Tue, 19 Dec 2017 17:39:02 -0800

The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.


On Tue, Dec 19, 2017 at 4:27 PM, Jins George <[email protected]> wrote:

> Thank you James and Mujtaba  for the responses.
>
> I am OK with an estimate count . So using SYSTEM.STATS table looks good in
> my case. But when I tried this, it gave me inconsistent results.  for
> example.
>
> Time 1:
> select count(*) from myschema.mytable  => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable  => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down.  Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternate outside Phoenix is to use
> <http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/
> mapreduce/RowCounter.html M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <[email protected]>
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>>      and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < <[email protected]>
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>

Re: Efficient way to get the row count of a table

Reply via email to