Re: Efficient way to get the row count of a table

Jins George Tue, 19 Dec 2017 16:28:16 -0800

Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks goodin my case. But when I tried this, it gave me inconsistent results. forexample.


Time 1:
select count(*) from myschema.mytable  => 3474085

select sum(guide_posts_row_count) from system.stats where physical_name= 'myschema.device_data' => 3348090


Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106

select sum(guide_posts_row_count) from system.stats where physical_name= 'myschema.device_data' => 3348080

So I was expecting the stats count to go up but surprisingly, the countwent down. Is there a specific configuration or something else that Iam missing?


I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George


On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:

Another alternate outside Phoenix is to usehttp://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.htmlM/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <jamestay...@apache.org<mailto:jamestay...@apache.org>> wrote:


    If it needs to be 100% accurate, then count(*) is the only way. If
    your data is write-once data, you might be able to track the row
    count at the application level through some kind of atomic counter
    in a different table (but this will likely be brittle). If you can
    live with an estimate, you could enable statistics [1], optionally
    configuring Phoenix not to use stats for parallelization [2], and
    query the SYSTEM.STATS table to get an estimate [3].

    Another interesting alternative if you want the approximate row
    count when you have a where clause would be to use the new table
    sampling feature [4]. You'd also want stats enabled for this to be
    more accurate too.

    Thanks,
    James


    [1] https://phoenix.apache.org/update_statistics.html
    <https://phoenix.apache.org/update_statistics.html>
    [2] phoenix.use.stats.parallelization=false
    [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
    physical_name='my_schema.my_table'
         and COLUMN_FAMILY='my_first_column_family' -- necessary only
    if you have multiple column families
    [4] https://phoenix.apache.org/tablesample.html
    <https://phoenix.apache.org/tablesample.html>

    On Tue, Dec 19, 2017 at 2:57 PM, Jins George
    <jins.geo...@aeris.net <mailto:jins.geo...@aeris.net>> wrote:

        Hi,

        Is there a way to get the total row count of a phoenix table
        without running select count(*) from table ?
        my use case is to monitor the record count in a table every x
        minutes, so didn't want to put load on the system by running a
        select count(*) query.

        Thanks,
        Jins George

Re: Efficient way to get the row count of a table

Reply via email to