Re: Efficient way to get the row count of a table

Jins George Wed, 20 Dec 2017 16:56:35 -0800

Thank You James. That's great news that 4.13 for CDH is coming. I will try the 
same on the new version once its released.

Thanks,

Jins George

________________________________
From: James Taylor <jamestay...@apache.org>
Sent: Tuesday, December 19, 2017 5:38:38 PM
To: user
Subject: Re: Efficient way to get the row count of a table

The count would change when a major compaction is done. Back in 4.7, it may 
have changed when a split occurs too (but this is no longer the case). I'd 
recommend moving to a newer version: 4.7 was release almost two years ago and 
is six releases back from the current 4.13 release. FWIW, we're getting ready 
to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George 
<jins.geo...@aeris.net<mailto:jins.geo...@aeris.net>> wrote:
Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good in my 
case. But when I tried this, it gave me inconsistent results.  for example.

Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name = 
'myschema.device_data'  => 3348090

Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name = 
'myschema.device_data'  => 3348080

So I was expecting the stats count to go up but surprisingly, the count went 
down.  Is there a specific configuration or something else that I am missing?

I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George

On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use 
<http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html>

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
 M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
<jamestay...@apache.org<mailto:jamestay...@apache.org>> wrote:
If it needs to be 100% accurate, then count(*) is the only way. If your data is 
write-once data, you might be able to track the row count at the application 
level through some kind of atomic counter in a different table (but this will 
likely be brittle). If you can live with an estimate, you could enable 
statistics [1], optionally configuring Phoenix not to use stats for 
parallelization [2], and query the SYSTEM.STATS table to get an estimate [3].

Another interesting alternative if you want the approximate row count when you 
have a where clause would be to use the new table sampling feature [4]. You'd 
also want stats enabled for this to be more accurate too.

Thanks,
James

[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where 
physical_name='my_schema.my_table'
     and COLUMN_FAMILY='my_first_column_family' -- necessary only if you have 
multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George 
<<mailto:jins.geo...@aeris.net>jins.geo...@aeris.net<mailto:jins.geo...@aeris.net>>
 wrote:
Hi,

Is there a way to get the total row count of a phoenix table without running 
select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, so 
didn't want to put load on the system by running a select count(*) query.

Thanks,
Jins George

Re: Efficient way to get the row count of a table

Reply via email to