Re: Efficient way to get the row count of a table

2017-12-20 Thread Jins George
Thank You James. That's great news that 4.13 for CDH is coming. I will try the 
same on the new version once its released.


Thanks,

Jins George


From: James Taylor 
Sent: Tuesday, December 19, 2017 5:38:38 PM
To: user
Subject: Re: Efficient way to get the row count of a table

The count would change when a major compaction is done. Back in 4.7, it may 
have changed when a split occurs too (but this is no longer the case). I'd 
recommend moving to a newer version: 4.7 was release almost two years ago and 
is six releases back from the current 4.13 release. FWIW, we're getting ready 
to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George 
> wrote:
Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good in my 
case. But when I tried this, it gave me inconsistent results.  for example.

Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name = 
'myschema.device_data'  => 3348090

Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name = 
'myschema.device_data'  => 3348080

So I was expecting the stats count to go up but surprisingly, the count went 
down.  Is there a specific configuration or something else that I am missing?

I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George



On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use 

 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
 M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
> wrote:
If it needs to be 100% accurate, then count(*) is the only way. If your data is 
write-once data, you might be able to track the row count at the application 
level through some kind of atomic counter in a different table (but this will 
likely be brittle). If you can live with an estimate, you could enable 
statistics [1], optionally configuring Phoenix not to use stats for 
parallelization [2], and query the SYSTEM.STATS table to get an estimate [3].

Another interesting alternative if you want the approximate row count when you 
have a where clause would be to use the new table sampling feature [4]. You'd 
also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where 
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only if you have 
multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George 
<jins.geo...@aeris.net>
 wrote:
Hi,

Is there a way to get the total row count of a phoenix table without running 
select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, so 
didn't want to put load on the system by running a select count(*) query.

Thanks,
Jins George






Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George  wrote:

> Thank you James and Mujtaba  for the responses.
>
> I am OK with an estimate count . So using SYSTEM.STATS table looks good in
> my case. But when I tried this, it gave me inconsistent results.  for
> example.
>
> Time 1:
> select count(*) from myschema.mytable  => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable  => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down.  Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternate outside Phoenix is to use
> 
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/
> mapreduce/RowCounter.html M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < 
>> jins.geo...@aeris.net> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>


Re: Efficient way to get the row count of a table

2017-12-19 Thread Jins George

Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good 
in my case. But when I tried this, it gave me inconsistent results.  for 
example.


Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348090


Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348080


So I was expecting the stats count to go up but surprisingly, the count 
went down.  Is there a specific configuration or something else that I 
am missing?


I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George


On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html 
M/R.


On Tue, Dec 19, 2017 at 3:18 PM, James Taylor > wrote:


If it needs to be 100% accurate, then count(*) is the only way. If
your data is write-once data, you might be able to track the row
count at the application level through some kind of atomic counter
in a different table (but this will likely be brittle). If you can
live with an estimate, you could enable statistics [1], optionally
configuring Phoenix not to use stats for parallelization [2], and
query the SYSTEM.STATS table to get an estimate [3].

Another interesting alternative if you want the approximate row
count when you have a where clause would be to use the new table
sampling feature [4]. You'd also want stats enabled for this to be
more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html

[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only
if you have multiple column families
[4] https://phoenix.apache.org/tablesample.html


On Tue, Dec 19, 2017 at 2:57 PM, Jins George
> wrote:

Hi,

Is there a way to get the total row count of a phoenix table
without running select count(*) from table ?
my use case is to monitor the record count in a table every x
minutes, so didn't want to put load on the system by running a
select count(*) query.

Thanks,
Jins George







Re: Efficient way to get the row count of a table

2017-12-19 Thread Mujtaba Chohan
Another alternate outside Phoenix is to use
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
wrote:

> If it needs to be 100% accurate, then count(*) is the only way. If your
> data is write-once data, you might be able to track the row count at the
> application level through some kind of atomic counter in a different table
> (but this will likely be brittle). If you can live with an estimate, you
> could enable statistics [1], optionally configuring Phoenix not to use
> stats for parallelization [2], and query the SYSTEM.STATS table to get an
> estimate [3].
>
> Another interesting alternative if you want the approximate row count when
> you have a where clause would be to use the new table sampling feature [4].
> You'd also want stats enabled for this to be more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
> have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George 
> wrote:
>
>> Hi,
>>
>> Is there a way to get the total row count of a phoenix table without
>> running select count(*) from table ?
>> my use case is to monitor the record count in a table every x minutes, so
>> didn't want to put load on the system by running a select count(*) query.
>>
>> Thanks,
>> Jins George
>>
>
>


Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
If it needs to be 100% accurate, then count(*) is the only way. If your
data is write-once data, you might be able to track the row count at the
application level through some kind of atomic counter in a different table
(but this will likely be brittle). If you can live with an estimate, you
could enable statistics [1], optionally configuring Phoenix not to use
stats for parallelization [2], and query the SYSTEM.STATS table to get an
estimate [3].

Another interesting alternative if you want the approximate row count when
you have a where clause would be to use the new table sampling feature [4].
You'd also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
have multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George  wrote:

> Hi,
>
> Is there a way to get the total row count of a phoenix table without
> running select count(*) from table ?
> my use case is to monitor the record count in a table every x minutes, so
> didn't want to put load on the system by running a select count(*) query.
>
> Thanks,
> Jins George
>


Efficient way to get the row count of a table

2017-12-19 Thread Jins George

Hi,

Is there a way to get the total row count of a phoenix table without 
running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, 
so didn't want to put load on the system by running a select count(*) 
query.


Thanks,
Jins George