can you add the parcels download location to phonix download page (http://phoenix.apache.org/download.html) ?

2017-12-19 Thread Bulvik, Noam


Noam




PRIVILEGED AND CONFIDENTIAL
PLEASE NOTE: The information contained in this message is privileged and 
confidential, and is intended only for the use of the individual to whom it is 
addressed and others who have been specifically authorized to receive it. If 
you are not the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication is strictly 
prohibited. If you have received this communication in error, or if any 
problems occur with transmission, please contact sender. Thank you.


Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George  wrote:

> Thank you James and Mujtaba  for the responses.
>
> I am OK with an estimate count . So using SYSTEM.STATS table looks good in
> my case. But when I tried this, it gave me inconsistent results.  for
> example.
>
> Time 1:
> select count(*) from myschema.mytable  => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable  => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down.  Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternate outside Phoenix is to use
> 
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/
> mapreduce/RowCounter.html M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < 
>> jins.geo...@aeris.net> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>


Re: Efficient way to get the row count of a table

2017-12-19 Thread Jins George

Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good 
in my case. But when I tried this, it gave me inconsistent results.  for 
example.


Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348090


Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348080


So I was expecting the stats count to go up but surprisingly, the count 
went down.  Is there a specific configuration or something else that I 
am missing?


I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George


On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html 
M/R.


On Tue, Dec 19, 2017 at 3:18 PM, James Taylor > wrote:


If it needs to be 100% accurate, then count(*) is the only way. If
your data is write-once data, you might be able to track the row
count at the application level through some kind of atomic counter
in a different table (but this will likely be brittle). If you can
live with an estimate, you could enable statistics [1], optionally
configuring Phoenix not to use stats for parallelization [2], and
query the SYSTEM.STATS table to get an estimate [3].

Another interesting alternative if you want the approximate row
count when you have a where clause would be to use the new table
sampling feature [4]. You'd also want stats enabled for this to be
more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html

[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only
if you have multiple column families
[4] https://phoenix.apache.org/tablesample.html


On Tue, Dec 19, 2017 at 2:57 PM, Jins George
> wrote:

Hi,

Is there a way to get the total row count of a phoenix table
without running select count(*) from table ?
my use case is to monitor the record count in a table every x
minutes, so didn't want to put load on the system by running a
select count(*) query.

Thanks,
Jins George







Re: Efficient way to get the row count of a table

2017-12-19 Thread Mujtaba Chohan
Another alternate outside Phoenix is to use
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
wrote:

> If it needs to be 100% accurate, then count(*) is the only way. If your
> data is write-once data, you might be able to track the row count at the
> application level through some kind of atomic counter in a different table
> (but this will likely be brittle). If you can live with an estimate, you
> could enable statistics [1], optionally configuring Phoenix not to use
> stats for parallelization [2], and query the SYSTEM.STATS table to get an
> estimate [3].
>
> Another interesting alternative if you want the approximate row count when
> you have a where clause would be to use the new table sampling feature [4].
> You'd also want stats enabled for this to be more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
> have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George 
> wrote:
>
>> Hi,
>>
>> Is there a way to get the total row count of a phoenix table without
>> running select count(*) from table ?
>> my use case is to monitor the record count in a table every x minutes, so
>> didn't want to put load on the system by running a select count(*) query.
>>
>> Thanks,
>> Jins George
>>
>
>


Re: Efficient way to get the row count of a table

2017-12-19 Thread James Taylor
If it needs to be 100% accurate, then count(*) is the only way. If your
data is write-once data, you might be able to track the row count at the
application level through some kind of atomic counter in a different table
(but this will likely be brittle). If you can live with an estimate, you
could enable statistics [1], optionally configuring Phoenix not to use
stats for parallelization [2], and query the SYSTEM.STATS table to get an
estimate [3].

Another interesting alternative if you want the approximate row count when
you have a where clause would be to use the new table sampling feature [4].
You'd also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
 and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
have multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George  wrote:

> Hi,
>
> Is there a way to get the total row count of a phoenix table without
> running select count(*) from table ?
> my use case is to monitor the record count in a table every x minutes, so
> didn't want to put load on the system by running a select count(*) query.
>
> Thanks,
> Jins George
>


Efficient way to get the row count of a table

2017-12-19 Thread Jins George

Hi,

Is there a way to get the total row count of a phoenix table without 
running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, 
so didn't want to put load on the system by running a select count(*) 
query.


Thanks,
Jins George