Re: Query performance analysis when using LIMIT clause

2020-01-07 Thread talluri abhishek
Hi All,

Bumping this up to see if anyone has any thoughts on this behavior when
using a LIMIT clause.

Thanks,
Abhishek

On Wed, Dec 4, 2019 at 12:47 PM talluri abhishek 
wrote:

> Hi All,
>
> My setup has Phoenix 4.14 and a table with 2 column families (CF1 & CF2),
> where CF1 has around 100 columns and CF2 has 3 columns. Below are a few
> queries which show the difference in execution times with and without a
> LIMIT clause, along with their query plans. There is more than 20x
> performance degradation when using a LIMIT clause on these queries. Any
> thoughts on this behavior?
>
> 0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak';
>
> +-------+-------+-------+
> |  AA   |   A   |   B   |
> +-------+-------+-------+
> | znvv  | lgak  | wjkm  |
> | kiry  | lgak  | gnpu  |
> | qbnp  | lgak  | yowh  |
> | xzfc  | lgak  | nibn  |
> +-------+-------+-------+
>
> 4 rows selected (0.603 seconds)
>
> 0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 2;
>
> +-------+-------+-------+
> |  AA   |   A   |   B   |
> +-------+-------+-------+
> | znvv  | lgak  | wjkm  |
> | kiry  | lgak  | gnpu  |
> +-------+-------+-------+
>
> 2 rows selected (12.115 seconds)
>
> 0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 10;
>
> +-------+-------+-------+
> |  AA   |   A   |   B   |
> +-------+-------+-------+
> | znvv  | lgak  | wjkm  |
> | kiry  | lgak  | gnpu  |
> | qbnp  | lgak  | yowh  |
> | xzfc  | lgak  | nibn  |
> +-------+-------+-------+
>
> 4 rows selected (15.338 seconds)
>
> 0: jdbc:phoenix:> explain select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak';
>
> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
> | CLIENT 2-CHUNK 35112 ROWS 314572800 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER TEST_LIMIT | 314572800 | 35112 | 1575395762384 |
> | SERVER FILTER BY CF2.A = 'lgak' | 314572800 | 35112 | 1575395762384 |
>
> 2 rows selected (0.033 seconds)
>
> 0: jdbc:phoenix:> explain select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 10;
>
> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
> | CLIENT 2-CHUNK 35112 ROWS 314572800 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER TEST_LIMIT | 314572800 | 35112 | 1575395762384 |
> | SERVER FILTER BY CF2.A = 'lgak' | 314572800 | 35112 | 1575395762384 |
> | SERVER 10 ROW LIMIT | 314572800 | 35112 | 1575395762384 |
> | CLIENT 10 ROW LIMIT | 314572800 | 35112 | 1575395762384 |
>
> 4 rows selected (0.032 seconds)
>
>


Query performance analysis when using LIMIT clause

2019-12-04 Thread talluri abhishek
Hi All,

My setup has Phoenix 4.14 and a table with 2 column families (CF1 & CF2),
where CF1 has around 100 columns and CF2 has 3 columns. Below are a few
queries which show the difference in execution times with and without a
LIMIT clause, along with their query plans. There is more than 20x
performance degradation when using a LIMIT clause on these queries. Any
thoughts on this behavior?

0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak';

+-------+-------+-------+
|  AA   |   A   |   B   |
+-------+-------+-------+
| znvv  | lgak  | wjkm  |
| kiry  | lgak  | gnpu  |
| qbnp  | lgak  | yowh  |
| xzfc  | lgak  | nibn  |
+-------+-------+-------+

4 rows selected (0.603 seconds)

0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 2;

+-------+-------+-------+
|  AA   |   A   |   B   |
+-------+-------+-------+
| znvv  | lgak  | wjkm  |
| kiry  | lgak  | gnpu  |
+-------+-------+-------+

2 rows selected (12.115 seconds)

0: jdbc:phoenix:> select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 10;

+-------+-------+-------+
|  AA   |   A   |   B   |
+-------+-------+-------+
| znvv  | lgak  | wjkm  |
| kiry  | lgak  | gnpu  |
| qbnp  | lgak  | yowh  |
| xzfc  | lgak  | nibn  |
+-------+-------+-------+

4 rows selected (15.338 seconds)

0: jdbc:phoenix:> explain select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak';

| PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
| CLIENT 2-CHUNK 35112 ROWS 314572800 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER TEST_LIMIT | 314572800 | 35112 | 1575395762384 |
| SERVER FILTER BY CF2.A = 'lgak' | 314572800 | 35112 | 1575395762384 |

2 rows selected (0.033 seconds)

0: jdbc:phoenix:> explain select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 10;

| PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
| CLIENT 2-CHUNK 35112 ROWS 314572800 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER TEST_LIMIT | 314572800 | 35112 | 1575395762384 |
| SERVER FILTER BY CF2.A = 'lgak' | 314572800 | 35112 | 1575395762384 |
| SERVER 10 ROW LIMIT | 314572800 | 35112 | 1575395762384 |
| CLIENT 10 ROW LIMIT | 314572800 | 35112 | 1575395762384 |

4 rows selected (0.032 seconds)
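
Both plans above show a FULL SCAN over TEST_LIMIT with the CF2.A filter applied server side, so every variant of the query reads the whole table. As a sketch of one thing worth trying (the index name below is made up), a covered index on CF2.A should let the lookup become a range scan over the index instead of a full table scan; whether it also removes the LIMIT slowdown would need to be verified with EXPLAIN on the actual table:

0: jdbc:phoenix:> CREATE INDEX IDX_TEST_LIMIT_CF2A ON TEST_LIMIT (CF2.A) INCLUDE (CF1.AA, CF2.B);
0: jdbc:phoenix:> explain select CF1.AA, CF2.A, CF2.B from test_limit where CF2.A = 'lgak' limit 10;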


Re: Arithmetic Error in a select query

2019-02-13 Thread talluri abhishek
Thanks, Jaanai.

For the second question, did you mean the region server logs?

Also, I see that Phoenix has tracing features that we can enable. Could we
enable tracing to get more information, or what does it provide that the
region server logs cannot?

-AT

On Thu, Feb 14, 2019 at 8:37 AM Jaanai Zhang  wrote:

> 1. How do we capture the debug level logs on this jdbc client? Should we
>> also enable debug level on the region servers to understand what is
>> triggering this error?
>>
>
> You can set the logging level in the log4j configuration file on the
> client side; most of the time the debug level can stay disabled.
>
> 2. Primary key on that table had a not null constraint and not sure why
>> the error was stating null?
>>
>
> This is a server-side error; perhaps you can find some exceptions in
> the log files.
>
> --------
>Jaanai Zhang
>Best regards!
>
>
>
> On Wed, Feb 13, 2019 at 11:06 AM, talluri abhishek wrote:
>
>> Hi All,
>>
>> I have a `select primary_key from table_name limit 50` query that
>> works most of the time but returns the below error at times:
>> ERROR 212 (22012): Arithmetic error on server. ERROR 212 (22012):
>> Arithmetic error on server. null
>> We are using a JDBC client to query Phoenix and have the following
>> questions in mind:
>>
>> 1. How do we capture debug-level logs on this JDBC client? Should we
>> also enable debug level on the region servers to understand what is
>> triggering this error?
>> 2. The primary key on that table has a NOT NULL constraint, so why is
>> the error mentioning null?
>>
>> Thanks,
>> Abhishek
>>
>
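
For the client-side logging Jaanai mentions above, a minimal log4j.properties sketch (the logger names assume the stock Phoenix 4.x / HBase client packages, and DEBUG should only be enabled while troubleshooting):

log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
# raise these only while reproducing the failing query
log4j.logger.org.apache.phoenix=DEBUG
log4j.logger.org.apache.hadoop.hbase.client=DEBUG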


Arithmetic Error in a select query

2019-02-12 Thread talluri abhishek
Hi All,

I have a `select primary_key from table_name limit 50` query that works
most of the time but returns the below error at times:
ERROR 212 (22012): Arithmetic error on server. ERROR 212 (22012):
Arithmetic error on server. null
We are using a JDBC client to query Phoenix and have the following
questions in mind:

1. How do we capture debug-level logs on this JDBC client? Should we
also enable debug level on the region servers to understand what is
triggering this error?
2. The primary key on that table has a NOT NULL constraint, so why is the
error mentioning null?

Thanks,
Abhishek


Re: Growth in table size and performance degradation on read-queries

2019-01-29 Thread talluri abhishek
Hi Vincent,

VERSIONS is set to 1 and KEEP_DELETED_CELLS is false. These are basically
the default settings; nothing has been changed.

A describe on the HBase table gives the following:

VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
KEEP_DELETED_CELLS => 'FALSE'


Thanks,
Abhishek

On Tue, Jan 29, 2019 at 3:20 PM Vincent Poon  wrote:

> is your max_versions set to 1 ?  keep_deleted_cells?
>
> On Tue, Jan 29, 2019 at 10:41 AM talluri abhishek <
> abhishektall...@gmail.com> wrote:
>
>> Hi All,
>>
>> We are seeing a couple of issues on some of our Phoenix tables where the
>> size of the tables keeps growing 2-3 times after around 2-3 days of
>> ingestion and the read performance takes a big hit after that. Now, if we
>> insert-overwrite the data in that table into a new copy table, the data size
>> comes back to normal and the queries perform fast on that copy table.
>>
>> Initial table size after 1st day ~ 5G
>> After 2 days of ingestion ~ 15G
>> Re-write into a copy table ~ 5-6 G
>>
>> Query performance becomes proportional to the size of the table: let's say
>> the query took 40 secs to run on the original table after the first day; it
>> takes around 130-160 secs after 2 days of ingestion. The same query, when
>> run on the copy table, finishes in around 40 secs.
>>
>> Most of the data ingested after the first day is updates to existing rows,
>> so we thought major compaction should solve the size issue, but it does not
>> shrink the size every time (load happens in parallel when the compaction is
>> run).
>> Write performance is always good and we have used salt buckets to even
>> out the writes. The primary key is a 12-bit string which is made by the
>> concatenation of an account id and an auto-generated transaction number.
>>
>> One query whose performance suffers, as mentioned above, is:
>> select (list of 50-70 columns) from original_table where account_id IN
>> (list of 100k account ids) [account_id in this query is the primary key
>> on that table]
>>
>> We are currently increasing the heap space on these region servers to
>> provide more memstore size, which could reduce the number of flushes for
>> the upserted data.
>>
>> Could there be any other reason for the increase in the size of the table
>> apart from the updated rows? How could we improve the performance of those
>> read queries?
>>
>> Thanks,
>> Abhishek
>>
>


Growth in table size and performance degradation on read-queries

2019-01-29 Thread talluri abhishek
Hi All,

We are seeing a couple of issues on some of our Phoenix tables where the
size of the tables keeps growing 2-3 times after around 2-3 days of
ingestion and the read performance takes a big hit after that. Now, if we
insert-overwrite the data in that table into a new copy table, the data size
comes back to normal and the queries perform fast on that copy table.

Initial table size after 1st day ~ 5G
After 2 days of ingestion ~ 15G
Re-write into a copy table ~ 5-6 G

Query performance becomes proportional to the size of the table: let's say
the query took 40 secs to run on the original table after the first day; it
takes around 130-160 secs after 2 days of ingestion. The same query, when
run on the copy table, finishes in around 40 secs.

Most of the data ingested after the first day is updates to existing rows,
so we thought major compaction should solve the size issue, but it does not
shrink the size every time (load happens in parallel when the compaction is
run).
Write performance is always good and we have used salt buckets to even out
the writes. The primary key is a 12-bit string which is made by the
concatenation of an account id and an auto-generated transaction number.
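
For reference, one way to re-check the compaction effect while ingestion is paused is to trigger it and re-measure the on-disk size from the HBase shell (the table name and path below are placeholders; the path assumes the default hbase.rootdir layout):

describe 'ORIGINAL_TABLE'          # confirm VERSIONS / KEEP_DELETED_CELLS / TTL actually in effect
major_compact 'ORIGINAL_TABLE'     # asynchronous; wait for it to finish before re-measuring
# then, from a regular shell:
hdfs dfs -du -h /hbase/data/default/ORIGINAL_TABLE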

One query whose performance suffers, as mentioned above, is:
select (list of 50-70 columns) from original_table where account_id IN
(list of 100k account ids) [account_id in this query is the primary key on
that table]

We are currently increasing the heap space on these region servers to
provide more memstore size, which could reduce the number of flushes for
the upserted data.

Could there be any other reason for the increase in the size of the table
apart from the updated rows? How could we improve the performance of those
read queries?

Thanks,
Abhishek


Re: phoenix-spark plugin with Spark 2.3

2019-01-10 Thread talluri abhishek
Thanks, Youngwoo. It helps to know that we just need to change the client
jar on the extraClassPath for a major Spark release upgrade.
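
For anyone hitting the same thing later, a sketch of what that ends up looking like (the jar path and Spark version below are placeholders):

# rebuild the client jar against the target Spark version, per Youngwoo's note
mvn clean package -Dspark.version=2.3.0 -DskipTests

# spark-defaults.conf (or the equivalent --conf flags on spark-submit)
spark.driver.extraClassPath=/opt/phoenix/phoenix-client.jar
spark.executor.extraClassPath=/opt/phoenix/phoenix-client.jar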

-Abhishek

On Wed, Jan 9, 2019 at 8:43 PM Youngwoo Kim (김영우) 
wrote:

> Abhishek,
>
> If you want to upgrade the Spark major version, you ONLY need to re-build
> phoenix-client.jar for your Spark application. That means you don't need to
> rebuild/upgrade the entire Phoenix package for the cluster, and you don't
> need to re-insert your data. Just make sure that the phoenix-client.jar file
> on Spark's extraClassPath is built with the proper Spark dependency version.
>
> Thanks,
> Youngwoo
>
>
> On Thu, Jan 10, 2019 at 8:23 AM talluri abhishek <
> abhishektall...@gmail.com> wrote:
>
>> Thanks Youngwoo.
>> If that's the case how do we handle an upgrade from one spark version to
>> another? Do we have to rebuild the Phoenix and thereby re-insert all the
>> data?
>> Or is there a better way to handle this.
>>
>> Thanks
>> Abhishek.
>>
>>
>> On Wed, Jan 9, 2019 at 6:09 PM Youngwoo Kim (김영우) 
>> wrote:
>>
>>> Hi Abhishek,
>>>
>>> Yes. You need to build the Phoenix packages with the proper Spark dependency
>>> to run on Spark 2.x, e.g., mvn clean package -Dspark.version=2.0.0
>>>
>>> And also, there is another profile for Spark 1.6. See
>>> https://github.com/apache/phoenix/blob/4.x-HBase-1.4/pom.xml#L1071
>>>
>>> So, if you would like to run the spark plugin on your Spark distro, you
>>> should build the Phoenix package with the appropriate Spark dependency.
>>>
>>> HTH,
>>> Youngwoo
>>>
>>> On Thu, Jan 10, 2019 at 6:32 AM, talluri abhishek wrote:
>>>
>>>> Hi All,
>>>>
>>>> When trying to use phoenix-spark plugin with spark 2.3, we are getting
>>>> a NoClassDefFoundError and it is similar to what is found in the below JIRA
>>>> https://issues.apache.org/jira/browse/PHOENIX-
>>>>
>>>> Using a phoenix 4.14-cdh5.14 parcel alongside 2.3.0.cloudera3 spark
>>>> release to reproduce this issue. This plugin works fine when using the
>>>> spark 1.6 that comes with the distribution but not the spark 2.3 version
>>>> installed through a csd/parcel. Is there any way to make it work with the
>>>> given phoenix 4.14 version and spark 2.3 version? (Tried passing in the
>>>> phoenix-client jar and the phoenix-spark jar with the driver and executor
>>>> extraClassPath.)
>>>>
>>>> I could see that phoenix 4.14-cdh5.14 is built using the
>>>> cdh-spark-version which is 1.6, do we need to build against the required
>>>> spark version in order to make it work or am I missing something? The
>>>> docs/JIRA seem to suggest that 4.10 version should be compatible with spark
>>>> 1.3.1+ versions.
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks,
>>>> Abhishek
>>>>
>>>


Re: phoenix-spark plugin with Spark 2.3

2019-01-09 Thread talluri abhishek
Thanks, Youngwoo.
If that's the case, how do we handle an upgrade from one Spark version to
another? Do we have to rebuild Phoenix and thereby re-insert all the data?
Or is there a better way to handle this?

Thanks
Abhishek.


On Wed, Jan 9, 2019 at 6:09 PM Youngwoo Kim (김영우)  wrote:

> Hi Abhishek,
>
> Yes. You need to build the Phoenix packages with the proper Spark dependency
> to run on Spark 2.x, e.g., mvn clean package -Dspark.version=2.0.0
>
> And also, there is another profile for Spark 1.6. See
> https://github.com/apache/phoenix/blob/4.x-HBase-1.4/pom.xml#L1071
>
> So, if you would like to run the spark plugin on your Spark distro, you
> should build the Phoenix package with the appropriate Spark dependency.
>
> HTH,
> Youngwoo
>
> On Thu, Jan 10, 2019 at 6:32 AM, talluri abhishek wrote:
>
>> Hi All,
>>
>> When trying to use phoenix-spark plugin with spark 2.3, we are getting
>> a NoClassDefFoundError and it is similar to what is found in the below JIRA
>> https://issues.apache.org/jira/browse/PHOENIX-
>>
>> Using a phoenix 4.14-cdh5.14 parcel alongside 2.3.0.cloudera3 spark
>> release to reproduce this issue. This plugin works fine when using the
>> spark 1.6 that comes with the distribution but not the spark 2.3 version
>> installed through a csd/parcel. Is there any way to make it work with the
>> given phoenix 4.14 version and spark 2.3 version? (Tried passing in the
>> phoenix-client jar and the phoenix-spark jar with the driver and executor
>> extraClassPath.)
>>
>> I could see that phoenix 4.14-cdh5.14 is built using the
>> cdh-spark-version which is 1.6, do we need to build against the required
>> spark version in order to make it work or am I missing something? The
>> docs/JIRA seem to suggest that 4.10 version should be compatible with spark
>> 1.3.1+ versions.
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Abhishek
>>
>


phoenix-spark plugin with Spark 2.3

2019-01-09 Thread talluri abhishek
Hi All,

When trying to use the phoenix-spark plugin with Spark 2.3, we are getting
a NoClassDefFoundError similar to the one described in the JIRA below:
https://issues.apache.org/jira/browse/PHOENIX-

We are using a phoenix 4.14-cdh5.14 parcel alongside the 2.3.0.cloudera3 Spark
release to reproduce this issue. The plugin works fine when using the Spark 1.6
that comes with the distribution, but not with the Spark 2.3 version installed
through a csd/parcel. Is there any way to make it work with the given
Phoenix 4.14 version and Spark 2.3 version? (We tried passing in the
phoenix-client jar and the phoenix-spark jar with the driver and executor
extraClassPath.)

I could see that phoenix 4.14-cdh5.14 is built using the cdh-spark-version,
which is 1.6. Do we need to build against the required Spark version in
order to make it work, or am I missing something? The docs/JIRA seem to
suggest that the 4.10 version should be compatible with Spark 1.3.1+ versions.

Any help would be appreciated.

Thanks,
Abhishek


Regarding upgrading from 4.7 to 4.14

2018-11-14 Thread talluri abhishek
Hi All,

We are upgrading from Phoenix 4.7 to 4.14 and observed that the data is not
directly available in Phoenix after the upgrade, though the underlying old
HBase tables still hold the data.
Is this because of the column name encoding that was introduced after 4.8, and
is there any easier way to migrate the data from older versions to a newer
version (>= 4.8 and < 4.14) without having to re-insert it?
I am using CDH parcels to upgrade. Do we still need to upgrade a maximum of
two versions at a time, as stated in http://phoenix.apache.org/upgrading.html,
or is it okay to upgrade directly to 4.14? Any known issues in doing so?
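
In case the column-name encoding does turn out to be the cause, newer Phoenix versions expose a table-level property that disables it when a table is (re)declared over existing HBase data. A hedged sketch only (the table and column names are placeholders, and whether this applies depends on how the original tables were created):

CREATE TABLE MY_OLD_TABLE (
    PK VARCHAR PRIMARY KEY,
    CF1.COL1 VARCHAR
) COLUMN_ENCODED_BYTES = 0;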

Thanks,
Abhishek