Re: Ask performance advice

2017-04-06 Thread NaHeon Kim
Thank you all.

1.
The other query is:

select keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and *obj_id in ( obj_ids comes here... )*
and obj_grp = '100'
group by keyword
order by count desc;

This query is pretty fast because it constrains the first two PK columns.
The reason obj_id is the second PK column is that we first assumed 'select with
obj_id' would be called much more often; the 'select with obj_grp' requirement
came later.

2.
4.8.0-HBase-1.1 version is used.
I'm testing on the HBase cluster used for our product, so upgrading doesn't look
easy. : (
If there are definitely index bug fixes in newer releases, though, I could try.

3.
With Jonathan's reply and the optional column included in the PK, this schema looks worth trying:
create table my_table (
   obj_grp varchar(50),
   obj_id varchar(50) not null,
   regymdt timestamp not null,
   keyword varchar not null,
   cnt integer,
   post_tp varchar(20)
   constraint pk primary key (obj_grp, obj_id, regymdt, keyword)
)
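
For what it's worth, before loading data it's cheap to sanity-check what plan Phoenix
picks for the group query against this new key layout; this is only a sketch, with
placeholder obj_id values:

explain
select keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and obj_id in ('OBJ1', 'OBJ2')
and obj_grp = '100'
group by keyword
order by count desc;

With obj_grp leading the key, one would hope to see a range or skip scan over the
obj_grp + obj_id prefix rather than a full scan, but that should be confirmed on the
real data.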

Thanks,
NaHeon




2017-04-05 15:17 GMT+09:00 Jonathan Leech <jonat...@gmail.com>:

> Also, your schema as defined seems to have a pretty common hbase
> anti-pattern, using a steadily increasing value as the first part of the
> rowkey (assuming your data is also loaded in order of time). This will lead
> to region hotspotting on the load, region splitting, etc. Additionally,
> your queries may not take advantage of the parallelism of the cluster if
> they end up on a small number of regions. If the queries are more likely to
> be on the most recent data, as is often the case, you'll end up with a
> single server doing the bulk of the work in the cluster.
>
> - Jonathan
>
> On Apr 4, 2017, at 10:10 PM, James Taylor <jamestay...@apache.org> wrote:
>
> What other queries do you want to be fast? What version of Phoenix are you
> using? Have you seen our new Tuning Guide [1]?
>
> You could try moving OBJ_ID to end of PK and adding OBJ_GRP as the
> leading PK column (it's fine to include it in PK even if it's optional).
>
> Your index should be used, though. Can you try doing an explain on the
> query with 4.10 and file a JIRA if the index isn't being used?
>
> Thanks,
> James
>
> [1] http://phoenix.apache.org/tuning_guide.html
>
> On Tue, Apr 4, 2017 at 8:01 PM NaHeon Kim <honey.and...@gmail.com> wrote:
>
>> Hi all,
>>
>> My team has a Phoenix table containing over 30,000,000 rows and is trying to
>> speed up its query performance.
>> Ideally, all queries would complete within 1~2 seconds.
>>
>> - table schema -
>>
>> CREATE TABLE MY_TABLE (
>>REGYMDT timestamp not null,
>>OBJ_ID varchar(50) not null,
>>KEYWORD varchar not null,
>>OBJ_GRP  varchar(50),
>>CNT integer,
>>POST_TP varchar(20),
>>CONSTRAINT PK PRIMARY KEY (REGYMDT, OBJ_ID, KEYWORD)
>> ) IMMUTABLE_ROWS = true;
>>
>> create index objgrp_reg_kwd on my_table(obj_grp, regymdt, keyword)
>> include (cnt);
>>
>> - tuning-needed query -
>>
>> select keyword, sum(cnt) as count
>> from my_table
>> where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
>> to_timestamp('2017-04-03 23:00')
>> and obj_grp = '100'
>> group by keyword
>> order by count desc;
>>
>>
>> The reason obj_grp isn't in the PK is that it's an optional value.
>> I could give obj_grp a default value and put it in the PK, but I still don't
>> think that's a desirable pattern.
>>
>> The query above takes more than 20 seconds!
>> It does *not use* the objgrp_reg_kwd index; it only does a regymdt range scan.
>>
>> Thanks in advance!
>> NaHeon
>>
>>


Ask performance advice

2017-04-04 Thread NaHeon Kim
Hi all,

My team has a Phoenix table containing over 30,000,000 rows and is trying to
speed up its query performance.
Ideally, all queries would complete within 1~2 seconds.

- table schema -

CREATE TABLE MY_TABLE (
   REGYMDT timestamp not null,
   OBJ_ID varchar(50) not null,
   KEYWORD varchar not null,
   OBJ_GRP varchar(50),
   CNT integer,
   POST_TP varchar(20),
   CONSTRAINT PK PRIMARY KEY (REGYMDT, OBJ_ID, KEYWORD)
) IMMUTABLE_ROWS = true;

create index objgrp_reg_kwd on my_table(obj_grp, regymdt, keyword) include
(cnt);

- tuning-needed query -

select keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and obj_grp = '100'
group by keyword
order by count desc;


The reason obj_grp isn't in the PK is that it's an optional value.
I could give obj_grp a default value and put it in the PK, but I still don't
think that's a desirable pattern.

The query above takes more than 20 seconds!
It does *not use* the objgrp_reg_kwd index; it only does a regymdt range scan.
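
For reference, whether the optimizer will use the index at all can be checked by
forcing it with an index hint and looking at EXPLAIN; this is only a sketch reusing
the names above, not a verified fix:

explain
select /*+ INDEX(my_table objgrp_reg_kwd) */ keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and obj_grp = '100'
group by keyword
order by count desc;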

Thanks in advance!
NaHeon


Re: ROW_TIMESTAMP weird behaviour

2017-03-20 Thread NaHeon Kim
Hi,

Does ‘phoenix.query.dateFormatTimeZone’ work as expected?

Even though the dateFormatTimeZone option and the HBase servers are both set to the
UTC+9 timezone, rows upserted with UTC+9-based timestamp values are still hidden.
What I expect is that those rows become visible immediately after the upsert.

phoenix-4.8.0-HBase-1.1 is used! : )

Regards,
NaHeon

On 2017-02-07 23:45 (+0900), Ankit Singhal wrote:
> It's actually getting added, but maybe due to the timezone difference your
> values are going into the future.
> You can set the local time zone by setting phoenix.query.dateFormatTimeZone
> https://phoenix.apache.org/tuning.html
>
> On Tue, Feb 7, 2017 at 6:34 PM, Dhaval Modi wrote:
>
> > Thanks Ankit.
> >
> > My issue is related to PHOENIX-3176.
> >
> > But an additional observation is that any timestamp value after 13:00 hours of
> > the same day is not added.
> >
> > 0: jdbc:phoenix:> select * from DUMMY;
> > +--------------------------+
> > |          XXX_TS          |
> > +--------------------------+
> > | 2017-01-01 15:02:21.050  |
> > | 2017-01-02 15:02:21.050  |
> > | 2017-01-13 15:02:21.050  |
> > | 2017-02-06 15:02:21.050  |
> > | 2017-02-07 11:02:21.050  |
> > | 2017-02-07 11:03:21.050  |
> > | 2017-02-07 12:02:21.050  |
> > +--------------------------+
> > 7 rows selected (0.044 seconds)
> > 0: jdbc:phoenix:> upsert into DUMMY values('2017-02-07T12:03:21.050');
> > 1 row affected (0.01 seconds)
> > 0: jdbc:phoenix:> select * from DUMMY;
> > +--------------------------+
> > |          XXX_TS          |
> > +--------------------------+
> > | 2017-01-01 15:02:21.050  |
> > | 2017-01-02 15:02:21.050  |
> > | 2017-01-13 15:02:21.050  |
> > | 2017-02-06 15:02:21.050  |
> > | 2017-02-07 11:02:21.050  |
> > | 2017-02-07 11:03:21.050  |
> > | 2017-02-07 12:02:21.050  |
> > | 2017-02-07 12:03:21.050  |
> > +--------------------------+
> > 8 rows selected (0.047 seconds)
> > 0: jdbc:phoenix:> upsert into DUMMY values('2017-02-07T13:03:21.050');
> > 1 row affected (0.009 seconds)
> > 0: jdbc:phoenix:> select * from DUMMY;
> > +--------------------------+
> > |          XXX_TS          |
> > +--------------------------+
> > | 2017-01-01 15:02:21.050  |
> > | 2017-01-02 15:02:21.050  |
> > | 2017-01-13 15:02:21.050  |
> > | 2017-02-06 15:02:21.050  |
> > | 2017-02-07 11:02:21.050  |
> > | 2017-02-07 11:03:21.050  |
> > | 2017-02-07 12:02:21.050  |
> > | 2017-02-07 12:03:21.050  |
> > +--------------------------+
> > 8 rows selected (0.04 seconds)
> >
> > Regards,
> > Dhaval Modi
> > dhavalmod...@gmail.com
> >
> > On 7 February 2017 at 15:28, Ankit Singhal wrote:
> >
> >> I think you are also hitting
> >> https://issues.apache.org/jira/browse/PHOENIX-3176.
> >>
> >> On Tue, Feb 7, 2017 at 2:18 PM, Dhaval Modi wrote:
> >>
> >>>
> >>> Hi Pedro,
> >>>
> >>> The upserted keys are different. One key is for the July month & the other for
> >>> the January month.
> >>> 1. '2017-*07*-02T15:02:21.050'
> >>> 2. '2017-*01*-02T15:02:21.050'
> >>>
> >>> Regards,
> >>> Dhaval Modi
> >>> dhavalmod...@gmail.com
> >>>
> >>> On 7 February 2017 at 13:18, Pedro Boado wrote:
> >>>
> >>>> Hi.
> >>>>
> >>>> I don't think it's weird. That column is the PK and you've upserted the
> >>>> same key value twice, so the first one is inserted and the second one is updated.
> >>>>
> >>>> Regards.
> >>>>
> >>>> On 7 Feb 2017 04:59, "Dhaval Modi" wrote:
> >>>>
> >>>>> Hi All,
> >>>>>
> >>>>> I am facing abnormal scenarios with ROW_TIMESTAMP.
> >>>>>
> >>>>> I created a table in Phoenix as below:
> >>>>> CREATE TABLE DUMMY(XXX_TS TIMESTAMP NOT NULL CONSTRAINT pk PRIMARY KEY
> >>>>> (XXX_TS ROW_TIMESTAMP))
> >>>>> where "XXX_TS" is used as ROW_TIMESTAMP.
> >>>>>
> >>>>> Now, I am trying to add data:
> >>>>> upsert into DUMMY values('2017-07-02T15:02:21.050');
> >>>>> upsert into DUMMY values('2017-01-02T15:02:21.050');
> >>>>>
> >>>>> I am only seeing one entry.
> >>>>> ==
> >>>>> 0: jdbc:phoenix:> select * from DUMMY;
> >>>>> +--------------------------+
> >>>>> |          XXX_TS          |
> >>>>> +--------------------------+
> >>>>> | 2017-01-02 15:02:21.050  |
> >>>>> +--------------------------+
> >>>>> 1 row selected (0.039 seconds)
> >>>>> ==
> >>>>>
> >>>>> Additional info:
> >>>>> System date of HBase & Phoenix: Tue Feb  7 05:57:37 CET 2017
> >>>>>
> >>>>> Regards,
> >>>>> Dhaval Modi
> >>>>> dhavalmod...@gmail.com
> >>>>>

Re: write Dataframe to phoenix

2017-03-20 Thread NaHeon Kim
Did you check that your project has a dependency on the phoenix-spark jar? : )
See Spark setup at http://phoenix.apache.org/phoenix_spark.html

Regards,
NaHeon

2017-03-20 15:31 GMT+09:00 Sateesh Karuturi :

>
> I am trying to write a Dataframe to Phoenix.
>
> Here is my code:
>
> df.write.format("org.apache.phoenix.spark").mode(SaveMode.Overwrite)
>   .options(collection.immutable.Map(
>     "zkUrl" -> "localhost:2181/hbase-unsecure",
>     "table" -> "TEST")).save();
>
> and I am getting the following exception:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 3.0 (TID 411, ip-x-xx-xxx.ap-southeast-1.compute.internal):
> java.lang.RuntimeException: java.sql.SQLException: No suitable driver found
> for jdbc:phoenix:localhost:2181:/hbase-unsecure;
>     at org.apache.phoenix.mapreduce.PhoenixOutputFormat.getRecordWriter(PhoenixOutputFormat.java:58)
>     at org.apache.spark.rdd.PairRDDFunctions$anonfun$saveAsNewAPIHadoopDataset$1$anonfun$12.apply(PairRDDFunctions.scala:1030)
>     at org.apache.spark.rdd.PairRDDFunctions$anonfun$saveAsNewAPIHadoopDataset$1$anonfun$12.apply(PairRDDFunctions.scala:1014)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>


Re: Getting Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/phoenix/jdbc/PhoenixDriver Exception

2017-03-16 Thread NaHeon Kim
Hi,
I think you need to include the phoenix-client jar instead of the
phoenix-spark jar. :-)

2017-03-16 16:44 GMT+09:00 Sateesh Karuturi :

> Hello folks..,
>
> I am trying to run a sample Phoenix Spark application, and while running it
> I am getting the following exception:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/phoenix/jdbc/PhoenixDriver
>
>
> Here is my sample code:
>
> package com.inndata.spark.sparkphoenix;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.SQLContext;
>
> import com.google.common.collect.ImmutableMap;
>
> import java.io.Serializable;
>
> public class SparkConnection implements Serializable {
>
>   public static void main(String args[]) {
>     SparkConf sparkConf = new SparkConf();
>     sparkConf.setAppName("spark-phoenix-df");
>     sparkConf.setMaster("local[*]");
>     JavaSparkContext sc = new JavaSparkContext(sparkConf);
>     SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
>
>     /*DataFrame df = sqlContext.read()
>       .format("org.apache.phoenix.spark")
>       .option("table", "TABLE1")
>       .option("zkUrl", "localhost:2181")
>       .load();
>     df.count();*/
>
>     DataFrame fromPhx = sqlContext.read().format("jdbc")
>       .options(ImmutableMap.of("driver", "org.apache.phoenix.jdbc.PhoenixDriver",
>         "url", "jdbc:phoenix:ZK_QUORUM:2181:/hbase-secure", "dbtable", "TABLE1"))
>       .load();
>
>     fromPhx.show();
>   }
> }
>
> I have included the phoenix-spark jar in the Spark library path as well as in the
> spark-submit command, and I also added spark.executor.extraClassPath
> and spark.driver.extraClassPath in spark-env.sh.
>


Re: Row timestamp

2017-03-12 Thread NaHeon Kim
According to the row timestamp documentation
(https://phoenix.apache.org/rowtimestamp.html), I don't think you can.
It says that one of the primary key columns can be designated as the row
timestamp column.
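
In other words, the designation happens at CREATE TABLE time; a minimal sketch, with
illustrative table and column names:

create table events (
   created timestamp not null,
   event_id varchar(20) not null,
   payload varchar
   constraint pk primary key (created row_timestamp, event_id)
);

The HBase cell timestamp of each upserted row is then taken from the value written to
"created".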

Regards,
NaHeon

On 2017-03-11 22:34 (+0900), Batyrshin Alexander <0...@gmail.com> wrote:
> So the main idea behind the "Row Timestamp" feature is to give the ability to set
> the HBase cell timestamp via UPSERT?
> Is it possible to get the cell timestamp for an already created HBase table whose row
> keys contain no timestamp?
>
> For example, I tried to execute the query from that page:
>
> 0: jdbc:phoenix:> CREATE TABLE DESTINATION_METRICS_TABLE (CREATED_DATE NOT
> NULL DATE, METRIC_ID NOT NULL CHAR(15), METRIC_VALUE LONG CONSTRAINT PK
> PRIMARY KEY(CREATED_DATE ROW_TIMESTAMP, METRIC_ID)) SALT_BUCKETS = 8;
> Error: ERROR 601 (42P00): Syntax error. Encountered "NOT" at line 1, column
> 54. (state=42P00,code=601)
>
> The fixed query is: CREATE TABLE DESTINATION_METRICS_TABLE (CREATED_DATE DATE NOT
> NULL, METRIC_ID CHAR(15) NOT NULL, METRIC_VALUE UNSIGNED_LONG CONSTRAINT PK
> PRIMARY KEY(CREATED_DATE ROW_TIMESTAMP, METRIC_ID)) SALT_BUCKETS = 8;
> 
> > On 10 Mar 2017, at 19:39, Samarth Jain wrote:
> >
> > This is because you are using now() for created. If you used a different
> > date, then with TEST_ROW_TIMESTAMP1 the cell timestamp would be that date,
> > whereas with TEST_ROW_TIMESTAMP2 it would be the server-side time.
> >
> > Also, which examples are broken on the page?
> >
> > On Thu, Mar 9, 2017 at 11:28 AM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >
> >> Hello,
> >> I'm trying to understand what exactly the Phoenix row timestamp is.
> >> I created 2 tables for a test:
> >>
> >> CREATE TABLE test_row_timestamp1(
> >> id varchar NOT NULL,
> >> created TIMESTAMP NOT NULL,
> >> foo varchar,
> >> CONSTRAINT PK PRIMARY KEY( id, created ROW_TIMESTAMP )
> >> )
> >>
> >> CREATE TABLE test_row_timestamp2(
> >> id varchar NOT NULL,
> >> created TIMESTAMP NOT NULL,
> >> foo varchar,
> >> CONSTRAINT PK PRIMARY KEY( id, created )
> >> )
> >>
> >> upsert into test_row_timestamp1 (id, created, foo) values ('1', now(), 'bar');
> >> upsert into test_row_timestamp2 (id, created, foo) values ('1', now(), 'bar');
> >>
> >> And the result is:
> >>
> >> hbase(main):004:0> scan 'TEST_ROW_TIMESTAMP1', { LIMIT=>10}
> >> ROW                                               COLUMN+CELL
> >>  1\x00\x80\x00\x01Z\xB4\x80:6\x00\x00\x00\x00     column=0:FOO, timestamp=1489086986806, value=bar
> >>  1\x00\x80\x00\x01Z\xB4\x80:6\x00\x00\x00\x00     column=0:_0, timestamp=1489086986806, value=x
> >>
> >> hbase(main):005:0> scan 'TEST_ROW_TIMESTAMP2', { LIMIT=>10}
> >> ROW                                               COLUMN+CELL
> >>  1\x00\x80\x00\x01Z\xB4\x80M\xE6\x00\x00\x00\x00  column=0:FOO, timestamp=1489086991848, value=bar
> >>  1\x00\x80\x00\x01Z\xB4\x80M\xE6\x00\x00\x00\x00  column=0:_0, timestamp=1489086991848, value=x
> >>
> >> Both tables have the same row key pattern: id + 0x00 + timestamp.
> >> I expected that test_row_timestamp1 would utilise the native HBase timestamp that
> >> is part of the "real" HBase key.
> >>
> >> PS. Examples at https://phoenix.apache.org/rowtimestamp.html are broken
> >
> 

Re: Cannot upsert row_timestamp value

2017-02-27 Thread NaHeon Kim
Thanks a lot!!! It works. : )

Regards,
NaHeon


2017-02-26 15:33 GMT+09:00 Dhaval Modi <dhavalmod...@gmail.com>:

> Hi NaHeon Kim,
>
> Please refer to this mailing list thread:
> https://lists.apache.org/thread.html/fb747661f535b0a407bf38e6b961a2c68634815189c80a7d612366b1@%3Cuser.phoenix.apache.org%3E
>
>
> I also faced similar issue.
>
>
> Regards,
> Dhaval Modi
> dhavalmod...@gmail.com
>
> On 26 February 2017 at 10:34, NaHeon Kim <honey.and...@gmail.com> wrote:
>
>> Hi all,
>>
>> UPSERT into a table with ROW_TIMESTAMP column is not possible.
>> I'm using phoenix-4.8.0-hbase-1.1.
>>
>> Table Schema:
>>
>> create table my_table (
>>obj_id varchar(20) not null,
>>create_dt timestamp not null,
>>keyword varchar(100) not null,
>>count integer
>>constraint pk primary key (obj_id, create_dt row_timestamp, keyword)
>> );
>>
>>
>> 1) Spark Integration
>> No rows are inserted. No errors.
>>
>> 2) sqlline.py - timestamp column is in a query
>> No rows are inserted.
>>
>> upsert into my_table (obj_id, create_dt, keyword, count)
>> values ('objid', '2017-02-26 13:48:00', 'k', 100);
>>
>> 3) sqlline.py - timestamp column is not in a query
>> This throws an exception:
>>
>>> java.lang.ArrayIndexOutOfBoundsException: 8
>>> at org.apache.phoenix.execute.MutationState.getNewRowKeyWithRowTimestamp(MutationState.java:548)
>>> at org.apache.phoenix.execute.MutationState.generateMutations(MutationState.java:627)
>>> at org.apache.phoenix.execute.MutationState.addRowMutations(MutationState.java:566)
>>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:908)
>>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:1329)
>>> at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1161)
>>> at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:529)
>>
>>
>> upsert into my_table (obj_id, keyword, count)
>> values ('objid', 'k', 100);
>>
>> Everything works well without row_timestamp.
>> Thanks in advance! : )
>>
>
>


Cannot upsert row_timestamp value

2017-02-25 Thread NaHeon Kim
Hi all,

UPSERT into a table with ROW_TIMESTAMP column is not possible.
I'm using phoenix-4.8.0-hbase-1.1.

Table Schema:

create table my_table (
   obj_id varchar(20) not null,
   create_dt timestamp not null,
   keyword varchar(100) not null,
   count integer
   constraint pk primary key (obj_id, create_dt row_timestamp, keyword)
);


1) Spark Integration
No rows are inserted. No errors.

2) sqlline.py - timestamp column is in a query
No rows are inserted.

upsert into my_table (obj_id, create_dt, keyword, count)
values ('objid', '2017-02-26 13:48:00', 'k', 100);

3) sqlline.py - timestamp column is not in a query
This throws an exception:

> java.lang.ArrayIndexOutOfBoundsException: 8
> at
> org.apache.phoenix.execute.MutationState.getNewRowKeyWithRowTimestamp(MutationState.java:548)
> at
> org.apache.phoenix.execute.MutationState.generateMutations(MutationState.java:627)
> at
> org.apache.phoenix.execute.MutationState.addRowMutations(MutationState.java:566)
> at org.apache.phoenix.execute.MutationState.send(MutationState.java:908)
> at org.apache.phoenix.execute.MutationState.send(MutationState.java:1329)
> at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1161)
> at
> org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:529)


upsert into my_table (obj_id, keyword, count)
values ('objid', 'k', 100);

Everything works well without row_timestamp.
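
One hedged cross-check, given the dateFormatTimeZone discussion elsewhere in this
archive: string timestamps parsed in a different timezone can land "in the future" and
then look like silently dropped rows. Upserting via now() and comparing is a cheap way
to tell the two cases apart; a sketch using the schema above, with an illustrative
obj_id value:

upsert into my_table (obj_id, create_dt, keyword, count)
values ('objid-now', now(), 'k', 100);
select obj_id, create_dt from my_table where obj_id = 'objid-now';
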
Thanks in advance! : )


Re: Select date range performance issue

2017-02-24 Thread NaHeon Kim
Jonathan,

Thanks for the very helpful explanations!
A skip scan, and an index that includes every column used in the query, make a big
difference (a sketch of such an index follows below).
I'll experiment more with row timestamp and see what happens.
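
Concretely, the kind of covered index meant here would look roughly like the sketch
below (column names are from this thread). Phoenix carries the table's PK columns into
a global index automatically, so only COUNT should need INCLUDE; worth confirming with
EXPLAIN:

create index idx_create_dt on MY_TABLE (CREATE_DT) include (COUNT);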

Thanks,
NaHeon

2017-02-24 13:25 GMT+09:00 Jonathan Leech <jonat...@gmail.com>:

> 1. No, I am not confused. A skip scan would "skip" over entire ranges of
> obj_id and all create_dt values for it. This will only be effective if
> there are many less distinct values of obj_id than there are total rows. If
> there are too many distinct obj_ids then it either won't speed the query up
> at all, or not enough, but it's simple to try it and see.
>
> 2. Your index isn't used because it doesn't contain the other columns used
> in the query; i.e. your query isn't "covered". You get the column(s)
> defined in the index + anything in the rowkey. You can also use the
> "include" keyword to add other columns to the index. Alternatively, you can
> look at "local" indexes, or it may be possible to nest the query with a
> sub-select to fetch the desired primary key values from the index by
> create_dt and the others from the main table.
>
> 3. No, not at all. Phoenix will assign the internal hbase timestamp of
> the row to whatever you set create_dt to. It can also automatically set it
> to the current time when you create the row, if you want it to. This has
> other implications; e.g if you set a TTL, versions, etc in hbase. It can
> speed up queries, especially those that execute on the most recent data
> written, but prior to hbase compaction. Advanced stuff and performance is
> highly dependent on your specific use case and hbase compaction settings...
>
> On Feb 23, 2017, at 7:59 PM, NaHeon Kim <honey.and...@gmail.com> wrote:
>
> Thanks for suggestion.
>
> Here's further questions:
> 1. create_dt (not obj_id, I think you mixed them up) would have a large set of
> dates, so the SKIP_SCAN hint might not be useful.
>
> 2. I created a secondary index on create_dt:
>    create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT);
>
> However, EXPLAIN still shows a FULL SCAN query plan.
> Giving an index hint on the SELECT doesn't help either.
>
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
> NEWS_KEYWORD_COUNT
>SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
>
> 3. ROW_TIMESTAMP is the time of the current query execution, right?
> Then it's not the right choice. :-(
>
>
> 2017-02-24 1:54 GMT+09:00 Jonathan Leech <jonat...@gmail.com>:
>
>> If there are not a large number of distinct values of obj_id, try a
>> SKIP_SCAN hint. Otherwise, the secondary index should work, make sure it's
>> actually used via explain. Finally, you might try the ROW_TIMESTAMP feature
>> if it fits your use case.
>>
>> On Feb 22, 2017, at 11:30 PM, NaHeon Kim <honey.and...@gmail.com> wrote:
>>
>> Hi all,
>> I've seen performance problem when selecting rows within date range.
>>
>> My table schema is:
>>
>> CREATE TABLE MY_TABLE (
>>OBJ_ID varchar(20) not null,
>>CREATE_DT timestamp not null,
>>KEYWORD varchar(100) not null,
>>COUNT integer,
>>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
>> );
>>
>> MY_TABLE has almost 5,200,000 rows,
>> CREATE_DT has about 6 months range.
>>
>> And executed query:
>>
>> SELECT KEYWORD, SUM(COUNT)
>> FROM MY_TABLE
>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>> GROUP BY KEYWORD;
>>
>> It took 46 seconds, much slower than expected, since CREATE_DT is part of the
>> row key.
>> I created a secondary index on CREATE_DT but there's no improvement.
>>
>> Query plan looks weird:
>> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
>> NEWS_KEYWORD_COUNT
>>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000'
>> AND CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
>> CLIENT MERGE SORT
>> CLIENT 100 ROW LIMIT
>>
>> But if CREATE_DT came first in the row key, the plan says a range scan would be
>> done.
>>
>> Any suggestion? : )
>>
>> Thanks,
>> NaHeon
>>
>>
>


Re: Select date range performance issue

2017-02-23 Thread NaHeon Kim
Thanks for suggestion.

Here's further questions:
1. create_dt (not obj_id, I think you mixed them up) would have a large set of
dates, so the SKIP_SCAN hint might not be useful (see the sketch after these
questions).

2. I created a secondary index on create_dt:
   create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT);

However, EXPLAIN still shows a FULL SCAN query plan.
Giving an index hint on the SELECT doesn't help either.

CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
NEWS_KEYWORD_COUNT
   SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
   SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT

3. ROW_TIMESTAMP is the time of the current query execution, right?
Then it's not the right choice. :-(
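
For completeness, the hint from question 1 would be applied roughly like this (a sketch
only; as noted in the reply quoted below, whether it helps depends on how many distinct
OBJ_ID values exist):

SELECT /*+ SKIP_SCAN */ KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
AND CREATE_DT < to_timestamp('2016-04-01 00:00')
GROUP BY KEYWORD;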


2017-02-24 1:54 GMT+09:00 Jonathan Leech <jonat...@gmail.com>:

> If there are not a large number of distinct values of obj_id, try a
> SKIP_SCAN hint. Otherwise, the secondary index should work, make sure it's
> actually used via explain. Finally, you might try the ROW_TIMESTAMP feature
> if it fits your use case.
>
> On Feb 22, 2017, at 11:30 PM, NaHeon Kim <honey.and...@gmail.com> wrote:
>
> Hi all,
> I've seen performance problem when selecting rows within date range.
>
> My table schema is:
>
> CREATE TABLE MY_TABLE (
>OBJ_ID varchar(20) not null,
>CREATE_DT timestamp not null,
>KEYWORD varchar(100) not null,
>COUNT integer,
>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
> );
>
> MY_TABLE has almost 5,200,000 rows,
> CREATE_DT has about 6 months range.
>
> And executed query:
>
> SELECT KEYWORD, SUM(COUNT)
> FROM MY_TABLE
> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
> GROUP BY KEYWORD;
>
> It took 46 seconds, much slower than expected, since CREATE_DT is part of the row
> key.
> I created a secondary index on CREATE_DT but there's no improvement.
>
> Query plan looks weird:
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
> NEWS_KEYWORD_COUNT
>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
>
> But if CREATE_DT came first in the row key, the plan says a range scan would be done.
>
> Any suggestion? : )
>
> Thanks,
> NaHeon
>
>


Select date range performance issue

2017-02-22 Thread NaHeon Kim
Hi all,
I've seen a performance problem when selecting rows within a date range.

My table schema is:

CREATE TABLE MY_TABLE (
   OBJ_ID varchar(20) not null,
   CREATE_DT timestamp not null,
   KEYWORD varchar(100) not null,
   COUNT integer,
   CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
);

MY_TABLE has almost 5,200,000 rows,
and CREATE_DT covers a range of about 6 months.

And executed query:

SELECT KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
AND CREATE_DT < to_timestamp('2016-04-01 00:00')
GROUP BY KEYWORD;

It took 46 seconds, much slower than expected, since CREATE_DT is part of the row
key.
I created a secondary index on CREATE_DT but there's no improvement.

Query plan looks weird:
CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
NEWS_KEYWORD_COUNT
 SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
 SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT

But if CREATE_DT came first in the row key, the plan says a range scan would be done.
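
For comparison, this is roughly the key layout that yields a range scan for this query;
a sketch only, and note the hotspotting concern about a time-leading row key raised
elsewhere in this archive (SALT_BUCKETS is one common mitigation, and the bucket count
here is illustrative):

CREATE TABLE MY_TABLE (
   CREATE_DT timestamp not null,
   OBJ_ID varchar(20) not null,
   KEYWORD varchar(100) not null,
   COUNT integer,
   CONSTRAINT PK PRIMARY KEY (CREATE_DT, OBJ_ID, KEYWORD)
) SALT_BUCKETS = 8;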

Any suggestion? : )

Thanks,
NaHeon


Phoenix timezone not changed

2017-02-21 Thread NaHeon Kim
Hi all,

I have a Spark app which inserts data to HBase, using Phoenix.
I also use Ambari and Squirrel SQL.

I added the phoenix.query.dateFormatTimeZone option to the custom hbase-site.xml in
Ambari, but the timezone never changes.
I tried several formats (KST, GMT+9:00, ...) but none of them have any effect.

Thanks in advance!