Re: Rolling hourly data

2018-11-27 Thread Monil Gandhi
Thanks very much for the response. I don’t think Phoenix does roll-ups, or at
least not the version we are using. We are constrained by EMR on AWS, which is
on 4.14.

Thomas - what I meant was that I want to write an MR job or an “update value”
query for all of the sids, and there are about 200k different values. It would
be very easy if I could write an MR job to select all rows in a day and just
reduce them to one row in a different table based on the 3 primary key columns.
But because day is the last part of the primary key, I keep getting errors or
the MR job just takes a long time.

One thing I forgot to mention is that we do not need hourly granularity at
query time. It is purely for “real” time ingestion and getting the latest data
into the system.
Hence, at this point I am thinking I will generate the daily data in a separate
offline pipeline and just keep updating the same row during ingestion.
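
For the record, a minimal sketch of what that batched daily roll-up could look
like (the DAILY_T1 table name and the sid range are assumptions, used purely
for illustration; batching by sid range is the workaround for the thread-pool
rejection quoted further down):

CREATE TABLE IF NOT EXISTS DAILY_T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
    cid BIGINT NOT NULL, s.count BIGINT,
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)) COMPRESSION='SNAPPY';

-- run once per sid batch and per day, keeping each scan small
UPSERT INTO DAILY_T1 (sid, day, cid, s.count)
SELECT sid, TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT'), cid,
       CAST(SUM(count) AS BIGINT)
FROM T1
WHERE sid >= 0 AND sid < 100000
  AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
              AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY sid, cid;
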
On Tue, Nov 27, 2018 at 12:19 AM Jonathan Leech  wrote:

> I would try writing the hourly values as 24 columns in a daily row, or as
> an array type.
>
> I’m not up to speed on the latest Phoenix features, but if it could
> update a daily sum on the fly that might be OK. If that doesn’t exist yet
> or isn’t performant, it could be done in an HBase coprocessor.
>
> I would also put the daily sum column in a separate column family in the
> same table, rather than a separate table.
>
> - Jonathan
>
> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva 
> wrote:
>
> The 1-day aggregate query you are running should work, you might have to
> increase the client thread pool and queue size.
> See
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
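
For reference, the pool size of 128 and queue size of 5000 in the error quoted
below are Phoenix's client-side defaults; a sketch of raising them in the
client's hbase-site.xml (values purely illustrative):

  <property>
    <name>phoenix.query.threadPoolSize</name>
    <value>256</value>
  </property>
  <property>
    <name>phoenix.query.queueSize</name>
    <value>10000</value>
  </property>
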
>
> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi  wrote:
>
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is
>> anywhere from current to 2 years back.
>> I was looking into UPSERT SELECT, but there does not seem to be an easy
>> way to run it. For instance, I thought running something like the below
>> would be optimal:
>>
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
>> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
>> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
>> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
>> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
>> cross_segment_id)
>>
>> Hence I tried running the SELECT statement to see if it works, and started
>> to see the below error:
>> Error: Task
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
>> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
>> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
>> 5153] (state=08000,code=101)
>>
>> When I changed the SELECT query to include a particular sid, the UPSERT
>> SELECT worked.
>> Hence I think the only way would be for me to run the UPSERT to generate
>> daily data for a range of sids or segment_ids.
>>
>> Did I miss something?
>>
>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva 
>> wrote:
>>
>>> Since your PK already leads with (sid, day) I don't think adding a
>>> secondary index will help. Do you generally always run the aggregation
>>> query for the recently inserted data? The row timestamp feature might help
>>> in this case
>>> https://phoenix.apache.org/rowtimestamp.html
>>> If you run the same aggregate queries multiple times then another
>>> approach is to store the coarser daily aggregated data in a separate table
>>> that you can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>>> since you don't have an order by.
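
For reference, the ROW_TIMESTAMP declaration Thomas links to above looks
roughly like the following (names adapted and purely illustrative; whether it
helps depends on how often the queries target recently written data):

CREATE TABLE EVENT_METRICS (created_date DATE NOT NULL,
    metric_id CHAR(15) NOT NULL, metric_value BIGINT,
    CONSTRAINT PK PRIMARY KEY (created_date ROW_TIMESTAMP, metric_id));
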
>>>
>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi  wrote:
>>>
 Here it is
 CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
 [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
 SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
 CLIENT MERGE SORT

 Note: we have a dedicated phoenix query server

 On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby 
 wrote:

> Monil,
>
> Could you please post the results of an EXPLAIN plan of your query?
> For directions how to do this please see
> http://phoenix.apache.org/explainplan.html
>
> Geoffrey Jacoby
>
> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi 
> wrote:
>
>> Hello,
>> Currently we have hourly data in our Phoenix table. However, the
>> schema was designed to perform well for daily data. Increasing the number
>> of rows by 24x has led to degradation of our service over time.
>> Our current schema is as follows
>>
>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>> cid BIGINT NOT NULL, s.count INTEGER,
>> CONSTRAINT PK PRIMARY KEY (sid, day, cid)) COMPRESSION='SNAPPY'
>>
>> The query we run is something along

Re: client does not have phoenix.schema.isNamespaceMappingEnabled

2018-11-27 Thread Josh Elser
To add a non-jar file to the classpath of a Java application, you must 
add the directory containing that file to the classpath.


Thus, the following is wrong: 
HADOOP_CLASSPATH=/usr/hdp/3.0.1.0-187/hbase/lib/hbase-protocol.jar:/etc/hbase/3.0.1.0-187/0/hbase-site.xml


And should be: 
HADOOP_CLASSPATH=/usr/hdp/3.0.1.0-187/hbase/lib/hbase-protocol.jar:/etc/hbase/3.0.1.0-187/0/


Most times, including the output of `hbase mapredcp` is sufficient, a la:

HADOOP_CLASSPATH="$(hbase mapredcp)" hadoop jar ...
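
Putting that together for the bulk load reported below, the invocation would
look roughly like this (paths, table, and ZK hosts copied from the report; not
verified against that cluster):

HADOOP_CLASSPATH="$(hbase mapredcp):/etc/hbase/3.0.1.0-187/0/" \
  hadoop jar /usr/hdp/3.0.1.0-187/phoenix/phoenix-5.0.0.3.0.1.0-187-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table MYTABLE --input /ingest/MYCSV -z zk1,zk2,zk3 -g
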

On 11/27/18 10:48 AM, M. Aaron Bossert wrote:

Folks,

I have, I believe, followed all the directions for turning on namespace 
mapping, as well as the extra steps (added classpath) required to use the 
mapreduce bulk load utility, but am still running into this error... I am 
running a Hortonworks cluster with both HDP v3.0.1 and HDF components.  
Here is what I have tried:


  * Checked that the proper hbase-site.xml (in my case:
/etc/hbase/3.0.1.0-187/0/hbase-site.xml) file is being referenced
when launching the mapreduce utility:


     ...

     <property>
       <name>phoenix.schema.isNamespaceMappingEnabled</name>
       <value>true</value>
     </property>

     <property>
       <name>phoenix.schema.mapSystemTablesToNamespace</name>
       <value>true</value>
     </property>

     ...

  * added the appropriate classpath additions to the hadoop jar command
(zookeeper quorum hostnames changed to remove my corporate network
info as well as data directory):

HADOOP_CLASSPATH=/usr/hdp/3.0.1.0-187/hbase/lib/hbase-protocol.jar:/etc/hbase/3.0.1.0-187/0/hbase-site.xml 
hadoop jar 
/usr/hdp/3.0.1.0-187/phoenix/phoenix-5.0.0.3.0.1.0-187-client.jar 
org.apache.phoenix.mapreduce.CsvBulkLoadTool --table MYTABLE --input 
/ingest/MYCSV -z zk1,zk2,zk3 -g



...


18/11/27 15:31:48 INFO zookeeper.ReadOnlyZKClient: Close zookeeper 
connection 0x1d58d65f to master-1.punch.datareservoir.net:2181 
,master-2.punch.datareservoir.net:2181 
,master-3.punch.datareservoir.net:2181 



18/11/27 15:31:48 INFO log.QueryLoggerDisruptor: Shutting down 
QueryLoggerDisruptor..


Exception in thread "main" java.sql.SQLException: ERROR 726 
(43M10): Inconsistent namespace mapping properties. Cannot initiate 
connection as SYSTEM:CATALOG is found but client does not have 
phoenix.schema.isNamespaceMappingEnabled enabled


at 
org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)


at 
org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)


at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:1113)


at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1501)


at 
org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:2740)


at 
org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:1114)


at 
org.apache.phoenix.compile.CreateTableCompiler$1.execute(CreateTableCompiler.java:192)


at 
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:408)


at 
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:391)


at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)

at 
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:390)


at 
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:378)


at 
org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1806)


at 
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2569)


at 
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2532)


at 
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)


at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2532)


at 
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:255)


at 
org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:150)


at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)

at java.sql.DriverManager.getConnection(DriverManager.java:664)

at java.sql.DriverManager.getConnection(DriverManager.java:208)

at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:400)

at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:392)

at 
org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:206)


at 
org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:180)


at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)

at 
org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:109)


at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMeth

Re: JDBC Connection URL to Phoenix on Azure HDInsight

2018-11-27 Thread Josh Elser
Are you trying to use the thick driver (direct HBase connection) or the 
thin driver (via Phoenix Query Server)? You're providing examples of both.


If the thick driver: by default, I believe that HDI configures a root 
znode of "/hbase", not "/hbase-unsecure", which would be a problem. You 
need to pass your ZK quorum, which makes your 2nd HDI "example" 
nonsensical. You can read the Phoenix home page for information about 
how to construct a Phoenix JDBC URL for the thick driver at 
https://phoenix.apache.org/#connStr


If you are still having troubles, I'd suggest that you clarify exactly 
what you want to do (rather than everything you tried) and include the 
error you ran into.
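
For reference, the two URL shapes look roughly like this (hostnames, port, and
the root znode are placeholders; the actual znode on HDI should be checked in
Ambari under the HBase configs):

Thick driver (ZooKeeper quorum, ZK port, root znode):
jdbc:phoenix:zk0-host,zk1-host,zk2-host:2181:/hbase

Thin driver (a Phoenix Query Server endpoint):
jdbc:phoenix:thin:url=http://pqs-host:8765;serialization=PROTOBUF
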


On 11/27/18 7:54 AM, Raghavendra Channarayappa wrote:

Dear all,

My current problem is fairly trivial, I guess, but I am currently stuck 
on this.


The following was the JDBC connection string for Phoenix on EMR

jdbc:phoenix:thin:url=http://ec2-12-345-678-90.ap-southeast-1.compute.amazonaws.com:8765;serialization=PROTOBUF;autocommit=true

which worked perfectly fine.

But, for Phoenix on Azure HDInsight (which uses HDP 2.6.0), I am unable to 
figure out the equivalent JDBC URL.


The following options *DID NOT* work:
*Azure HDInsight Ambari Dashboard URL as the server_url:* 
jdbc:phoenix:thin:url=http://randomcluterdashboard.azurehdinsight.net:8765;serialization=PROTOBUF;autocommit=true
*Phoenix Query Servers(comma-separated) as the server_url:* 
jdbc:phoenix:wn0-apache.x.tx.internal.cloudapp.net 
,wn1-apache.x.tx.internal.cloudapp.net 
,wn2-apache.x.tx.internal.cloudapp.net 
,wn3-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure
*Zookeeper Quorum(comma-separated) as the server_url:* 
jdbc:phoenix:zk0-apache.x.tx.internal.cloudapp.net 
,zk1-apache.x.tx.internal.cloudapp.net 
,zk2-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure


Where can I find the JDBC URL in the Ambari dashboard?

Thanks in advance,

Raghavendra






Re: Heap Size Recommendation

2018-11-27 Thread Josh Elser
HBASE_HEAPSIZE is just an environment variable which sets the JVM heap 
size. Your question doesn't make any sense to me.
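
In other words, HBASE_HEAPSIZE is the JVM heap setting for the HBase daemons,
so there is no separate "JVM heap size" to raise on top of it. A sketch, in
hbase-env.sh (or the equivalent Ambari heap setting on HDP; the value is
illustrative):

# becomes the HBase daemons' -Xmx
export HBASE_HEAPSIZE=16G
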


On 11/16/18 8:44 AM, Azharuddin Shaikh wrote:

Hi All,

We want to improve the read performance of Phoenix queries, for which we 
are trying to increase HBASE_HEAPSIZE.


Currently we have 32 GB of memory available on the server, where the default 
8 GB is allocated to HBASE_HEAPSIZE (the default JVM heap size). We want 
to increase HBASE_HEAPSIZE to 16 GB, so do we need to increase the JVM heap 
size as well, or is there no need to increase it?


Let us know what the recommendation is for increasing HBASE_HEAPSIZE 
with respect to the JVM heap size. We tried referring to various documents, but 
we have still not found an exact answer to this question.


Thank you,

Azhar



client does not have phoenix.schema.isNamespaceMappingEnabled

2018-11-27 Thread M. Aaron Bossert
Folks,

I have, I believe, followed all the directions for turning on namespace
mapping, as well as the extra steps (added classpath) required to use the
mapreduce bulk load utility, but am still running into this error... I am
running a Hortonworks cluster with both HDP v3.0.1 and HDF components.
Here is what I have tried:


   - Checked that the proper hbase-site.xml (in my case:
   /etc/hbase/3.0.1.0-187/0/hbase-site.xml) file is being referenced when
   launching the mapreduce utility:


...

<property>
  <name>phoenix.schema.isNamespaceMappingEnabled</name>
  <value>true</value>
</property>

<property>
  <name>phoenix.schema.mapSystemTablesToNamespace</name>
  <value>true</value>
</property>

...

   - added the appropriate classpath additions to the hadoop jar command
   (zookeeper quorum hostnames changed to remove my corporate network info as
   well as data directory):

HADOOP_CLASSPATH=/usr/hdp/3.0.1.0-187/hbase/lib/hbase-protocol.jar:/etc/hbase/3.0.1.0-187/0/hbase-site.xml
hadoop jar
/usr/hdp/3.0.1.0-187/phoenix/phoenix-5.0.0.3.0.1.0-187-client.jar
org.apache.phoenix.mapreduce.CsvBulkLoadTool --table MYTABLE --input
/ingest/MYCSV -z zk1,zk2,zk3 -g


...


18/11/27 15:31:48 INFO zookeeper.ReadOnlyZKClient: Close zookeeper
connection 0x1d58d65f to master-1.punch.datareservoir.net:2181,
master-2.punch.datareservoir.net:2181,master-3.punch.datareservoir.net:2181

18/11/27 15:31:48 INFO log.QueryLoggerDisruptor: Shutting down
QueryLoggerDisruptor..

Exception in thread "main" java.sql.SQLException: ERROR 726 (43M10):
Inconsistent
namespace mapping properties. Cannot initiate connection as SYSTEM:CATALOG
is found but client does not have phoenix.schema.isNamespaceMappingEnabled
enabled

at
org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)

at
org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)

at
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:1113)

at
org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1501)

at
org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:2740)

at
org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:1114)

at
org.apache.phoenix.compile.CreateTableCompiler$1.execute(CreateTableCompiler.java:192)

at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:408)

at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:391)

at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)

at
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:390)

at
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:378)

at
org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1806)

at
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2569)

at
org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2532)

at
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)

at
org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2532)

at
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:255)

at
org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:150)

at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)

at java.sql.DriverManager.getConnection(DriverManager.java:664)

at java.sql.DriverManager.getConnection(DriverManager.java:208)

at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:400)

at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:392)

at
org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:206)

at
org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:180)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)

at
org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:109)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:318)

at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

18/11/27 15:31:48 INFO zookeeper.ZooKeeper: Session: 0x3672eebffa800c8
closed

18/11/27 15:31:48 INFO zookeeper.ClientCnxn: EventThread shut down

   - Also tried the other recommended option:

HADOOP_CLASSPATH=$(hbase
mapredcp):/etc/hbase/3.0.1.0-187/0/hbase-site.xml hadoop
jar /usr/hdp/3.0.1.0-187/phoenix/phoenix-5.0.0.3.0.1.0-187-client.jar
org.apache.phoenix.mapreduce.CsvBulkLoadTool --table MYTABLE --input
/ingest/MYCSV -z zk1,zk2,

JDBC Connection URL to Phoenix on Azure HDInsight

2018-11-27 Thread Raghavendra Channarayappa
Dear all,

My current problem is fairly trivial, I guess, but I am currently stuck on
this.

The following was the JDBC connection string for Phoenix on EMR

jdbc:phoenix:thin:url=http://ec2-12-345-678-90.ap-southeast-1.compute.amazonaws.com:8765;serialization=PROTOBUF;autocommit=true

which worked perfectly fine.

But, for Phoenix on Azure HDInsight (which uses HDP 2.6.0), I am unable
to figure out the equivalent JDBC URL.

The following options *DID NOT* work:
*Azure HDInsight Ambari Dashboard URL as the server_url:*
jdbc:phoenix:thin:url=http://randomcluterdashboard.azurehdinsight.net:8765;serialization=PROTOBUF;autocommit=true
*Phoenix Query Servers(comma-separated) as the server_url:*
jdbc:phoenix:wn0-apache.x.tx.internal.cloudapp.net,wn1-apache.x.tx.internal.cloudapp.net,wn2-apache.x.tx.internal.cloudapp.net,wn3-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure
*Zookeeper Quorum(comma-separated) as the server_url:*
jdbc:phoenix:zk0-apache.x.tx.internal.cloudapp.net,zk1-apache.x.tx.internal.cloudapp.net,zk2-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure

Where can I find the JDBC URL in the Ambari dashboard?

Thanks in advance,

Raghavendra


Re: JDBC Connection URL to Phoenix on Azure HDInsight

2018-11-27 Thread Raghavendra Channarayappa
+Samya Ghosh 

On Tue, Nov 27, 2018 at 6:24 PM Raghavendra Channarayappa <
raghavendra.channaraya...@myntra.com> wrote:

> Dear all,
>
> My current problem is fairly trivial, I guess, but I am currently stuck on
> this.
>
> The following was the JDBC connection string for Phoenix on EMR
>
> jdbc:phoenix:thin:url=http://ec2-12-345-678-90.ap-southeast-1.compute.amazonaws.com:8765;serialization=PROTOBUF;autocommit=true
>
> which worked perfectly fine.
>
> But, for Phoenix on Azure HDInsight (which uses HDP 2.6.0), I am unable to 
> figure out the equivalent JDBC URL.
>
> The following options *DID NOT* work:
> *Azure HDInsight Ambari Dashboard URL as the server_url:* 
> jdbc:phoenix:thin:url=http://randomcluterdashboard.azurehdinsight.net:8765;serialization=PROTOBUF;autocommit=true
> *Phoenix Query Servers(comma-separated) as the server_url:* 
> jdbc:phoenix:wn0-apache.x.tx.internal.cloudapp.net,wn1-apache.x.tx.internal.cloudapp.net,wn2-apache.x.tx.internal.cloudapp.net,wn3-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure
> *Zookeeper Quorum(comma-separated) as the server_url:* 
> jdbc:phoenix:zk0-apache.x.tx.internal.cloudapp.net,zk1-apache.x.tx.internal.cloudapp.net,zk2-apache.x.tx.internal.cloudapp.net:2181:/hbase-unsecure
>
> Where can I find the JDBC URL in the Ambari dashboard?
>
> Thanks in advance,
>
> Raghavendra
>
>
>
>
>


Re: Rolling hourly data

2018-11-27 Thread Jonathan Leech
I would try writing the hourly values as 24 columns in a daily row, or as an 
array type.

I’m not up to speed on the latest Phoenix features, but if it could update a 
daily sum on the fly that might be OK. If that doesn’t exist yet or isn’t 
performant, it could be done in an HBase coprocessor.

I would also put the daily sum column in a separate column family in the same 
table, rather than a separate table.

- Jonathan
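
A rough sketch of the daily-row shape Jonathan describes, with made-up table,
column, and family names purely for illustration (the hourly values are shown
as an ARRAY here; 24 individual columns would work the same way, and the daily
sum sits in its own column family):

CREATE TABLE IF NOT EXISTS T1_DAILY (sid BIGINT NOT NULL, day DATE NOT NULL,
    cid BIGINT NOT NULL,
    h.hourly_counts INTEGER ARRAY[24],  -- the 24 hourly values in one daily row
    d.day_count BIGINT,                 -- daily sum in a separate column family
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)) COMPRESSION='SNAPPY';
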

> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva  wrote:
> 
> The 1-day aggregate query you are running should work, you might have to 
> increase the client thread pool and queue size.
> See 
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
> 
>> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi  wrote:
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is anywhere 
>> from current to 2 years back. 
>> I was looking into UPSERT SELECT, but there does not seem to be an easy way 
>> to run it. For instance, I thought running something like the below would be 
>> optimal:
>> 
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach) 
>> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01', 
>> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM 
>> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') 
>> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id, 
>> cross_segment_id)
>> 
>> Hence I tried running the SELECT statement to see if it works, and started 
>> to see the below error:
>> Error: Task 
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected 
>> from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128, 
>> active threads = 128, queued tasks = 5000, completed tasks = 5153] 
>> (state=08000,code=101)
>> 
>> When I changed the SELECT query to include a particular sid, the UPSERT 
>> SELECT worked.
>> Hence I think the only way would be for me to run the UPSERT to generate 
>> daily data for a range of sids or segment_ids.
>> 
>> Did I miss something?
>> 
>>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva  
>>> wrote:
>>> Since your PK already leads with (sid, day) I don't think adding a 
>>> secondary index will help. Do you generally always run the aggregation 
>>> query for the recently inserted data? The row timestamp feature might help 
>>> in this case 
>>> https://phoenix.apache.org/rowtimestamp.html
>>> If you run the same aggregate queries multiple times then another approach 
>>> is to store the coarser daily aggregated data in a separate table that you 
>>> can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT, 
>>> since you don't have an order by.
>>> 
 On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi  wrote:
 Here it is 
 CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4 
 [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420 
 SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
 CLIENT MERGE SORT
 
 Note: we have a dedicated phoenix query server
 
> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby  
> wrote:
> Monil,
> 
> Could you please post the results of an EXPLAIN plan of your query? For 
> directions how to do this please see 
> http://phoenix.apache.org/explainplan.html
> 
> Geoffrey Jacoby
> 
>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi  wrote:
>> Hello,
>> Currently we have hourly data in our Phoenix table. However, the schema 
>> was designed to perform well for daily data. Increasing the number of 
>> rows by 24x has led to degradation of our service over time. 
>> Our current schema is as follows 
>> 
>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, 
>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, 
>> day, cid) ) COMPRESSION='SNAPPY'
>> 
>> The query we run is something along the lines of 
>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4 
>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 
>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>> 
>> Based on our investigation, we have concluded that the main reason is 
>> purely the number of rows that are being read. I am open to other 
>> suggestions.
>> 
>> If the number of rows is indeed the issue, I am wondering if there is a way 
>> to either
>> 1. roll hourly data up to daily using views, a secondary index, or 
>> MapReduce (I know MapReduce is possible), or
>> 2. migrate to a newer schema where cid is not part of the PK and is instead 
>> a column family. I was unable to find any kind of documentation on this.
>> 
>> Thanks
>> Monil