Will the 10-100 million records get larger over time? Or is that just for a single user, and you plan to have many users? If not, have you considered using a standard RDBMS like MySQL, Postgres, or MariaDB?
Thanks,
James

On Tuesday, April 12, 2016, Amit Shah <[email protected]> wrote:

> Thanks James for the reply. Please see my comments below.
>
>> Secondary indexes[1] on the non-primary key columns are the way to improve
>> performance for these cases. Take a look at this[2] presentation for more
>> detail.
>
> I have done a brief reading on secondary indexes and I will go through the
> video for a detailed study. I understand that they can help improve
> performance, but they will probably not be a good fit for the OLAP workload
> we plan to use HBase + Phoenix for: with OLAP, we have no control over the
> queries that are executed during interactive analysis. Let me know if you
> have thoughts on this.
>
>> I'd recommend at least 6 nodes, and 10-20 nodes would be better in your
>> test environment.
>
> We are in a proof-of-concept phase and have not yet finalized how big the
> HBase cluster will be in production. Initially we plan to start with 4-5
> nodes. Our data size would be 10-100 million records (not in the billions
> for sure). Do you see a reason we should increase the cluster size?
>
>> Have you seen our Pherf tool[3] that will help you benchmark your queries
>> under representative data sizes?
>
> I will look at the tool.
>
> Thanks again for sharing your inputs.
>
> On Mon, Apr 11, 2016 at 9:29 PM, James Taylor <[email protected]> wrote:
>
>> Hi Amit,
>>
>> If a query doesn't filter on the primary key columns, the entire table
>> must be scanned (hence it'll be slower). Secondary indexes[1] on the
>> non-primary key columns are the way to improve performance for these
>> cases. Take a look at this[2] presentation for more detail.
>>
>> Also, a 3-node cluster is not ideal for benchmarking, but it might be
>> fine just to play around with. There's always overhead in a distributed
>> system that you won't see in a more typical single-node RDBMS. If you can
>> mimic the cluster size you'll use in production, that'd be ideal.
>> Otherwise, I'd recommend at least 6 nodes, and 10-20 nodes would be
>> better in your test environment. Have you seen our Pherf tool[3] that
>> will help you benchmark your queries under representative data sizes?
>>
>> Thanks,
>> James
>>
>> [1] https://phoenix.apache.org/secondary_indexing.html
>> [2] https://www.youtube.com/watch?v=f4Nmh5KM6gI&feature=youtu.be
>> [3] https://phoenix.apache.org/pherf.html
>>
>> On Mon, Apr 11, 2016 at 6:37 AM, Amit Shah <[email protected]> wrote:
>>
>>> Hi Mujtaba,
>>>
>>> I observed that if the WHERE clause and GROUP BY are applied on the
>>> primary key columns, the queries are extremely fast (~200 ms). This is
>>> not the case for queries with non-primary key columns in the WHERE
>>> clause or GROUP BY. I tried configuring the bucket cache, but
>>> surprisingly it doesn't give as much of a speed improvement as I had
>>> expected: one of these queries takes 26 secs with the bucket cache and
>>> 31 secs without it on a 3-node HBase cluster.
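>>> For reference, the bucket cache setup was along these lines. This is a
>>> minimal hbase-site.xml sketch with illustrative values rather than our
>>> exact settings (the offheap engine also needs HBASE_OFFHEAPSIZE set in
>>> hbase-env.sh):
>>>
>>> <!-- use off-heap memory for the L2 block cache -->
>>> <property>
>>>   <name>hbase.bucketcache.ioengine</name>
>>>   <value>offheap</value>
>>> </property>
>>> <!-- cache capacity; values >= 1.0 are read as megabytes -->
>>> <property>
>>>   <name>hbase.bucketcache.size</name>
>>>   <value>4096</value>
>>> </property>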
>>>
>>> Let me know if you have suggestions that we could try out.
>>>
>>> Regards,
>>> Amit.
>>>
>>> On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah <[email protected]> wrote:
>>>
>>>> Hi Mujtaba,
>>>>
>>>> Could these improvements be because of region distribution across
>>>> region servers? Along with the optimizations you suggested, I also
>>>> used hbase-region-inspector to move regions evenly across the region
>>>> servers.
>>>>
>>>> Below is the table schema for the TRANSACTIONS table:
>>>>
>>>> CREATE TABLE TRANSACTIONS_TEST (
>>>>     AGENT_ID BIGINT
>>>>     ,A_NAME VARCHAR
>>>>     ,A_ROLE VARCHAR
>>>>     ,TERRITORY_ID BIGINT
>>>>     ,T_ZIPCODE BIGINT
>>>>     ,T_PLACE_NAME VARCHAR
>>>>     ,GRPBYCF.T_STATE VARCHAR
>>>>     ,GRPBYCF.T_COUNTRY VARCHAR
>>>>     ,PRODUCT_ID BIGINT NOT NULL
>>>>     ,P_NAME VARCHAR
>>>>     ,P_CATEGORY VARCHAR
>>>>     ,CHANNEL_ID BIGINT
>>>>     ,CH_NAME VARCHAR
>>>>     ,CH_TYPE VARCHAR
>>>>     ,CUSTOMER_ID BIGINT NOT NULL
>>>>     ,CS_NAME VARCHAR
>>>>     ,CS_TYPE VARCHAR
>>>>     ,IS_NEW_CUSTOMER BOOLEAN
>>>>     ,CLOSE_DATE DATE
>>>>     ,DAY_CNT_SPENT INTEGER
>>>>     ,TOTAL_EXPENSES BIGINT
>>>>     ,FORCASTED_SALES BIGINT
>>>>     ,GRPBYCF.UNIT_CNT_SOLD BIGINT
>>>>     ,PRICE_PER_UNIT BIGINT
>>>>     ,DISCOUNT_PERCENT BIGINT
>>>>     ,GRPBYCF.TOTAL_SALES BIGINT
>>>>     ,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
>>>> ) COMPRESSION='SNAPPY';
>>>>
>>>> I will try out the guidepost width reduction and let you know the
>>>> results.
>>>>
>>>> Thank you,
>>>> Amit.
>>>>
>>>> On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <[email protected]> wrote:
>>>>
>>>>> The optimizations did help somewhat, but not to the extent I was
>>>>> expecting. See chart below.
>>>>>
>>>>> [image: Inline image 1]
>>>>>
>>>>> Can you share your table schema so I can experiment with it? Another
>>>>> thing you can try is reducing the guidepost width
>>>>> <https://phoenix.apache.org/tuning.html> for this table by executing:
>>>>>
>>>>> UPDATE STATISTICS TRANSACTIONS SET
>>>>> "phoenix.stats.guidepost.width"=50000000;
>>>>>
>>>>> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <[email protected]> wrote:
>>>>>
>>>>>> Hi Mujtaba,
>>>>>>
>>>>>> I did try the two optimization techniques, recreating the table and
>>>>>> loading it again with 10 million records. They do not seem to help
>>>>>> much in terms of timings. Kindly find the Phoenix log file attached.
>>>>>> Let me know if I am missing anything.
>>>>>>
>>>>>> Thanks,
>>>>>> Amit.
>>>>>>
>>>>>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <[email protected]> wrote:
>>>>>>
>>>>>>> Here's the chart for the time each of the parallel scans takes
>>>>>>> after the split. On the RS where data is not read from disk, the
>>>>>>> scan gets back in ~20 secs, but for the RS that hosts 6 regions
>>>>>>> it's ~45 secs.
>>>>>>>
>>>>>>> [image: Inline image 2]
>>>>>>>
>>>>>>>> Yes I see disk reads with 607 IOs/second on the host that stores
>>>>>>>> 6 regions
>>>>>>>
>>>>>>> Two things you should try to reduce disk reads, or maybe a
>>>>>>> combination of both:
>>>>>>>
>>>>>>> 1. Keep only the columns used in your GROUP BY query in a separate
>>>>>>> column family:
>>>>>>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>>>>>>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>>>>>
>>>>>>> 2. Turn on Snappy compression for your table:
>>>>>>> ALTER TABLE T SET COMPRESSION='SNAPPY'
>>>>>>> followed by a major compaction.
>>>>>>>
>>>>>>>> I tried to compact the table from the hbase web UI
>>>>>>>
>>>>>>> You need to run *major_compact* from the HBase shell. From the UI
>>>>>>> it's a minor compaction.
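>>>>>>> For example, something like this (assuming TRANSACTIONS is the
>>>>>>> underlying HBase table name that Phoenix created):
>>>>>>>
>>>>>>> hbase> major_compact 'TRANSACTIONS'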
>>>>>>>
>>>>>>> - mujtaba
>>>>>>>
>>>>>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Mujtaba and James for replying back.
>>>>>>>>
>>>>>>>> Mujtaba, below are the details for your follow-up queries.
>>>>>>>>
>>>>>>>>> 1. How wide is your table?
>>>>>>>>
>>>>>>>> I have 26 columns in the TRANSACTIONS table, with a couple of
>>>>>>>> columns combined as the composite primary key.
>>>>>>>>
>>>>>>>>> 2. How many region servers is your data distributed on and what's
>>>>>>>>> the heap size?
>>>>>>>>
>>>>>>>> When I posted the initial readings of the query taking around 2
>>>>>>>> minutes, I had one region server storing 4 regions of the 10
>>>>>>>> million record TRANSACTIONS table. The heap size on the master
>>>>>>>> server is 1 GB, while the region server has a 3.63 GB heap setting.
>>>>>>>>
>>>>>>>> Later I added 2 more region servers to the cluster and configured
>>>>>>>> them as data nodes and region servers. After this step, the regions
>>>>>>>> got split across two region servers, with 2 on one region server
>>>>>>>> and 6 on another. I didn't follow what action caused this region
>>>>>>>> split, or whether it was done automatically by HBase (the load
>>>>>>>> balancer?).
>>>>>>>>
>>>>>>>>> 3. Do you see lots of disk I/O on region servers during
>>>>>>>>> aggregation?
>>>>>>>>
>>>>>>>> Yes, I see disk reads with 607 IOs/second on the host that stores
>>>>>>>> 6 regions. Kindly find the disk I/O statistics attached as images.
>>>>>>>>
>>>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>>
>>>>>>>> I tried to compact the table from the HBase web UI. For some
>>>>>>>> reason, the compaction table attribute on the web UI is still shown
>>>>>>>> as NONE. After these changes, the query time is down to *42 secs*.
>>>>>>>> Is compression different from compaction? Would the query
>>>>>>>> performance improve by compressing the data with one of the
>>>>>>>> algorithms? Logically it doesn't sound right, though.
>>>>>>>>
>>>>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>>>>> reply back with the phoenix.log created by executing your query
>>>>>>>>> in sqlline?
>>>>>>>>
>>>>>>>> After replacing log4j.properties, I have captured the logs for the
>>>>>>>> GROUP BY query execution and attached them.
>>>>>>>>
>>>>>>>> James,
>>>>>>>> If I follow the queries that you pasted, I see the index getting
>>>>>>>> used, but if I explain the query plan on the pre-loaded
>>>>>>>> TRANSACTIONS table, I do not see the index being used. Probably the
>>>>>>>> query plan changes based on whether the table has data or not.
>>>>>>>>
>>>>>>>> The query time is down to 42 secs right now. Let me know if you
>>>>>>>> have more suggestions on how to improve it further.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amit.
>>>>>>>>
>>>>>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Amit,
>>>>>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>>>>>>> that query (see below). An index will help some, as the
>>>>>>>>> aggregation can be done in place while the scan over the index is
>>>>>>>>> occurring (as opposed to having to hold the distinct values found
>>>>>>>>> during grouping in memory per chunk of work and sort each chunk on
>>>>>>>>> the client). It's not going to prevent the entire index from being
>>>>>>>>> scanned, though. You'll need a WHERE clause to prevent that.
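>>>>>>>>> For example, a filter on the leading index column (the country
>>>>>>>>> value here is just illustrative) should turn the full index scan
>>>>>>>>> into a range scan over the index:
>>>>>>>>>
>>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
>>>>>>>>> FROM TRANSACTIONS
>>>>>>>>> WHERE T_COUNTRY = 'US'
>>>>>>>>> GROUP BY T_COUNTRY;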
>>>>>>>>>
>>>>>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>>>>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer,
>>>>>>>>> T_COUNTRY varchar);
>>>>>>>>> No rows affected (1.32 seconds)
>>>>>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>>>>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>> No rows affected (6.452 seconds)
>>>>>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>>>>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>>>
>>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>>> | PLAN                                                                     |
>>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>>>>>> | CLIENT MERGE SORT                                                        |
>>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>>> 3 rows selected (0.028 seconds)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That seems excessively slow for 10M rows, which should take on
>>>>>>>>>> the order of a few seconds at most even without an index.
>>>>>>>>>> 1. How wide is your table?
>>>>>>>>>> 2. How many region servers is your data distributed on and
>>>>>>>>>> what's the heap size?
>>>>>>>>>> 3. Do you see lots of disk I/O on region servers during
>>>>>>>>>> aggregation?
>>>>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>>>>
>>>>>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>>>>>> reply back with the phoenix.log created by executing your query
>>>>>>>>>> in sqlline?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Mujtaba
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and
>>>>>>>>>>> Phoenix (version 4.6) deployed through Cloudera for our OLAP
>>>>>>>>>>> workload. I have a table that has 10 million rows. I try to
>>>>>>>>>>> execute the below roll-up query and it takes around 2 minutes
>>>>>>>>>>> to return 1,850 rows.
>>>>>>>>>>>
>>>>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>>>>>> GROUP BY T_COUNTRY;
>>>>>>>>>>>
>>>>>>>>>>> I tried applying the "joining with indices" example given on
>>>>>>>>>>> the website <https://phoenix.apache.org/joins.html> to the
>>>>>>>>>>> TRANSACTIONS table by creating an index on the grouped-by
>>>>>>>>>>> column as below, but that doesn't help.
>>>>>>>>>>>
>>>>>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS
>>>>>>>>>>> (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>>>>
>>>>>>>>>>> This index is not getting used when the query is executed.
>>>>>>>>>>> The query plan is as below:
>>>>>>>>>>>
>>>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>>> | PLAN                                                        |
>>>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>>>>>>> | CLIENT MERGE SORT                                           |
>>>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>>>
>>>>>>>>>>> Theoretically, can secondary indexes help improve the
>>>>>>>>>>> performance of GROUP BY queries?
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions on which options in Phoenix I could try out to
>>>>>>>>>>> speed up GROUP BY queries?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Amit.
