Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-16 Thread Amit Mudgal
Hi team,

Apologies for the late reply, but I was trying to load the data into the LINEITEM 
table, and my experience with the older Phoenix 4.7 was not great, even though we do 
have a beefy cluster, as pointed out by my colleague earlier. 

After the jobs completed I saw some erratic behavior: only after dropping the index 
can we see (partial) data in the LINEITEM table, as shown below. It also took me 7 
hours to load the LINEITEM table file; I wanted to check whether it took that long 
for you as well.

0: jdbc:phoenix:stl-colo-srv073.> select count(*) from TPCH.LINEITEM;
+----------+
| COUNT(1) |
+----------+
| 0        |
+----------+
1 row selected (1.648 seconds)
0: jdbc:phoenix:stl-colo-srv073.> drop index L_SHIPDATE_IDX on TPCH.LINEITEM;
No rows affected (6.769 seconds)
0: jdbc:phoenix:stl-colo-srv073.> select count(*) from TPCH.LINEITEM;
+----------+
| COUNT(1) |
+----------+
| 5038442  |
+----------+
1 row selected (122.127 seconds)
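
A possible explanation, not confirmed in this thread: the optimizer may have been 
answering COUNT(*) from the secondary index L_SHIPDATE_IDX, which the bulk load 
appears to have left empty, so dropping the index forced the query back onto the data 
table. An EXPLAIN taken before the drop would show which table is actually scanned, 
along these lines:

-- Hypothetical diagnostic, run before dropping the index:
EXPLAIN SELECT COUNT(*) FROM TPCH.LINEITEM;
-- A plan that scans L_SHIPDATE_IDX rather than TPCH.LINEITEM would indicate that
-- the count was being served from an empty or stale index table.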

Could you please let us know whether you have any stored TPC-H benchmark results, 
along with the system architecture details?

To answer some of the questions posed:

1. We are loading the data directly from HDFS using the phoenix-client's CsvBulkLoad 
tool and then monitoring the jobs to completion. Do you recommend direct CSV upload 
over the HDFS-based bulk load, or is there a preference for one over the other?
2. There were no deleted rows in HBase.
3. With the new cluster, as per your suggestion I did run UPDATE STATISTICS on the 
LINEITEM table but did not get the desired result (see also the sketch after this list):

0: jdbc:phoenix:stl-colo-srv073> update statistics TPCH.LINEITEM  SET
. . . . . . . . . . . . . . . . . . . . . . .> "phoenix.stats.guidepost.width"=1;
No rows affected (122.254 seconds)
0: jdbc:phoenix:stl-colo-srv073> select count(*) from TPCH.LINEITEM;
+----------+
| COUNT(1) |
+----------+
| 5038442  |
+----------+
1 row selected (122.137 seconds)
0: jdbc:phoenix:stl-colo-srv073> 

Did you also face issues while loading the CSV files?
4. We have automatic compression enabled and also applied compression manually to the 
individual tables from the HBase shell.
5. Which queries worked for you on the TPC-H data and which ones didn't, given the 
comment that most of the TPC-H queries work on Phoenix?
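
For reference, a minimal sketch of the statistics and compression steps from points 3 
and 4 above; the 100000000 guidepost width is only an assumed reading of Mujtaba's 
"100M" suggestion (the property is specified in bytes), not a value confirmed in this 
thread:

-- Assumed interpretation of the "100M guidepost width" suggestion; the commands
-- quoted in this thread literally use "=1".
UPDATE STATISTICS TPCH.LINEITEM SET "phoenix.stats.guidepost.width" = 100000000;
-- The row/byte estimates in the plan should now look roughly right.
EXPLAIN SELECT COUNT(*) FROM TPCH.LINEITEM;

-- Compression can also be set from Phoenix as a pass-through HBase property; a major
-- compaction is then triggered from the HBase shell (major_compact 'TPCH.LINEITEM').
ALTER TABLE TPCH.LINEITEM SET COMPRESSION = 'SNAPPY';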

I would greatly appreciate any input on this. 

Many thanks 

Amit




> On Aug 15, 2016, at 9:58 AM, James Taylor  wrote:
> 
> Hi Amit,
> Couple more performance tips on top of what Mujtaba already mentioned:
> - Use the latest Phoenix (4.8.0). There are some great performance
> enhancements in here, especially around usage of DISTINCT. We've also got
> some new encoding schemes to reduce table sizes in our encodecolumns branch
> which will make it into master in a few weeks.
> - Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and
> CREATE INDEX calls. This will greatly reduce the amount of RPC traffic. See
> here for more info[1].
> - Make sure to create secondary indexes to prevent full table scans
> - Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
> - Depending on the use case/query patterns and the schema involved, you may
> want to use multiple column families. This prevents having to scan data
> that you don't need. More info on how to declare column families can be
> found here[2].
> - Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as
> necessary. We haven't yet enhanced our optimizer to do this automatically,
> so some experience in understanding what HBase is doing under the covers
> will help. For example, for point lookups, use the SMALL and SERIAL hint.
> For large scans, use the NO_CACHE hint. For low cardinality columns in the
> PK, try the SKIP_SCAN hint. For more on these hints, see here[3].
> 
> Thanks,
> James
> 
> [1] https://phoenix.apache.org/#Altering
> [2] https://phoenix.apache.org/language/index.html#create_table
> [3] https://phoenix.apache.org/language/index.html#select_statement
> 
> On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan  wrote:
> 
>> Hi Amit,
>> 
>> * What's the heap size of each of your region servers?
>> * Do you see huge amount of disk reads when you do a select count(*) from
>> tpch.lineitem? If yes then try setting snappy compression on your table
>> followed by major compaction
>> * Were there any deleted rows in this table? What's the row count via HBase
>> shell?
>> * What's the schema of your table? How did you load your data?
>> * Update statistics with 100M guidepost width 

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread James Taylor
Hi Aaron,
For commercial distros, you need to talk to the vendor. HDP 2.4.2 has a
very old version of Phoenix (4.4), which is four minor releases back (an eon in
open-source time). If you need something with commercial support, maybe you can get
early access to the next HDP release, but I'd recommend just using
Apache HBase 1.2.0 and Apache Phoenix 4.8.0. That'll give you the most
flexibility. To install Phoenix, you just copy the server lib into the
HBase lib dir. More on that here:
https://phoenix.apache.org/installation.html

For the best hbase-site.xml values, you can ask here. They vary greatly from one
use case to another and require experimentation. Mujtaba gave you
some good info in his response; I'd start by answering all of those questions.

Thanks,
James

On Mon, Aug 15, 2016 at 10:33 AM, Aaron Molitor 
wrote:

> James,
>
> I am working with Amit on this task.  We have switched to a 9-node (8 RS)
> cluster running HDP 2.4.2 with a mostly vanilla install.  I think our next
> steps are to incorporate Mujtaba's changes into our cluster config and
> re-run; we'll factor in your suggestions as well.
>
> Is there a "recommended" HBase config (that I admittedly may have missed)
> documented anywhere?  Clearly looking for something geared toward TPC-H
> type workloads.
>
> Is there a commercially available platform that includes Phoenix 4.8.0
> yet?  If not are there instructions on how to install it on an existing HDP
> 2.4.2 cluster?
>
> Thanks,
> Aaron
>
>
> On Aug 15, 2016, at 11:58, James Taylor  wrote:
>
> Hi Amit,
> Couple more performance tips on top of what Mujtaba already mentioned:
> - Use the latest Phoenix (4.8.0). There are some great performance
> enhancements in here, especially around usage of DISTINCT. We've also got
> some new encoding schemes to reduce table sizes in our encodecolumns branch
> which will make it into master in a few weeks.
> - Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and
> CREATE INDEX calls. This will greatly reduce the amount of RPC traffic. See
> here for more info[1].
> - Make sure to create secondary indexes to prevent full table scans
> - Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
> - Depending on the use case/query patterns and the schema involved, you
> may want to use multiple column families. This prevents having to scan data
> that you don't need. More info on how to declare column families can be
> found here[2].
> - Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as
> necessary. We haven't yet enhanced our optimizer to do this automatically,
> so some experience in understanding what HBase is doing under the covers
> will help. For example, for point lookups, use the SMALL and SERIAL hint.
> For large scans, use the NO_CACHE hint. For low cardinality columns in the
> PK, try the SKIP_SCAN hint. For more on these hints, see here[3].
>
> Thanks,
> James
>
> [1] https://phoenix.apache.org/#Altering
> [2] https://phoenix.apache.org/language/index.html#create_table
> [3] https://phoenix.apache.org/language/index.html#select_statement
>
> On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan 
> wrote:
>
>> Hi Amit,
>>
>> * What's the heap size of each of your region servers?
>> * Do you see huge amount of disk reads when you do a select count(*) from
>> tpch.lineitem? If yes then try setting snappy compression on your table
>> followed by major compaction
>> * Were there any deleted rows in this table? What's the row count via
>> HBase
>> shell?
>> * What's the schema of your table? How did you load your data?
>> * Update statistics with 100M guidepost width and check explain plan after
>> this async task finishes to see if this shows approximately correct row
>> count. update statistics TPCH.LINEITEM  SET
>> "phoenix.stats.guidepost.width"=1;
>>
>> I get the following number with 600M rows (uncompressed - default phoenix
>> fast_diff encoded) in TPCH.LINEITEM on an 8-node HBase 0.98.20 cluster with
>> 12G heap/12+12 core (virtual+physical). Data is fetched from OS page
>> cache.
>>
>> select count(*) from lineitem;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 600037902  |
>> +------------+
>> 1 row selected (*57.012 seconds*)
>>
>> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>>        sum(l_extendedprice) as sum_base_price,
>>        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>>        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>>        avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>>        avg(l_discount) as avg_disc, count(*) as count_order
>> from lineitem
>> where l_shipdate <= current_date() - 90
>> group by l_returnflag, l_linestatus
>> order by l_returnflag, l_linestatus;
>> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
>> | L_RETURNFLAG  | 

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread Aaron Molitor
James, 

I am working with Amit on this task.  We have switched to a 9-node (8 RS) 
cluster running HDP 2.4.2 with a mostly vanilla install.  I think our next steps 
are to incorporate Mujtaba's changes into our cluster config and re-run; we'll 
factor in your suggestions as well.  

Is there a "recommended" HBase config (that I admittedly may have missed) 
documented anywhere?  Clearly looking for something geared toward TPC-H type 
workloads.  

Is there a commercially available platform that includes Phoenix 4.8.0 yet?  If 
not are there instructions on how to install it on an existing HDP 2.4.2 
cluster?

Thanks, 
Aaron


> On Aug 15, 2016, at 11:58, James Taylor  wrote:
> 
> Hi Amit,
> Couple more performance tips on top of what Mujtaba already mentioned:
> - Use the latest Phoenix (4.8.0). There are some great performance 
> enhancements in here, especially around usage of DISTINCT. We've also got 
> some new encoding schemes to reduce table sizes in our encodecolumns branch 
> which will make it into master in a few weeks.
> - Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and CREATE 
> INDEX calls. This will greatly reduce the amount of RPC traffic. See here for 
> more info[1].
> - Make sure to create secondary indexes to prevent full table scans
> - Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
> - Depending on the use case/query patterns and the schema involved, you may 
> want to use multiple column families. This prevents having to scan data that 
> you don't need. More info on how to declare column families can be found 
> here[2].
> - Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as 
> necessary. We haven't yet enhanced our optimizer to do this automatically, so 
> some experience in understanding what HBase is doing under the covers will 
> help. For example, for point lookups, use the SMALL and SERIAL hint. For 
> large scans, use the NO_CACHE hint. For low cardinality columns in the PK, 
> try the SKIP_SCAN hint. For more on these hints, see here[3].
> 
> Thanks,
> James
> 
> [1] https://phoenix.apache.org/#Altering 
> 
> [2] https://phoenix.apache.org/language/index.html#create_table 
> 
> [3] https://phoenix.apache.org/language/index.html#select_statement 
> 
> 
> On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan  wrote:
> Hi Amit,
> 
> * What's the heap size of each of your region servers?
> * Do you see huge amount of disk reads when you do a select count(*) from
> tpch.lineitem? If yes then try setting snappy compression on your table
> followed by major compaction
> * Were there any deleted rows in this table? What's the row count via HBase
> shell?
> * What's the schema of your table? How did you load your data?
> * Update statistics with 100M guidepost width and check explain plan after
> this async task finishes to see if this shows approximately correct row
> count. update statistics TPCH.LINEITEM  SET
> "phoenix.stats.guidepost.width"=1;
> 
> I get the following number with 600M rows (uncompressed - default phoenix
> fast_diff encoded) in TPCH.LINEITEM on an 8-node HBase 0.98.20 cluster with
> 12G heap/12+12 core (virtual+physical). Data is fetched from OS page cache.
> 
> select count(*) from lineitem;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 600037902  |
> +------------+
> 1 row selected (*57.012 seconds*)
> 
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>        sum(l_extendedprice) as sum_base_price,
>        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>        avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>        avg(l_discount) as avg_disc, count(*) as count_order
> from lineitem
> where l_shipdate <= current_date() - 90
> group by l_returnflag, l_linestatus
> order by l_returnflag, l_linestatus;
> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
> | L_RETURNFLAG  | L_LINESTATUS  |    SUM_QTY     |   SUM_BASE_PRICE   |    SUM_DISC_PRICE    |      SUM_CHARGE       | AVG_QTY |  AVG_PRICE | AVG_DISC | COUNT_ORDER |
> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
> | A             | F             | 3775127758     | 5660776097194.45   | 5377736398183.9374   | 5592847429515.927026  | 25.4993 | 38236.1169 | 0.05     | 148047881   |
> | N             | F             | 98553062       | 147771098385.98    | 140384965965.0348    | 145999793032.775829   | 25.5015 | 38237.1993 | 0.0499   | 3864590     |
> | N | O  

Re: Issues while Running Apache Phoenix against TPC-H data

2016-08-15 Thread James Taylor
Hi Amit,
Couple more performance tips on top of what Mujtaba already mentioned:
- Use the latest Phoenix (4.8.0). There are some great performance
enhancements in here, especially around usage of DISTINCT. We've also got
some new encoding schemes to reduce table sizes in our encodecolumns branch
which will make it into master in a few weeks.
- Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and
CREATE INDEX calls. This will greatly reduce the amount of RPC traffic. See
here for more info[1].
- Make sure to create secondary indexes to prevent full table scans
- Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
- Depending on the use case/query patterns and the schema involved, you may
want to use multiple column families. This prevents having to scan data
that you don't need. More info on how to declare column families can be
found here[2].
- Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as
necessary. We haven't yet enhanced our optimizer to do this automatically,
so some experience in understanding what HBase is doing under the covers
will help. For example, for point lookups, use the SMALL and SERIAL hint.
For large scans, use the NO_CACHE hint. For low cardinality columns in the
PK, try the SKIP_SCAN hint. For more on these hints, see here[3].

Thanks,
James

[1] https://phoenix.apache.org/#Altering
[2] https://phoenix.apache.org/language/index.html#create_table
[3] https://phoenix.apache.org/language/index.html#select_statement
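
As a rough illustration of how several of these options fit together (the DEMO.ORDERS 
schema, the "m"/"d" column families, and the 15-minute cache frequency below are 
hypothetical, not taken from this thread), a sketch might look like:

-- Hypothetical table: hot columns in family "m", rarely read columns in family "d",
-- immutable rows, reduced metadata checks, and SNAPPY compression.
CREATE TABLE IF NOT EXISTS DEMO.ORDERS (
    ORDER_ID       BIGINT NOT NULL,
    CUST_ID        BIGINT NOT NULL,
    m.STATUS       CHAR(1),
    d.COMMENT_TXT  VARCHAR,
    CONSTRAINT PK PRIMARY KEY (ORDER_ID, CUST_ID))
    IMMUTABLE_ROWS = true,
    UPDATE_CACHE_FREQUENCY = 900000,   -- milliseconds: re-check metadata at most every 15 min
    COMPRESSION = 'SNAPPY';

-- Secondary index so that customer lookups avoid a full table scan.
CREATE INDEX IF NOT EXISTS ORDERS_CUST_IDX
    ON DEMO.ORDERS (CUST_ID) INCLUDE (m.STATUS)
    UPDATE_CACHE_FREQUENCY = 900000;

-- Hints go right after SELECT: SMALL/SERIAL for a point lookup on the full PK,
-- NO_CACHE for a large scan that should not churn the block cache.
SELECT /*+ SMALL SERIAL */ m.STATUS FROM DEMO.ORDERS WHERE ORDER_ID = 42 AND CUST_ID = 7;
SELECT /*+ NO_CACHE */ CUST_ID, m.STATUS FROM DEMO.ORDERS;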

On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan  wrote:

> Hi Amit,
>
> * What's the heap size of each of your region servers?
> * Do you see huge amount of disk reads when you do a select count(*) from
> tpch.lineitem? If yes then try setting snappy compression on your table
> followed by major compaction
> * Were there any deleted rows in this table? What's the row count via HBase
> shell?
> * What's the schema of your table? How did you load your data?
> * Update statistics with 100M guidepost width and check explain plan after
> this async task finishes to see if this shows approximately correct row
> count. update statistics TPCH.LINEITEM  SET
> "phoenix.stats.guidepost.width"=1;
>
> I get the following number with 600M rows (uncompressed - default phoenix
> fast_diff encoded) in TPCH.LINEITEM on an 8-node HBase 0.98.20 cluster with
> 12G heap/12+12 core (virtual+physical). Data is fetched from OS page cache.
>
> select count(*) from lineitem;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 600037902  |
> +------------+
> 1 row selected (*57.012 seconds*)
>
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>        sum(l_extendedprice) as sum_base_price,
>        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>        avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>        avg(l_discount) as avg_disc, count(*) as count_order
> from lineitem
> where l_shipdate <= current_date() - 90
> group by l_returnflag, l_linestatus
> order by l_returnflag, l_linestatus;
> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
> | L_RETURNFLAG  | L_LINESTATUS  |    SUM_QTY     |   SUM_BASE_PRICE   |    SUM_DISC_PRICE    |      SUM_CHARGE       | AVG_QTY |  AVG_PRICE | AVG_DISC | COUNT_ORDER |
> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
> | A             | F             | 3775127758     | 5660776097194.45   | 5377736398183.9374   | 5592847429515.927026  | 25.4993 | 38236.1169 | 0.05     | 148047881   |
> | N             | F             | 98553062       | 147771098385.98    | 140384965965.0348    | 145999793032.775829   | 25.5015 | 38237.1993 | 0.0499   | 3864590     |
> | N             | O             | 7651423419     | 11473321691083.27  | 10899667121317.2466  | 11335664103186.27932  | 25.4998 | 38236.9914 | 0.0499   | 300058170   |
> | R             | F             | 3.77572497E+9  | 5661603032745.34   | 5378513563915.4097   | 5593662252666.916161  | 25.5    | 38236.6972 | 0.05     | 148067261   |
> +---------------+---------------+----------------+--------------------+----------------------+-----------------------+---------+------------+----------+-------------+
> 4 rows selected (*146.677 seconds*)
>
> explain select count(*) from lineitem;
> +----------------------------------------------------------------------------------------------+
> |                                             PLAN                                              |
> +----------------------------------------------------------------------------------------------+
> | CLIENT 4204-CHUNK 589522747 ROWS 409200095701 BYTES PARALLEL 1-WAY FULL SCAN OVER LINEITEM    |
> | SERVER FILTER BY FIRST KEY