Using Hints in Phoenix

2015-03-09 Thread Matthew Johnson
Hi guys,



This is more of a general question than a problem – but I’m just wondering
if someone can clarify for me what the syntax rules are for hints in
Phoenix. Does it matter where in the query they go? Do they always go
something like *SELECT insert hint x from y*? Or, if the hint is for a
join (eg Sort Merge) does it go in the join part (*SELECT x from y inner
join insert hint z on j = k*)?



Couldn’t seem to find anything specific on this in the docs, and haven’t
worked much with database hints in general so maybe there is a convention
that I am not aware of – apologies if it’s a stupid question!



Cheers,

Matt


Re: Using Hints in Phoenix

2015-03-09 Thread Maryann Xue
Hi Matt,

So far in Phoenix, hints are only supported immediately after the keywords
SELECT, UPSERT and DELETE. The same goes for join queries: it is currently
not possible to hint a particular join algorithm for a specific join node in
a multi-join query. However, a subquery can have its own hints, independent
of the outer query, for example:

SELECT /*+ INDEX(t idx1) */ col1, col2 FROM t
WHERE col3 IN (SELECT /*+ NO_INDEX */ id FROM r WHERE name = 'x')
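
As a rough sketch of the placement rule described above (the table and column names are hypothetical, and the USE_SORT_MERGE_JOIN hint is assumed to be available, which to my knowledge requires Phoenix 4.3 or later), the hint goes directly after the statement keyword rather than inside the JOIN clause:

```sql
-- Hypothetical tables "orders" and "customers", for illustration only.
-- The hint follows SELECT and applies to the statement as a whole;
-- it cannot be attached to an individual JOIN.
SELECT /*+ USE_SORT_MERGE_JOIN */ o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- The same position applies after DELETE (and after UPSERT).
DELETE /*+ NO_INDEX */ FROM orders WHERE status = 'CANCELLED';
```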


Thanks,
Maryann


Re: Phoenix table scan performance

2015-03-09 Thread Mujtaba Chohan
During your scan with the data on a single region server (RS), do you see
the RS blocked on disk I/O due to heavy reads, or the CPU at 100%
utilization? If so, having the data distributed across 2 RS would
effectively cut the time in half.

On Mon, Mar 9, 2015 at 10:01 AM, Yohan Bismuth yohan.bismu...@gmail.com
wrote:

 Hello,
 we're currently using Phoenix 4.2 with HBase 0.98.6 from CDH5.3.2 on our
 cluster and we're experiencing some performance issues.

 We need to do a full table scan over 1 billion rows. We've got 50
 regionservers and approximately 1000 regions of 1 GB each, equally
 distributed across these RS (which means ~20 regions per RS). Each node has
 14 disks and 12 cores.

 A simple SELECT COUNT(1) FROM table currently takes 400~500 sec.

 We noticed that a range scan over 2 regions located on 2 different RS
 seems to run in parallel (taking 15~20 sec), but a range scan over 2
 regions on a single RS takes twice as long (about 30~40 sec). We
 see the same pattern with more than 2 regions.

 *Could this mean that parallelization is done at the regionserver level but
 not at the region level*? In that case 400~500 seconds seems plausible with
 20~25 regions per RS. We expected the regions of a single RS to be scanned
 in parallel. Is this normal behavior, or are we doing something wrong?

 Thanks for your help



Re: Phoenix table scan performance

2015-03-09 Thread Yohan Bismuth
I've been facing this issue for a long time, so I'm pretty sure a major
compaction has already occurred.
Running your query returns 27006.

I have run UPDATE STATISTICS on my table, but this didn't solve my problem.
If I understand correctly, though, these guideposts are used to parallelize
scans within a region, not across regions of the same regionserver, aren't they?

On Mon, Mar 9, 2015 at 6:45 PM, James Taylor jamestay...@apache.org wrote:

 Hi Yohan,
 Have you done a major compaction on your table and are stats generated
 for your table? You can run this to confirm:
 SELECT sum(guide_posts_count) FROM SYSTEM.STATS WHERE
 physical_name = '<your full table name>';

 Phoenix does intra-region parallelization based on these guideposts as
 described briefly here:
 http://phoenix.apache.org/update_statistics.html

 Thanks,
 James




Re: Phoenix table scan performance

2015-03-09 Thread James Taylor
Hi Yohan,
Have you done a major compaction on your table and are stats generated
for your table? You can run this to confirm:
SELECT sum(guide_posts_count) FROM SYSTEM.STATS WHERE
physical_name = '<your full table name>';

Phoenix does intra-region parallelization based on these guideposts as
described briefly here:
http://phoenix.apache.org/update_statistics.html
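
A minimal sketch of the check, assuming a hypothetical table named MY_TABLE (substitute your own full table name, upper-cased unless it was created with quotes):

```sql
-- Collect (or refresh) statistics, which generates the guideposts.
UPDATE STATISTICS MY_TABLE;

-- A non-zero result indicates that guideposts exist for the table.
SELECT SUM(guide_posts_count)
FROM SYSTEM.STATS
WHERE physical_name = 'MY_TABLE';
```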

Thanks,
James

On Mon, Mar 9, 2015 at 10:35 AM, Jerry chiling...@gmail.com wrote:
 Hi Yohan,

 I think your observation is correct. A scan in HBase is sequential by
 default unless you use something like HBASE-10502.

 Best Regards,

 Jerry

 Sent from my iPad



Re: Phoenix table scan performance

2015-03-09 Thread Yohan Bismuth
Sorry, we're not on AWS but on bare metal.

On Mon, Mar 9, 2015 at 6:13 PM, Brady, John john.br...@intel.com wrote:

 Hi Yohan,

 Apologies, I don’t have an answer to your question.

 Could I ask a separate question please? Is your cluster on AWS?

 I have Apache Phoenix installed on a 5-node cluster with 3 ZooKeeper nodes
 on AWS, also using Phoenix 4.2 with HBase 0.98.6 from CDH5.3.2. I put
 the Phoenix server and client jars on the HBase classpath on all nodes and
 restarted the cluster. The Phoenix command line works on the cluster, and
 running a JDBC app on the cluster returns data.

 The problem is that I can’t run a JDBC app from outside the cluster.

 I've read at the link below that there is an issue on AWS where internal
 and external IPs get confused and ZooKeeper can't connect to HBase
 properly. Did you have this problem?

 http://stackoverflow.com/questions/28676561/apache-phoenix-jdbc-connection-zookeeper-error

 As suggested in the link, I worked around this by creating aliases in
 /etc/hosts on the machines in the cluster pointing at the internal IP
 addresses, then using the same aliases on my local desktop but pointing to
 the external IPs. I then altered my cluster setup to use the aliases
 everywhere instead of IP addresses, and I could run the app on my local
 machine. But modifying the Cloudera config files to point to aliases on the
 servers ultimately breaks Cloudera and isn’t a viable long-term solution.

 Thanks

 John




Re: Phoenix table scan performance

2015-03-09 Thread Yohan Bismuth
From what I've seen, we're mostly idle during scans.

On Mon, Mar 9, 2015 at 6:11 PM, Mujtaba Chohan mujt...@apache.org wrote:

 During your scan with the data on a single region server (RS), do you see
 the RS blocked on disk I/O due to heavy reads, or the CPU at 100%
 utilization? If so, having the data distributed across 2 RS would
 effectively cut the time in half.






Phoenix table scan performance

2015-03-09 Thread Yohan Bismuth
Hello,
we're currently using Phoenix 4.2 with HBase 0.98.6 from CDH5.3.2 on our
cluster and we're experiencing some performance issues.

We need to do a full table scan over 1 billion rows. We've got 50
regionservers and approximately 1000 regions of 1 GB each, equally
distributed across these RS (which means ~20 regions per RS). Each node has
14 disks and 12 cores.

A simple SELECT COUNT(1) FROM table currently takes 400~500 sec.

We noticed that a range scan over 2 regions located on 2 different RS seems
to run in parallel (taking 15~20 sec), but a range scan over 2 regions
on a single RS takes twice as long (about 30~40 sec). We see
the same pattern with more than 2 regions.

*Could this mean that parallelization is done at the regionserver level but
not at the region level*? In that case 400~500 seconds seems plausible with
20~25 regions per RS. We expected the regions of a single RS to be scanned
in parallel. Is this normal behavior, or are we doing something wrong?

Thanks for your help


RE: Phoenix table scan performance

2015-03-09 Thread Brady, John
Hi Yohan,

Apologies, I don’t have an answer to your question.

Could I ask a separate question please? Is your cluster on AWS?

I have Apache Phoenix installed on a 5-node cluster with 3 ZooKeeper nodes on
AWS, also using Phoenix 4.2 with HBase 0.98.6 from CDH5.3.2. I put the Phoenix
server and client jars on the HBase classpath on all nodes and restarted the
cluster. The Phoenix command line works on the cluster, and running a JDBC app
on the cluster returns data.

The problem is that I can’t run a JDBC app from outside the cluster.

I've read at the link below that there is an issue on AWS where internal and
external IPs get confused and ZooKeeper can't connect to HBase properly. Did
you have this problem?

http://stackoverflow.com/questions/28676561/apache-phoenix-jdbc-connection-zookeeper-error

As suggested in the link, I worked around this by creating aliases in
/etc/hosts on the machines in the cluster pointing at the internal IP
addresses, then using the same aliases on my local desktop but pointing to the
external IPs. I then altered my cluster setup to use the aliases everywhere
instead of IP addresses, and I could run the app on my local machine. But
modifying the Cloudera config files to point to aliases on the servers
ultimately breaks Cloudera and isn’t a viable long-term solution.

Thanks
John



-
Intel Ireland Limited (Branch)
Collinstown Industrial Park, Leixlip, County Kildare, Ireland
Registered Number: E902934

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: Phoenix table scan performance

2015-03-09 Thread Fulin Sun
Hi Yohan,
What salt bucket value did you specify for your table? Do you have a
monitoring system for HBase where you can check that your table is
well load-balanced? One phenomenon we have observed for this use case is
that using DATA_BLOCK_ENCODING = PREFIX_TREE instead of the default
FAST_DIFF can also greatly improve full table scan performance.
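
As an illustrative sketch only (the table name, columns, and bucket count are hypothetical, and whether PREFIX_TREE helps will depend on your data), both settings can be given as table options at creation time:

```sql
-- Hypothetical table; SALT_BUCKETS and DATA_BLOCK_ENCODING are set as table options.
CREATE TABLE IF NOT EXISTS scan_test (
    id  BIGINT NOT NULL PRIMARY KEY,
    val VARCHAR
) SALT_BUCKETS = 50, DATA_BLOCK_ENCODING = 'PREFIX_TREE';
```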

Thanks,
Sun.





CertusNet 
