CSV bulk loading question

2015-03-10 Thread Bulvik, Noam
Hi,

We are using the CSV bulk loading (MapReduce) tool to load our data. We have a
table with 50 columns, and we did some testing to understand the factors
affecting loading performance.
We compared two cases
A - each column in the data becomes a column in the HBase table
B - all non-key columns are concatenated into one column in the HBase table

We saw that the second option was 7 times faster than the first one and
consumed less CPU.

Does this make sense? Can we do something to tune the system so that option A
runs faster? (We prefer option A because it lets us query and filter on all
data columns.)
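
For concreteness, the two layouts being compared look roughly like this (a
sketch only; table and column names are hypothetical):

-- Option A: every data field maps to its own column.
CREATE TABLE data_a (
    pk    VARCHAR PRIMARY KEY,
    col1  VARCHAR,
    col2  VARCHAR,
    -- ... col3 through col48 elided ...
    col49 VARCHAR
);

-- Option B: all non-key fields packed into a single column.
CREATE TABLE data_b (
    pk      VARCHAR PRIMARY KEY,
    payload VARCHAR
);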


Regards,

Noam Bulvik






Re: Re: Phoenix table scan performance

2015-03-10 Thread Fulin Sun
No salting? That may be the cause of the issue. Note that a hot-region problem
may occur and degrade performance considerably.

Thanks,
Sun.





CertusNet 

From: Yohan Bismuth
Date: 2015-03-10 17:36
To: user
Subject: Re: Phoenix table scan performance
Hi, thanks for your answer.
We're not using DATA_BLOCK_ENCODING, and no salting either. After checking, I
see that we currently have 20 to 36 regions per region server. The balancing
may not be perfect, but I'm not sure this is the root of all the issues...
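
For reference, salting is declared when a Phoenix table is created; a minimal
sketch with hypothetical names:

CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts   DATE NOT NULL,
    val  DOUBLE,
    CONSTRAINT pk PRIMARY KEY (host, ts)
) SALT_BUCKETS = 16;  -- pre-splits rows across 16 buckets to avoid a hot region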


RE: installing Phoenix using AWS-EMR

2015-03-10 Thread Brady, John
The instructions are also on the Phoenix site - the main question is how to
install the latest Phoenix release.

http://phoenix.apache.org/phoenix_on_emr.html


From: Brady, John
Sent: Tuesday, March 10, 2015 9:56 AM
To: 'user@phoenix.apache.org'
Subject: installing Phoenix using AWS-EMR

Hi,

I'm installing Phoenix using AWS-EMR as described at the end of this link

http://java.dzone.com/articles/apache-phoenix-sql-driver

During setup you define the bootstrap action to load Phoenix onto your AWS EMR
cluster.

I have two questions.


1.   The bootstrap action installs Phoenix 2.2.0 - how do you get it to
install the latest release?


2.   If you're using a JDBC connection, which IP address should you connect
to? Is it the master node?

Thanks

This is my code:

import java.sql.*;

public class PhoenixJDBC {

    public static void main(String args[]) {

        try {
            // Register the Phoenix JDBC driver
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();

            Connection conn =
                DriverManager.getConnection("jdbc:phoenix:54.xx.xx.xx", "", "");

            // Create a Statement to execute the SQL statement
            Statement stmt = conn.createStatement();

            // Execute the SQL statement and get the results in a ResultSet
            ResultSet rs = stmt.executeQuery("select * from US_POPULATION");

            // Iterate through the ResultSet, printing the host column
            // for each row using the getString method
            while (rs.next())
                System.out.println("Name= " + rs.getString("host"));
        }
        catch (SQLException e) {
            e.printStackTrace();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}




Re: Re: Phoenix table scan performance

2015-03-10 Thread Yohan Bismuth
I'm not sure about this. I tested range scans over region servers with almost
no activity, and I had the same issue.

On Tue, Mar 10, 2015 at 10:47 AM, Fulin Sun su...@certusnet.com.cn wrote:

 No salting? That may be the cause of the issue. Note that a hot-region
 problem may occur and degrade performance considerably.

 Thanks,
 Sun.


 CertusNet


 From: Yohan Bismuth yohan.bismu...@gmail.com
 Date: 2015-03-10 17:36
 To: user user@phoenix.apache.org
 Subject: Re: Phoenix table scan performance
 Hi, thanks for your answer.
 We're not using DATA_BLOCK_ENCODING, and no salting either. After checking, I
 see that we currently have 20 to 36 regions per region server. The balancing
 may not be perfect, but I'm not sure this is the root of all the issues...




RE: installing Phoenix using AWS-EMR

2015-03-10 Thread Matthew Johnson
Hi John,



I haven’t used the EMR bootstrap at all so I’m not sure this is helpful, but
generally to upgrade the version of Phoenix you just remove the
phoenix-x.x.x-server.jar from the HBase libs folder and replace it with the
version that you want, and restart HBase. The big caveat to this though is
that the versions of HBase and Phoenix need to be compatible, so if the
bootstrap is installing an older version of HBase, it may not be possible
to use a newer version of Phoenix with it – you will have to check the
Phoenix release notes to ensure the versions are compatible.



The JDBC URL should point to your ZooKeeper quorum, as a comma-separated
list. So if you have ZooKeeper running on a 3-server cluster, you would use
something like:



jdbc:phoenix:zkmachine1,zkmachine2,zkmachine3



Hope that helps.



Cheers,

Matt



From: Brady, John [mailto:john.br...@intel.com]
Sent: 10 March 2015 10:02
To: user@phoenix.apache.org
Subject: RE: installing Phoenix using AWS-EMR



The instructions are also on the Phoenix site – the main question is how to
install the latest Phoenix release.



http://phoenix.apache.org/phoenix_on_emr.html





From: Brady, John
Sent: Tuesday, March 10, 2015 9:56 AM
To: 'user@phoenix.apache.org'
Subject: installing Phoenix using AWS-EMR



Hi,



I’m installing Phoenix using AWS-EMR as described at the end of this link



http://java.dzone.com/articles/apache-phoenix-sql-driver



During setup you define the bootstrap action to load Phoenix onto your AWS
EMR cluster.



I have two questions.



1.   The bootstrap action installs Phoenix 2.2.0 – how do you get it to
install the latest release?



2.   If you're using a JDBC connection, which IP address should you
connect to? Is it the master node?



Thanks



This is my code:



import java.sql.*;

public class PhoenixJDBC {

    public static void main(String args[]) {

        try {
            // Register the Phoenix JDBC driver
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();

            Connection conn =
                DriverManager.getConnection("jdbc:phoenix:54.xx.xx.xx", "", "");

            // Create a Statement to execute the SQL statement
            Statement stmt = conn.createStatement();

            // Execute the SQL statement and get the results in a ResultSet
            ResultSet rs = stmt.executeQuery("select * from US_POPULATION");

            // Iterate through the ResultSet, printing the host column
            // for each row using the getString method
            while (rs.next())
                System.out.println("Name= " + rs.getString("host"));
        }
        catch (SQLException e) {
            e.printStackTrace();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}







installing Phoenix using AWS-EMR

2015-03-10 Thread Brady, John
Hi,

I'm installing Phoenix using AWS-EMR as described at the end of this link

http://java.dzone.com/articles/apache-phoenix-sql-driver

During setup you define the bootstrap action to load Phoenix onto your AWS EMR
cluster.

I have two questions.


1.   The bootstrap action installs Phoenix 2.2.0 - how do you get it to
install the latest release?


2.   If you're using a JDBC connection, which IP address should you connect
to? Is it the master node?

Thanks

This is my code:

import java.sql.*;

public class PhoenixJDBC {

    public static void main(String args[]) {

        try {
            // Register the Phoenix JDBC driver
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();

            Connection conn =
                DriverManager.getConnection("jdbc:phoenix:54.xx.xx.xx", "", "");

            // Create a Statement to execute the SQL statement
            Statement stmt = conn.createStatement();

            // Execute the SQL statement and get the results in a ResultSet
            ResultSet rs = stmt.executeQuery("select * from US_POPULATION");

            // Iterate through the ResultSet, printing the host column
            // for each row using the getString method
            while (rs.next())
                System.out.println("Name= " + rs.getString("host"));
        }
        catch (SQLException e) {
            e.printStackTrace();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}




RE: Creating multiple views off existent HBase table.

2015-03-10 Thread Matthew Johnson
Hi Sergey,



I was following your post as I wanted to do multiple views, and the
following is my experience:





Does not work (Undefined column family. familyName=ONBOARDING.null):



create view Persistance_v8 (
    rowid VARCHAR PRIMARY KEY,
    onboarding.dealer_landscape_viewed boolean,
    preferences.dealer_ids_order varchar
)
AS SELECT * FROM persistance
where dealer_landscape_viewed is null





Does work (returns expected results when queried):



create view Persistance_v8 (
    rowid VARCHAR PRIMARY KEY,
    onboarding.dealer_landscape_viewed boolean,
    preferences.dealer_ids_order varchar
)
AS SELECT * FROM persistance
where onboarding.dealer_landscape_viewed is null





So it seems that for me, quoting and including the column family in the
where clause of the view select is what fixes it (without quoting and
including the CF, Phoenix uppercases the CF and then complains about
CF.null).



However, I see in your Jira that you already have the CF quoted and
included, so I’m not sure why it works for me but not for you.



Cheers,

Matt







From: Sergey Belousov [mailto:sergey.belou...@gmail.com]
Sent: 09 March 2015 04:12
To: user@phoenix.apache.org
Subject: Re: Creating multiple views off existent HBase table.



Done: https://issues.apache.org/jira/browse/PHOENIX-1714
Let me know if it's unclear or more clarification is needed.

On Mar 8, 2015 11:47 PM, James Taylor jamestay...@apache.org wrote:

If you could break it down a bit, that'd be helpful, Sergey. It's a known
limitation that we don't support aggregation in views. The other stuff
should work, though. File one JIRA per issue you find.

Thanks,

James



On Sun, Mar 8, 2015 at 6:58 PM, Sergey Belousov sergey.belou...@gmail.com
wrote:

Hi James



Would you be able to shed some light on the issues I am having with the WHERE
part of CREATE VIEW ... AS, as described in my prior message?

I am kind of stuck here :(



Thank you.



On Fri, Mar 6, 2015 at 7:09 PM, Sergey Belousov sergey.belou...@gmail.com
wrote:

And in addition for CREATE VIEW ... AS SELECT ...



your example

create view v2(a VARCHAR PRIMARY KEY, f1.c INTEGER) as select * from t1;



It seems that the AS SELECT part is very limited. It looks like I cannot
specify columns or any WHERE clause there, which greatly limits the purpose of
the VIEW. For example,

I want to create a VIEW for each CQ (column qualifier) of an existing HBase
table column family.



In HBase shell:

create 't1'

put 't1','r1','f1:b','r1f1b'

put 't1','r1','f1:c','r1f1c'

put 't1','r2','f1:c','r2f1c'

put 't1','r3','f1:d','r3f1d'



hbase(main):041:0> scan 't1'
ROW                COLUMN+CELL
 r1                column=f1:b, timestamp=1425686025521, value=r1f1b
 r1                column=f1:c, timestamp=1425686025545, value=r1f1c
 r2                column=f1:c, timestamp=1425686025565, value=r2f1c
 r3                column=f1:d, timestamp=1425686025581, value=r3f1d


In Phoenix sqlline:

create view vB(a VARCHAR PRIMARY KEY, f1.b VARCHAR) as select * from t1;

create view vC(a VARCHAR PRIMARY KEY, f1.c VARCHAR) as select * from t1;

create view vD(a VARCHAR PRIMARY KEY, f1.d VARCHAR) as select * from t1;



So basically, when I do

select * from vB;

I do not want to see nulls, but I can only achieve that by running
SELECT * FROM vB where b is not null. Even though this does not error:

create view vB(a VARCHAR PRIMARY KEY, f1.b VARCHAR) as select * from t1
where f1.b is not null;



here is the result I am getting:



select * from vB;
Error: ERROR 1001 (42I01): Undefined column family. familyName=F1.null
(state=42I01,code=1001)

select a,b from vB;
Error: ERROR 1001 (42I01): Undefined column family. familyName=F1.null
(state=42I01,code=1001)



Any ideas how to do this, or why the AS SELECT part does not accept the same
SELECT you can run when querying the VIEW?



Thank you



On Thu, Mar 5, 2015 at 10:36 PM, Sergey Belousov sergey.belou...@gmail.com
wrote:

Thank you James



This one works for me too, but my case is a little more complex.

my key is

4 bytes + 4 bytes + 4 bytes + 8 bytes + array[0..N of 8-byte members]

I managed to project the 4-byte + 4-byte + 4-byte + 8-byte prefix properly, but
the array[0..N of 8-byte members] part keeps fighting me :)



Is it possible to have an array of elements like
\x00\x00\x00\x01\x00\x00\x00\x01, and what is the proper way to surface it in
the view so that I can have a WHERE clause on it with a regex over the bytes?
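
A sketch of how the fixed-width prefix of such a key might be projected,
following the view pattern used elsewhere in this thread (names and types are
illustrative assumptions; UNSIGNED_INT and UNSIGNED_LONG map to 4- and 8-byte
big-endian values, and the trailing variable-length array has no direct
equivalent here):

create view v_key (
    part1 UNSIGNED_INT NOT NULL,   -- first 4 bytes of the row key
    part2 UNSIGNED_INT NOT NULL,   -- next 4 bytes
    part3 UNSIGNED_INT NOT NULL,   -- next 4 bytes
    part4 UNSIGNED_LONG NOT NULL,  -- next 8 bytes
    CONSTRAINT pk PRIMARY KEY (part1, part2, part3, part4)
) as select * from t1;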









On Thu, Mar 5, 2015 at 3:44 PM, James Taylor jamestay...@apache.org wrote:

This worked fine for me.
In HBase shell:
create 't1', {NAME => 'f1'}

In Phoenix sqlline:
create view v1(a VARCHAR PRIMARY KEY, f1.b INTEGER) as select * from t1;
create view v2(a VARCHAR PRIMARY KEY, f1.c INTEGER) as select * from t1;

If your table name and column family name are not upper case, make
sure you surround them by double quotes when referencing in your
statement. You can also declare a default column family like this
instead:

create view 

Re: Using Hints in Phoenix

2015-03-10 Thread Maryann Xue
Hi Matt,

Which version of Phoenix are you using? Sort-merge join is only available
in Phoenix 4.3.


Thanks,
Maryann

On Tue, Mar 10, 2015 at 6:11 AM, Matthew Johnson matt.john...@algomi.com
wrote:

 Hi Maryann,



 Thanks for clarifying that for me. I’ve been playing with the hint
 USE_SORT_MERGE_JOIN to try to solve an issue when joining two very
 large tables:



 Error: Encountered exception in sub plan [0] execution.

 SQLState:  null

 ErrorCode: 0



 Which I believe is related to a lack of memory for building the hash
 table? I thought that using SELECT /*+ USE_SORT_MERGE_JOIN */ would make
 the join much slower but would work on an unlimited data set – did I
 misunderstand? The explain plan does not change with or without this hint:



 CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER mytable1

 SERVER AGGREGATE INTO SINGLE ROW

 PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)

 CLIENT 15-CHUNK PARALLEL 1-WAY FULL SCAN OVER mytable2



 Cheers,

 Matt





 From: Maryann Xue [mailto:maryann@gmail.com]
 Sent: 09 March 2015 15:00
 To: user@phoenix.apache.org
 Subject: Re: Using Hints in Phoenix



 Hi Matt,



 So far in Phoenix, hints are only supported as specified right after
 keywords SELECT, UPSERT and DELETE. Same for join queries. It is currently
 impossible to hint a certain join algorithm for a specific join node in a
 multiple join query. However, for subqueries, the inner query can have its
 own hints, independent of the outer query, like SELECT /*+ INDEX(t idx1)*/
 col1, col2 FROM t WHERE col3 IN (SELECT /*+ NO_INDEX*/ id FROM r WHERE name
 = 'x').





 Thanks,

 Maryann



 On Mon, Mar 9, 2015 at 7:26 AM, Matthew Johnson matt.john...@algomi.com
 wrote:

 Hi guys,



 This is more of a general question than a problem – but I’m just wondering
 if someone can clarify for me what the syntax rules are for hints in
 Phoenix. Does it matter where in the query they go? Do they always go
 something like SELECT <hint> x FROM y? Or, if the hint is for a join (e.g.
 sort-merge), does it go in the join part (SELECT x FROM y INNER JOIN
 <hint> z ON j = k)?



 Couldn’t seem to find anything specific on this in the docs, and haven’t
 worked much with database hints in general so maybe there is a convention
 that I am not aware of – apologies if it’s a stupid question!



 Cheers,

 Matt
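
For reference, a sketch of the placement Maryann describes (table and column
names here are hypothetical): the hint goes immediately after the SELECT
keyword, not in the JOIN clause.

SELECT /*+ USE_SORT_MERGE_JOIN */ t1.id, t2.val
FROM mytable1 t1
INNER JOIN mytable2 t2 ON t1.id = t2.id;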







RE: installing Phoenix using AWS-EMR

2015-03-10 Thread Brady, John
Thanks, Krishna.

If you have the scripts, that would be very helpful.

John

From: Krishna [mailto:research...@gmail.com]
Sent: Tuesday, March 10, 2015 6:56 PM
To: user@phoenix.apache.org
Subject: Re: installing Phoenix using AWS-EMR

The latest HBase version available on EMR is 0.94, so you can upgrade Phoenix
to 3.x; however, EMR's default bootstrap script doesn't do that.
Download and save the Phoenix binaries to your S3 bucket, modify the EMR
bootstrap script to install 3.x, save it to your S3 bucket, and add it as a
bootstrap action. I did something like this a while ago and will send it
across if I find the scripts.
Here is the reference: 
https://github.com/awslabs/emr-bootstrap-actions/blob/master/phoenix/install-phoenix-bootstrap.sh


On Tue, Mar 10, 2015 at 2:55 AM, Brady, John john.br...@intel.com wrote:
Hi,

I’m installing Phoenix using AWS-EMR as described at the end of this link

http://java.dzone.com/articles/apache-phoenix-sql-driver

During setup you define the bootstrap action to load Phoenix onto your AWS EMR
cluster.

I have two questions.


1.   The bootstrap action installs Phoenix 2.2.0 – how do you get it to
install the latest release?


2.   If you're using a JDBC connection, which IP address should you connect
to? Is it the master node?

Thanks

This is my code:

import java.sql.*;

public class PhoenixJDBC {

    public static void main(String args[]) {

        try {
            // Register the Phoenix JDBC driver
            Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();

            Connection conn =
                DriverManager.getConnection("jdbc:phoenix:54.xx.xx.xx", "", "");

            // Create a Statement to execute the SQL statement
            Statement stmt = conn.createStatement();

            // Execute the SQL statement and get the results in a ResultSet
            ResultSet rs = stmt.executeQuery("select * from US_POPULATION");

            // Iterate through the ResultSet, printing the host column
            // for each row using the getString method
            while (rs.next())
                System.out.println("Name= " + rs.getString("host"));
        }
        catch (SQLException e) {
            e.printStackTrace();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}





Re: CSV bulk loading question

2015-03-10 Thread James Taylor
Hi Noam,
We're tuning CSV bulk load in PHOENIX-1711, but it won't get you a 7x
speedup (maybe 30% at the most if we're lucky). The other thing you'd
lose by writing all values into one column is incremental update speed,
which may or may not matter for your use case. To update a single
value, you'd need to read the existing value, stitch in the new value,
and write it back out again.

If your use case doesn't require incremental updates, you can write
them all into a single column and still query and access them
individually. The easiest way I can think of to do this would be to
store them in a protobuf and serialize it into a VARBINARY column.
Then create a built-in function that allows accessing them
positionally or by name. Once you have that, you could even add
functional indexes over individual fields (obviously with the overhead
that indexes add). PHOENIX-477 is about formalizing this as SQL
structs, which this could become if taken far enough. Even without
this, just having a set of built-in functions that work off of a
protobuf would be a useful first step and a great contribution.

Thanks,
James
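
A rough sketch of that idea (schema names are hypothetical, and PB_GET stands
in for the kind of built-in James describes; no such function exists in
Phoenix yet):

CREATE TABLE data_packed (
    pk      VARCHAR PRIMARY KEY,
    payload VARBINARY  -- serialized protobuf holding the non-key fields
);

-- With a hypothetical built-in PB_GET(payload, 'field'), individual fields
-- could be queried, and even functionally indexed:
--   SELECT pk FROM data_packed WHERE PB_GET(payload, 'field3') = 'x';
--   CREATE INDEX idx_f3 ON data_packed ( PB_GET(payload, 'field3') );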


On Mon, Mar 9, 2015 at 11:03 PM, Bulvik, Noam noam.bul...@teoco.com wrote:
 Hi,



 We are using the CSV bulk loading (MapReduce) tool to load our data. We have
 a table with 50 columns, and we did some testing to understand the factors
 affecting loading performance.

 We compared two cases

 A - each column in the data becomes a column in the HBase table

 B – all non-key columns are concatenated into one column in the HBase table

 We saw that the second option was 7 times faster than the first one and
 consumed less CPU.



 Does this make sense? Can we do something to tune the system so that option
 A runs faster? (We prefer option A because it lets us query and filter on
 all data columns.)





 Regards,



 Noam Bulvik




 



RE: CSV bulk loading question

2015-03-10 Thread Bulvik, Noam
Thanks, James.
Any explanation as to why we see this difference? After all, we are writing
the same data in both cases.

-----Original Message-----
From: James Taylor [mailto:jamestay...@apache.org]
Sent: Tuesday, March 10, 2015 11:08 AM
To: user
Subject: Re: CSV bulk loading question

Hi Noam,
We're tuning CSV bulk load in PHOENIX-1711, but it won't get you a 7x speedup
(maybe 30% at the most if we're lucky). The other thing you'd lose by writing
all values into one column is incremental update speed, which may or may not
matter for your use case. To update a single value, you'd need to read the
existing value, stitch in the new value, and write it back out again.

If your use case doesn't require incremental updates, you can write them all
into a single column and still query and access them individually. The easiest
way I can think of to do this would be to store them in a protobuf and
serialize it into a VARBINARY column.
Then create a built-in function that allows accessing them positionally or by
name. Once you have that, you could even add functional indexes over individual
fields (obviously with the overhead that indexes add). PHOENIX-477 is about
formalizing this as SQL structs, which this could become if taken far enough.
Even without this, just having a set of built-in functions that work off of a
protobuf would be a useful first step and a great contribution.

Thanks,
James


On Mon, Mar 9, 2015 at 11:03 PM, Bulvik, Noam noam.bul...@teoco.com wrote:
 Hi,



 We are using the CSV bulk loading (MapReduce) tool to load our data. We
 have a table with 50 columns, and we did some testing to understand the
 factors affecting loading performance.

 We compared two cases

 A - each column in the data becomes a column in the HBase table

 B – all non-key columns are concatenated into one column in the HBase
 table

 We saw that the second option was 7 times faster than the first one and
 consumed less CPU.

 Does this make sense? Can we do something to tune the system so that
 option A runs faster? (We prefer option A because it lets us query and
 filter over all data columns.)





 Regards,



 Noam Bulvik




 



