CSV bulk loading question
Hi,

We are using the CSV bulk loader (MapReduce) to load our data into a table with 50 columns, and we did some testing to understand the factors affecting load performance. We compared two cases:

A - each column in the data becomes a column in the HBase table
B - all non-key columns are packed into a single column in the HBase table

We saw that the second option was 7 times faster than the first one and consumed less CPU. Does this make sense? Can we do something to tune the system so option A runs faster? We prefer it that way because it enables us to query and filter over all data columns.

Regards,
Noam Bulvik
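For reference, the two layouts being compared might look like the following Phoenix DDL (a minimal sketch; table and column names are illustrative, not from the original post):

    -- Option A: one table column per data column (50 columns total)
    CREATE TABLE load_test_a (
        pk   VARCHAR PRIMARY KEY,
        col1 VARCHAR,
        col2 VARCHAR
        -- ... and so on through col49
    );

    -- Option B: all non-key values packed into a single column
    CREATE TABLE load_test_b (
        pk       VARCHAR PRIMARY KEY,
        all_cols VARCHAR
    );

Option A writes one HBase cell (KeyValue) per data column per row, while option B writes a single cell per row.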
Re: Re: Phoenix table scan performance
No salting? That may be the cause of the issue. Note that a hot-region problem can occur and degrade performance a lot.

Thanks,
Sun.
CertusNet

From: Yohan Bismuth
Date: 2015-03-10 17:36
To: user
Subject: Re: Phoenix table scan performance

Hi, thanks for your answer. We're not using DATA_BLOCK_ENCODING, and not salting either. After checking, I see that we currently have 20 to 36 regions per regionserver. This may not be perfect balancing, but I'm not sure this is the root of all issues...
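For reference, salting is declared when the table is created; a minimal sketch (table and column names are illustrative):

    -- Distributes writes across 16 pre-split regions by prefixing each
    -- row key with a one-byte hash, avoiding a hot region on
    -- monotonically increasing keys.
    CREATE TABLE metrics (
        ts BIGINT NOT NULL,
        id VARCHAR,
        CONSTRAINT pk PRIMARY KEY (ts, id)
    ) SALT_BUCKETS = 16;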
RE: installing Phoenix using AWS-EMR
The instructions are on the Phoenix site also - the main question is how to install the latest Phoenix release: http://phoenix.apache.org/phoenix_on_emr.html
Re: Re: Phoenix table scan performance
I'm not sure about this. I tested ranged scans over regionservers with almost no activity, and I had the same issue.

On Tue, Mar 10, 2015 at 10:47 AM, Fulin Sun su...@certusnet.com.cn wrote:
No salting? That may be the cause of the issue. Note that a hot-region problem can occur and degrade performance a lot.
Thanks,
Sun.
CertusNet
RE: installing Phoenix using AWS-EMR
Hi John,

I haven't used the EMR bootstrap at all, so I'm not sure this is helpful, but generally to upgrade the version of Phoenix you just remove the phoenix-x.x.x-server.jar from the HBase lib folder, replace it with the version that you want, and restart HBase. The big caveat to this, though, is that the versions of HBase and Phoenix need to be compatible, so if the bootstrap is installing an older version of HBase, it may not be possible to use a newer version of Phoenix with it - you will have to check the Phoenix release notes to ensure the versions are compatible.

The JDBC URL should point to your zookeeper cluster, as a comma-separated list. So if you have zookeeper running on a 3-server cluster, you would use something like:

    jdbc:phoenix:zkmachine1,zkmachine2,zkmachine3

Hope that helps.

Cheers,
Matt
installing Phoenix using AWS-EMR
Hi,

I'm installing Phoenix using AWS-EMR as described at the end of this link: http://java.dzone.com/articles/apache-phoenix-sql-driver

During setup you define the bootstrap action to load Phoenix onto your AWS EMR cluster. I have two questions:

1. The bootstrap action installs Phoenix 2.2.0 - how do you get it to install the latest release?
2. If you're using a JDBC connection, which IP address should you connect to? Is it the master node?

Thanks. This is my code:

    import java.sql.*;

    public class PhoenixJDBC {
        public static void main(String args[]) {
            try {
                // Register the Phoenix JDBC driver
                Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance();
                Connection conn = DriverManager.getConnection("jdbc:phoenix:54.xx.xx.xx", "", "");
                // Create a Statement to execute the SQL statement
                Statement stmt = conn.createStatement();
                // Execute the SQL statement and get the results in a ResultSet
                ResultSet rs = stmt.executeQuery("select * from US_POPULATION");
                // Iterate through the ResultSet, displaying a value
                // for each row using the getString method
                while (rs.next())
                    System.out.println("Name= " + rs.getString("host"));
            } catch (SQLException e) {
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
RE: Creating multiple views off existent HBase table.
Hi Sergey,

I was following your post as I wanted to do multiple views, and the following is my experience.

Does not work (Undefined column family. familyName=ONBOARDING.null):

    create view Persistance_v8 (
        rowid VARCHAR PRIMARY KEY,
        onboarding.dealer_landscape_viewed boolean,
        preferences.dealer_ids_order varchar
    ) AS SELECT * FROM persistance WHERE dealer_landscape_viewed IS NULL

Does work (returns expected results when queried):

    create view Persistance_v8 (
        rowid VARCHAR PRIMARY KEY,
        onboarding.dealer_landscape_viewed boolean,
        preferences.dealer_ids_order varchar
    ) AS SELECT * FROM persistance WHERE onboarding.dealer_landscape_viewed IS NULL

So it seems that for me, quoting and including the column family in the WHERE clause of the view's SELECT is what fixes it (without quoting and including the CF, Phoenix uppercases the CF and then complains about CF.null). However, I see in your JIRA that you already have the CF quoted and included, so I'm not sure why it works for me but not for you.

Cheers,
Matt

From: Sergey Belousov [mailto:sergey.belou...@gmail.com]
Sent: 09 March 2015 04:12
To: user@phoenix.apache.org
Subject: Re: Creating multiple views off existent HBase table.

Done: https://issues.apache.org/jira/browse/PHOENIX-1714 - let me know if it is unclear and more clarification is needed.

On Mar 8, 2015 11:47 PM, James Taylor jamestay...@apache.org wrote:
If you could break it down a bit, that'd be helpful, Sergey. It's a known limitation that we don't support aggregation in views. The other stuff should work, though. File one JIRA per issue you find. Thanks, James

On Sun, Mar 8, 2015 at 6:58 PM, Sergey Belousov sergey.belou...@gmail.com wrote:
Hi James, would you be able to shed some light on the issues with the WHERE part of CREATE VIEW ... AS that I described in my prior message? I am kind of stuck here :(. Thank you.

On Fri, Mar 6, 2015 at 7:09 PM, Sergey Belousov sergey.belou...@gmail.com wrote:
And in addition, for CREATE VIEW ... AS SELECT ..., regarding your example

    create view v2(a VARCHAR PRIMARY KEY, f1.c INTEGER) as select * from t1;

it seems that the AS SELECT part is very limited. It looks like I cannot specify columns or any WHERE clause there, which rather limits the purpose of the VIEW. For example, I want to create a VIEW for each CQ of an existing HBase table column family.

In HBase shell:

    create 't1', 'f1'
    put 't1','r1','f1:b','r1f1b'
    put 't1','r1','f1:c','r1f1c'
    put 't1','r2','f1:c','r2f1c'
    put 't1','r3','f1:d','r3f1d'

    hbase(main):041:0> scan 't1'
    ROW    COLUMN+CELL
    r1     column=f1:b, timestamp=1425686025521, value=r1f1b
    r1     column=f1:c, timestamp=1425686025545, value=r1f1c
    r2     column=f1:c, timestamp=1425686025565, value=r2f1c
    r3     column=f1:d, timestamp=1425686025581, value=r3f1d

In Phoenix sqlline:

    create view vB(a VARCHAR PRIMARY KEY, f1.b VARCHAR) as select * from t1;
    create view vC(a VARCHAR PRIMARY KEY, f1.c VARCHAR) as select * from t1;
    create view vD(a VARCHAR PRIMARY KEY, f1.d VARCHAR) as select * from t1;

So basically when I do "select * from vB;" I do not want to see nulls, but I can only achieve that with "SELECT * FROM vB WHERE b IS NOT NULL;". Even though this does not error:

    create view vB(a VARCHAR PRIMARY KEY, f1.b VARCHAR) as select * from t1 where f1.b is not null;

here is the result I am getting:

    select * from vB;
    Error: ERROR 1001 (42I01): Undefined column family. familyName=F1.null (state=42I01,code=1001)
    select a,b from vB;
    Error: ERROR 1001 (42I01): Undefined column family. familyName=F1.null (state=42I01,code=1001)

Any ideas how to do it, or why the AS SELECT part is not the same SELECT you can do when you query the VIEW? Thank you.

On Thu, Mar 5, 2015 at 10:36 PM, Sergey Belousov sergey.belou...@gmail.com wrote:
Thank you James. This one works for me too, but my case is a little bit more complex: my key is 4byte 4byte 4byte 8byte array[0..N of 8-byte members]. I managed to project the 4byte 4byte 4byte 8byte part properly, but the array[0..N of 8-byte members] part keeps fighting me :) Is it possible to have an array of \x00\x00\x00\x01\x00\x00\x00\x01 elements, and what is the proper way to surface it in the view and be able to have a WHERE on it with a regex on bytes?

On Thu, Mar 5, 2015 at 3:44 PM, James Taylor jamestay...@apache.org wrote:
This worked fine for me. In HBase shell:

    create 't1', {NAME => 'f1'}

In Phoenix sqlline:

    create view v1(a VARCHAR PRIMARY KEY, f1.b INTEGER) as select * from t1;
    create view v2(a VARCHAR PRIMARY KEY, f1.c INTEGER) as select * from t1;

If your table name and column family name are not upper case, make sure you surround them with double quotes when referencing them in your statement. You can also declare a default column family like this instead: create view
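Condensing the thread, a sketch of the pattern that worked in Matt's test, applied to the t1/f1 example (identifiers quoted per James's note; Sergey reports a similar statement still failing for him, so behavior may vary by version):

    -- Lower-case HBase names are double-quoted so Phoenix does not
    -- upper-case them; the filtered column is qualified with its
    -- column family in the view's WHERE clause, as in Matt's working
    -- example above.
    create view vB (a VARCHAR PRIMARY KEY, "f1".b VARCHAR)
        as select * from "t1" where "f1".b is not null;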
Re: Using Hints in Phoenix
Hi Matt,

Which version of Phoenix are you using? Sort-merge join is only available in Phoenix 4.3.

Thanks,
Maryann

On Tue, Mar 10, 2015 at 6:11 AM, Matthew Johnson matt.john...@algomi.com wrote:
Hi Maryann,

Thanks for clarifying that for me. I've been playing with the hint USE_SORT_MERGE_JOIN to try and solve an issue when joining two very large tables:

    Error: Encountered exception in sub plan [0] execution.
    SQLState: null
    ErrorCode: 0

which I believe is related to a lack of memory for building the hash table. I thought that using SELECT /*+ USE_SORT_MERGE_JOIN */ would make the join much slower but would work on an unlimited data set - did I misunderstand? The explain plan does not change with or without this hint:

    CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER mytable1
        SERVER AGGREGATE INTO SINGLE ROW
        PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)
            CLIENT 15-CHUNK PARALLEL 1-WAY FULL SCAN OVER mytable2

Cheers,
Matt

From: Maryann Xue [mailto:maryann@gmail.com]
Sent: 09 March 2015 15:00
To: user@phoenix.apache.org
Subject: Re: Using Hints in Phoenix

Hi Matt,

So far in Phoenix, hints are only supported as specified right after the keywords SELECT, UPSERT and DELETE. The same goes for join queries: it is currently impossible to hint a certain join algorithm for a specific join node in a multiple-join query. However, for subqueries, the inner query can have its own hints, independent of the outer query, like:

    SELECT /*+ INDEX(t idx1) */ col1, col2 FROM t
    WHERE col3 IN (SELECT /*+ NO_INDEX */ id FROM r WHERE name = 'x')

Thanks,
Maryann

On Mon, Mar 9, 2015 at 7:26 AM, Matthew Johnson matt.john...@algomi.com wrote:
Hi guys,

This is more of a general question than a problem - but I'm just wondering if someone can clarify for me what the syntax rules are for hints in Phoenix. Does it matter where in the query they go? Do they always go something like SELECT <hint> x FROM y? Or, if the hint is for a join (e.g. sort-merge), does it go in the join part (SELECT x FROM y INNER JOIN <hint> z ON j = k)? I couldn't seem to find anything specific on this in the docs, and I haven't worked much with database hints in general, so maybe there is a convention that I am not aware of - apologies if it's a stupid question!

Cheers,
Matt
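Per Maryann's rule, the sort-merge hint belongs immediately after SELECT, never next to the JOIN; a minimal sketch against Matt's mytable1/mytable2 example (the join columns here are assumed for illustration):

    -- Requires Phoenix 4.3+; a hint placed anywhere other than
    -- directly after SELECT is ignored, which would also leave the
    -- explain plan unchanged.
    SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
    FROM mytable1 t1
    INNER JOIN mytable2 t2 ON t1.id = t2.id;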
RE: installing Phoenix using AWS-EMR
Thanks Krishna. If you have the scripts that would be very helpful.

John

From: Krishna [mailto:research...@gmail.com]
Sent: Tuesday, March 10, 2015 6:56 PM
To: user@phoenix.apache.org
Subject: Re: installing Phoenix using AWS-EMR

The latest HBase version available on EMR is 0.94, so you can upgrade Phoenix to 3.x; however, EMR's default bootstrap script doesn't do that. Download and save the Phoenix binaries to your S3 bucket, modify the EMR bootstrap script to install 3.x, save it to your S3 bucket, and add it as a bootstrap action. I did something like this a while ago and will send it across if I find the scripts. Here is the reference: https://github.com/awslabs/emr-bootstrap-actions/blob/master/phoenix/install-phoenix-bootstrap.sh
Re: CSV bulk loading question
Hi Noam,

We're tuning CSV bulk load in PHOENIX-1711, but it won't get you a 7x speedup (maybe 30% at the most if we're lucky).

The other thing you'd lose by writing all values into one column is incremental update speed, which may or may not matter for your use case. To update a single value, you'd need to read the existing value, stitch in the new value, and write it back out again.

If your use case doesn't require incremental updates, you can write them all in a single column and still query and access them individually. The easiest way I can think of doing this would be to store them in a protobuf and serialize it into a VARBINARY column, then create a built-in function that allows accessing them positionally or by name. Once you have that, you could even add functional indexes over individual fields (obviously with the overhead that indexes add). PHOENIX-477 is about formalizing this as SQL structs, which this could become if taken far enough. Even without that, just having a set of built-in functions that work off of a protobuf would be a useful first step and a great contribution.

Thanks,
James
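A rough sketch of the layout James describes; PB_GET is a hypothetical name used only for illustration - no such built-in exists in Phoenix, and it is exactly the kind of function James is suggesting someone contribute:

    -- Single-column layout: all non-key fields live in one serialized
    -- protobuf, so each row is one HBase cell instead of ~50.
    CREATE TABLE packed_events (
        pk      VARCHAR PRIMARY KEY,
        payload VARBINARY   -- serialized protobuf of the non-key fields
    );

    -- PB_GET (hypothetical) would deserialize the protobuf and return
    -- one field by name or position.
    SELECT pk FROM packed_events WHERE PB_GET(payload, 'city') = 'London';

    -- Given such a function, a functional index over a single field
    -- becomes possible, with the usual index-maintenance overhead.
    CREATE INDEX idx_city ON packed_events ( PB_GET(payload, 'city') );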
RE: CSV bulk loading question
Thanks James. Any explanation for why we see this difference? After all, we are writing the same data in both cases.