Problems building Phoenix 4.2.2 with BigTop
I have an Apache BigTop VM set up. I verified I could build Apache Phoenix 4.2.1, but the build fails after I change the version number to 4.2.2. I changed /usr/src/bigtop/bigtop.mk in two places:

# Phoenix
PHOENIX_NAME=phoenix
PHOENIX_RELNOTES_NAME=Phoenix: A SQL skin over HBase
PHOENIX_PKG_NAME=phoenix
PHOENIX_BASE_VERSION=4.2.2
PHOENIX_PKG_VERSION=4.2.2
PHOENIX_RELEASE_VERSION=1
PHOENIX_TARBALL_DST=phoenix-$(PHOENIX_BASE_VERSION)-src.tar.gz
PHOENIX_TARBALL_SRC=$(PHOENIX_TARBALL_DST)
PHOENIX_DOWNLOAD_PATH=/phoenix/phoenix-$(PHOENIX_BASE_VERSION)/src/
PHOENIX_SITE=$(APACHE_MIRROR)$(PHOENIX_DOWNLOAD_PATH)
PHOENIX_ARCHIVE=$(APACHE_ARCHIVE)$(PHOENIX_DOWNLOAD_PATH)
$(eval $(call PACKAGE,phoenix,PHOENIX))

When I build from the command line I get this error:

[INFO] Building Apache Phoenix 4.2.2
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ phoenix ---
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ phoenix ---
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ phoenix ---
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ phoenix ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] Building jar: /usr/src/bigtop/build/phoenix/rpm/BUILD/phoenix-4.2.2-src/target/phoenix-4.2.2-tests.jar
[INFO] --- maven-site-plugin:3.2:attach-descriptor (attach-descriptor) @ phoenix ---
[INFO] Building Phoenix Core 4.2.2
[WARNING] The POM for org.iq80.snappy:snappy:jar:1.0.4.1 is missing, no dependency information available
[INFO] Reactor Summary:
[INFO] Apache Phoenix ................... SUCCESS [4.173s]
[INFO] Phoenix Core ..................... FAILURE [2.799s]
[INFO] Phoenix - Flume .................. SKIPPED
[INFO] Phoenix - Pig .................... SKIPPED
[INFO] Phoenix Assembly ................. SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 7.834s
[INFO] Finished at: Mon Dec 15 14:13:38 UTC 2014
[INFO] Final Memory: 20M/289M
[ERROR] Failed to execute goal on project phoenix-core: Could not resolve dependencies for project org.apache.phoenix:phoenix-core:jar:4.2.2: Failure to find org.iq80.snappy:snappy:jar:1.0.4.1 in https://repository.apache.org/content/repositories/releases/ was cached in the local repository, resolution will not be reattempted until the update interval of apache release has elapsed or updates are forced - [Help 1]
[ERROR]

I am new to using BigTop. Is there another setting I need to configure?

-Jerry
queries getting rejected
Hi,

We resized (reduced) our cluster. Running any queries that aggregate on our datasets, I am getting the following error. After the first query, the error returns immediately, and even for an EXPLAIN. I saw a couple of other posts about this, and the recommendation is to jack up the threadPool and queue configs. Is that right?

Abe

Task org.apache.phoenix.job.JobManager$JobFutureTask@513c74aa rejected from org.apache.phoenix.job.JobManager$1@6312e0f8[Running, pool size = 128, active threads = 128, queued tasks = 500, completed tasks = 2183]
Re: Problems building Phoenix 4.2.2 with BigTop
I think I know what is going on. The Phoenix POM defines the property snappy.version and sets it to 0.3; this is the version of org.iq80.snappy we want. However, I bet the Bigtop build scaffold launches the Phoenix build with snappy.version set to 1.0.4.1. We should probably fix that with a Bigtop JIRA.

On Mon, Dec 15, 2014 at 6:16 AM, Chabot, Jerry jerry.p.cha...@hp.com wrote:
> I have an Apache BigTop VM set up. I verified I could build Apache Phoenix
> 4.2.1, but the build fails after I change the version number to 4.2.2. [...]
> [ERROR] Failed to execute goal on project phoenix-core: Could not resolve
> dependencies for project org.apache.phoenix:phoenix-core:jar:4.2.2: Failure
> to find org.iq80.snappy:snappy:jar:1.0.4.1 [...]
> I am new to using BigTop. Is there another setting I need to configure?
> -Jerry

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
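If the diagnosis above is right, the fix can be checked at the Maven level. A sketch, not a confirmed Phoenix POM excerpt: the POM is expected to pin the iq80 snappy artifact through the snappy.version property, roughly like this:

```xml
<!-- Sketch of what the Phoenix POM declares, per the diagnosis above:
     the org.iq80.snappy version the build expects. If an outer build
     (e.g. the Bigtop scaffold) overrides snappy.version to 1.0.4.1,
     dependency resolution fails as in the log. -->
<properties>
  <snappy.version>0.3</snappy.version>
</properties>
```

A quick way to test the theory is to re-pin the property when launching the build, e.g. mvn clean package -DskipTests -Dsnappy.version=0.3, since a -D property on the command line overrides one injected by a wrapping build script.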
Re: queries getting rejected
What version are you using, Abe? The important parameter to bump up if you get a rejected exception is the queue depth (phoenix.query.queueSize). In 4.2.x, the default for this went up to 5000, the reason being that we chunk up work into smaller pieces. If you're already overriding this, make sure you remove that override (or increase it substantially).

Thanks,
James

On Mon, Dec 15, 2014 at 10:08 AM, Abe Weinograd a...@flonet.com wrote:
> Hi, We resized (reduced) our cluster. Running any queries that aggregate
> on our datasets, I am getting the following error. [...]
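The advice above maps to a client-side hbase-site.xml entry. A minimal sketch, assuming the property names from this thread (phoenix.query.queueSize is the one named above; phoenix.query.threadPoolSize is the thread-pool companion from the original question; 5000 mirrors the 4.2.x default mentioned, raise it further if queries still get rejected):

```xml
<!-- Client-side hbase-site.xml fragment (sketch, values illustrative). -->
<configuration>
  <property>
    <name>phoenix.query.queueSize</name>
    <value>5000</value>
  </property>
  <property>
    <name>phoenix.query.threadPoolSize</name>
    <value>128</value>
  </property>
</configuration>
```

These are client-side settings, so the file has to be on the classpath of whatever process opens the Phoenix JDBC connection.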
Re: Query performance question
My apologies for the delay. I had to switch clusters and use a smaller dataset. The discrepancy still exists, but the numbers are a little different. I ran the same queries as in the original email (below).

Total records: 581M
Simple query based on the secondary index value used in the subselect: 1747 recs – 0.256 sec
Subselect query: 177s
Join query: 179s

The answers to your questions are below.

1. What is the Primary Key definition of your BULK_TABLE?
CONSTRAINT pkey PRIMARY KEY (file_id, recnum)

2. How many (approximately) distinct file_id values are there in the BULK_TABLE?
select count(distinct(file_id)) from BULK_TABLE - 7902 - 92s
select distinct(file_id) from BULK_TABLE - returns in 25m

3. How long does it take to run a full-scan query on BULK_TABLE, like select * from BULK_TABLE?
Results began returning after about 25 min.

4. How long does it take to run a full-scan join query on BULK_TABLE, like select * from BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum = SS.recnum?
The full-scan join fails with a MaxServerCacheSizeExceededException (server cache set to 1G).

Custom hbase/phoenix settings are attached.

Thanks,
Ralph

From: Maryann Xue maryann@gmail.com
Date: Friday, December 12, 2014 at 8:07 AM
To: user@phoenix.apache.org
Subject: Re: Query performance question

Hi Ralph,

Thanks for the question! According to the explain result you got, the optimization worked exactly as expected with this query: DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID) means a skip-scan instead of a full-scan over BULK_TABLE will be executed at runtime based on the values of file_id it got from the inner query.

So I need to know a few more things:
1. What is the Primary Key definition of your BULK_TABLE?
2. How many (approximately) distinct file_id values are there in the BULK_TABLE? (If you don't know for sure, you can just run a query to find out.)
3. How long does it take to run a full-scan query on BULK_TABLE, like select * from BULK_TABLE?
4. How long does it take to run a full-scan join query on BULK_TABLE, like select * from BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum = SS.recnum?

Thanks,
Maryann

On Thu, Dec 11, 2014 at 6:28 PM, Perko, Ralph J ralph.pe...@pnnl.gov wrote:

Hi,

Thanks for all your help thus far with Phoenix. I am trying to understand the best way to construct a query that returns all the fields from a table but still takes advantage of a single-field secondary index. I have a table with upwards of 50 fields and do not wish to index them all, but the use case exists to return them all. My general approach is to first select the records I want using an indexed field, then use the returned pk values to get the entire record in the form of a subselect or join. The initial select executes very fast, sub-second, returning close to 3000 records. When used as a subselect or join, the entire query takes very long (over 15 min) or does not return. Based on the processing plans it appears there is a lot more going on than just a simple look-up of the values returned in the subselect. Is there a way to do this using Phoenix SQL syntax? Any suggestions are appreciated.

Initial indexed query (very fast):

SELECT file_id,recnum FROM BULK_TABLE WHERE saddr IN (ip1,ip2,ip3)

(file_id and recnum make up the primary key)

Plan:
| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
| CLIENT MERGE SORT |

Used as a subselect (times out):

SELECT * FROM BULK_TABLE WHERE (file_id,recnum) IN (SELECT file_id,recnum FROM BULK_TABLE WHERE saddr IN (ip1,ip2,ip3));

Plan:
| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
| PARALLEL SEMI-JOIN TABLE 0 (SKIP MERGE) |
| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [RECNUM, FILE_ID] |
| CLIENT MERGE SORT |
| DYNAMIC SERVER FILTER BY FILE_ID IN ($1.$3) |

Another approach using a join instead (runs faster but can still take about 15 min):

SELECT * FROM BULK_TABLE JOIN (SELECT file_id, recnum FROM BULK_TABLE WHERE saddr in (ip1,ip2,ip3)) AS SS ON BULK_TABLE.file_id = SS.file_id AND BULK_TABLE.recnum = SS.recnum;

Plan:
| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
| PARALLEL INNER-JOIN TABLE 0 |
| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER
Re: Query performance question
Hi Ralph,

Thank you very much for the information! Very helpful. The numbers look reasonable, as opposed to the query plan: the only problem is that both file_id and recnum should be used in the DYNAMIC SERVER FILTER (which will turn into a skip-scan), but the plan you got only used file_id alone. So could you please attach the DDL of your table and index so that I can try reproducing the problem?

Thanks,
Maryann

On Mon, Dec 15, 2014 at 2:46 PM, Perko, Ralph J ralph.pe...@pnnl.gov wrote:
> My apologies for the delay. I had to switch clusters and use a smaller
> dataset. The discrepancy still exists, but the numbers are a little
> different. [...]
Re: Query performance question
And one more thing: the version of Phoenix you are running.

On Mon, Dec 15, 2014 at 3:21 PM, Maryann Xue maryann@gmail.com wrote:
> Hi Ralph, Thank you very much for the information! [...] So could you
> please attach the DDL of your table and index so that I can try
> reproducing the problem? Thanks, Maryann
Re: Query performance question
DDL is attached – thanks!

Ralph

From: Maryann Xue maryann@gmail.com
Date: Monday, December 15, 2014 at 12:21 PM
To: user@phoenix.apache.org
Subject: Re: Query performance question

> Hi Ralph, Thank you very much for the information! [...] So could you
> please attach the DDL of your table and index so that I can try
> reproducing the problem? Thanks, Maryann
Re: Query performance question
I am using version 4.2.2.

From: Perko, Ralph ralph.pe...@pnnl.gov
Date: Monday, December 15, 2014 at 12:37 PM
To: user@phoenix.apache.org
Subject: Re: Query performance question

> DDL is attached – thanks! Ralph [...]
Re: queries getting rejected
Hi James,

4.1 right now. We are currently using the defaults. I can try jacking that up if it makes sense. Right now, this version of DBVis doesn't have our stock hbase-site.xml on its classpath. Do we have to add it, or can we get away with setting that somewhere in the query tool? What is the best way to do that?

Thanks!
Abe

On Mon, Dec 15, 2014 at 2:15 PM, James Taylor jamestay...@apache.org wrote:
> What version are you using, Abe? The important parameter to bump up if you
> get a rejected exception is the queue depth (phoenix.query.queueSize). [...]
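The "Task ... rejected" error earlier in this thread is standard java.util.concurrent behavior, not something Phoenix-specific: a pool with a bounded queue throws RejectedExecutionException as soon as all workers are busy and the queue is full (in the error above, 128 active threads and 500 queued tasks). A JDK-only sketch with scaled-down sizes, not Phoenix's actual JobManager code:

```java
import java.util.concurrent.*;

public class RejectionDemo {
    // Submit more tasks than (workers + queue capacity) can hold and
    // report whether the overflow submit was rejected.
    static String overflowResult() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(1),   // queue capacity 1
                new ThreadPoolExecutor.AbortPolicy());  // throw on overflow
        CountDownLatch gate = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { gate.await(); } catch (InterruptedException ignored) { }
        };
        String result;
        try {
            pool.submit(blocker);  // occupies worker 1
            pool.submit(blocker);  // occupies worker 2
            pool.submit(blocker);  // fills the queue
            pool.submit(blocker);  // workers busy, queue full -> rejected
            result = "accepted";
        } catch (RejectedExecutionException e) {
            result = "rejected";   // same failure mode as the Phoenix error
        } finally {
            gate.countDown();      // unblock the workers
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(overflowResult());
    }
}
```

This is why raising the queue depth (or reducing parallelism after shrinking a cluster) makes the error go away: it changes when the submit-side overflow check trips.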
hbase 0.96.1.1-cdh5.0.1 for phoenix compatibility
Hi all,

Just want to confirm the appropriate version of Phoenix for compatibility with hbase 0.96.1.1-cdh5.0.1. Does the latest 4.2.2 release work well with that HBase version?

Thanks,
Sun.
CertusNet
Re: Re: hbase 0.96.1.1-cdh5.0.1 for phoenix compatibility
Hi James,

Thanks for your quick response. Just want to make sure that we cannot build Phoenix for HBase 0.96. However, is there any possibility that we can use Phoenix built for HBase 0.98.x and have it work with the 0.96 version? We would rather not upgrade our current CDH cluster.

Thanks,
Sun.
CertusNet

From: James Taylor
Date: 2014-12-16 14:34
To: user
CC: dev
Subject: Re: hbase 0.96.1.1-cdh5.0.1 for phoenix compatibility

HBase 0.96 is not supported by Phoenix - only HBase 0.98.1 and above. The CDH 5.1 releases package HBase 0.98, so these are ok.

On Mon, Dec 15, 2014 at 10:31 PM, su...@certusnet.com.cn wrote:
> Hi all, Just want to confirm the appropriate version of Phoenix for
> compatibility with hbase 0.96.1.1-cdh5.0.1. [...]
Re: Re: hbase 0.96.1.1-cdh5.0.1 for phoenix compatibility
See PHOENIX-848. It's not going to work to use Phoenix against cdh5.0.1. I'd recommend upgrading - HBase 0.98 has made great strides forward.

On Mon, Dec 15, 2014 at 10:48 PM, su...@certusnet.com.cn wrote:
> Hi James, Thanks for your quick response. Just want to make sure that we
> cannot build Phoenix for HBase 0.96. [...]