Re: Slow query help
Thanks everybody for the help. I'm just curios to understand why the first query didn't complete. The query is quite complex but the available memory should be more than enough. MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while uncompressed (in memory during the download of the data) about 600 GB. There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596 distinct values (the table field is an unlimited VARCHAR but all values, if any, has a max length of 20 chars). HBase has 3 region servers with 64 GB each. I've tried to give up to 32 GB of memory but the "sloppy" query wasn't able to finish. I thought that the count of distinct values on a single field shouldn't be that heavy in the end (despite the source table is very big). Am I wrong? Any help is welcome. Flavio On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jainwrote: > A less resource intensive approach would be to use approx count distinct - > https://phoenix.apache.org/language/functions.html#approx_count_distinct > > You would still need the secondary index though, as James suggested, if > you want it to run fast. > > On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier > wrote: > >> Thanks for the tip James. I didn't know that syntax for doing the count >> on a distinct value! >> This version is able to end, the first one wasn't able to finish even >> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is >> very big indeed). >> >> Thanks a lot, >> Flavio >> >> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor >> wrote: >> >>> Hi Flavio, >>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) >>> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT >>> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL. >>> >>> Otherwise, you'll end up doing a full table scan (and use a fair amount >>> of memory depending on the cardinality of SOMEFIELD). The above with a >>> secondary index would skip to the distinct values instead. >>> >>> Thanks, >>> James >>> >>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier < >>> pomperma...@okkam.it> wrote: >>> Hi to all, I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): SELECT COUNT(*) FROM ( SELECT DISTINCT(SOMEFIELD) FROM TEST.MYTABLE WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL ) Unfortunately the query timeouts (timeout is 10 min) Any suggestion about how to tune my installation? Best, Flavi >>> >>
Re: Slow query help
A less resource intensive approach would be to use approx count distinct - https://phoenix.apache.org/language/functions.html#approx_count_distinct You would still need the secondary index though, as James suggested, if you want it to run fast. On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaierwrote: > Thanks for the tip James. I didn't know that syntax for doing the count on > a distinct value! > This version is able to end, the first one wasn't able to finish even > giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is > very big indeed). > > Thanks a lot, > Flavio > > On Fri, Mar 16, 2018 at 5:45 PM, James Taylor > wrote: > >> Hi Flavio, >> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) >> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT >> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL. >> >> Otherwise, you'll end up doing a full table scan (and use a fair amount >> of memory depending on the cardinality of SOMEFIELD). The above with a >> secondary index would skip to the distinct values instead. >> >> Thanks, >> James >> >> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier > > wrote: >> >>> Hi to all, >>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): >>> >>> SELECT COUNT(*) FROM ( >>> SELECT DISTINCT(SOMEFIELD) >>> FROM TEST.MYTABLE >>> WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL >>> ) >>> >>> Unfortunately the query timeouts (timeout is 10 min) Any suggestion >>> about how to tune my installation? >>> >>> Best, >>> Flavi >>> >> >
Re: Slow query help
Thanks for the tip James. I didn't know that syntax for doing the count on a distinct value! This version is able to end, the first one wasn't able to finish even giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is very big indeed). Thanks a lot, Flavio On Fri, Mar 16, 2018 at 5:45 PM, James Taylorwrote: > Hi Flavio, > You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) > to speed that up. You can write it more simply as SELECT COUNT(DISTINCT > SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL. > > Otherwise, you'll end up doing a full table scan (and use a fair amount of > memory depending on the cardinality of SOMEFIELD). The above with a > secondary index would skip to the distinct values instead. > > Thanks, > James > > On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier > wrote: > >> Hi to all, >> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): >> >> SELECT COUNT(*) FROM ( >> SELECT DISTINCT(SOMEFIELD) >> FROM TEST.MYTABLE >> WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL >> ) >> >> Unfortunately the query timeouts (timeout is 10 min) Any suggestion >> about how to tune my installation? >> >> Best, >> Flavi >> >
Re: Slow query help
Hi Flavio, You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) to speed that up. You can write it more simply as SELECT COUNT(DISTINCT SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL. Otherwise, you'll end up doing a full table scan (and use a fair amount of memory depending on the cardinality of SOMEFIELD). The above with a secondary index would skip to the distinct values instead. Thanks, James On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaierwrote: > Hi to all, > I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): > > SELECT COUNT(*) FROM ( > SELECT DISTINCT(SOMEFIELD) > FROM TEST.MYTABLE > WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL > ) > > Unfortunately the query timeouts (timeout is 10 min) Any suggestion > about how to tune my installation? > > Best, > Flavi >
Slow query help
Hi to all, I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): SELECT COUNT(*) FROM ( SELECT DISTINCT(SOMEFIELD) FROM TEST.MYTABLE WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL ) Unfortunately the query timeouts (timeout is 10 min) Any suggestion about how to tune my installation? Best, Flavi