Any insight here..? On Fri, Mar 16, 2018 at 7:23 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> Thanks everybody for the help. > I'm just curios to understand why the first query didn't complete. The > query is quite complex but the available memory should be more than enough. > > MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while > uncompressed (in memory during the download of the data) about 600 GB. > There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596 > distinct values (the table field is an unlimited VARCHAR but all values, if > any, has a max length of 20 chars). > HBase has 3 region servers with 64 GB each. > I've tried to give up to 32 GB of memory but the "sloppy" query wasn't > able to finish. > I thought that the count of distinct values on a single field shouldn't be > that heavy in the end (despite the source table is very big). > Am I wrong? > > Any help is welcome. > Flavio > > On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jain <samarth.j...@gmail.com> > wrote: > >> A less resource intensive approach would be to use approx count distinct >> - https://phoenix.apache.org/language/functions.html#approx_ >> count_distinct >> >> You would still need the secondary index though, as James suggested, if >> you want it to run fast. >> >> On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier <pomperma...@okkam.it> >> wrote: >> >>> Thanks for the tip James. I didn't know that syntax for doing the count >>> on a distinct value! >>> This version is able to end, the first one wasn't able to finish even >>> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is >>> very big indeed). >>> >>> Thanks a lot, >>> Flavio >>> >>> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor <jamestay...@apache.org> >>> wrote: >>> >>>> Hi Flavio, >>>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + >>>> VALID) to speed that up. You can write it more simply as SELECT >>>> COUNT(DISTINCT SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS >>>> NOT NULL. >>>> >>>> Otherwise, you'll end up doing a full table scan (and use a fair amount >>>> of memory depending on the cardinality of SOMEFIELD). The above with a >>>> secondary index would skip to the distinct values instead. >>>> >>>> Thanks, >>>> James >>>> >>>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier < >>>> pomperma...@okkam.it> wrote: >>>> >>>>> Hi to all, >>>>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2): >>>>> >>>>> SELECT COUNT(*) FROM ( >>>>> SELECT DISTINCT(SOMEFIELD) >>>>> FROM TEST.MYTABLE >>>>> WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL >>>>> ) >>>>> >>>>> Unfortunately the query timeouts (timeout is 10 min).... Any >>>>> suggestion about how to tune my installation? >>>>> >>>>> Best, >>>>> Flavi >>>>> >>>> >>> >