Hi Constantin, This in useful info. Just to clarify slowdown that you see after update statistics case, was the *update statistics* executed after initial data load or after you upserted it again?
2. How many regions/region servers (RS) did the data end up on? 3. 30/60 seconds for count(*) seems really high. Do you see lots of disk I/O during your count query? How wide are your rows and how much memory is available on your RS/HBase heap? 3. Can you also send output of *explain select count(*) from tablex* for this case? Thanks, Mujtaba On Fri, Feb 13, 2015 at 12:34 AM, Ciureanu, Constantin (GfK) < constantin.ciure...@gfk.com> wrote: > Hello Mujtaba, > > > > Don’t worry – it was just the *select count(*) from tableX* that was > slowed down in a more than visible way. > > I presume all the regular queries do actually benefit from using the STATS. > > > > Some other cases where I saw slowdown for “*select count(*) from tableX*”: > > - First time after loading 6 M records – the time to obtain the > count was ~30 sec > > - After loading the *same* 6 M records again – the time almost > doubled L I imagine the data is doubled, not yet compacted in HBase > > - After deleting the 6M rows (delete from …. , not truncate) and > loading the 6M rows again – the same double time – same comment as above > > - After update statistics tableX – the time was around 2x the > original time (~60 seconds) – this I couldn’t really explain (perhaps the > fleet I use is undersized) > > > > I need to mention that I’m using 4.2.2 but I can’t wait for 4.3 to be > released (as it will fix some issues I have. Eg. one with SKIP SCAN: > > [1st part of PK between A and B] or [first part of PK between C and D] or > [….] was understood as a full table scan = painfully slow -> but this > worked today after I used a hint in the SELECT /*+ SKIP_SCAN */ which > shouldn’t be mandatory in my opinion). > > > > Regards, > > Constantin > > > > *From:* Mujtaba Chohan [mailto:mujt...@apache.org] > *Sent:* Thursday, February 12, 2015 9:20 PM > > *To:* user@phoenix.apache.org > *Subject:* Re: Update statistics made query 2-3x slower > > > > Constantin - If possible can you please share your schema, approx. > row/columns width, number of region servers in your cluster plus their heap > size, HBase/Phoenix version and any default property overrides so we can > identify why stats are slowing things down in your case. > > > > Thanks, > > Mujtaba > > > > On Thu, Feb 12, 2015 at 12:56 AM, Ciureanu, Constantin (GfK) < > constantin.ciure...@gfk.com> wrote: > > It worked! > > Without stats it’s again faster (2-3x times) – but I do understand that > all other normal queries might benefit from the stats. > > > > Thank you Mujtaba for the info, > > Thank you Vasudevan for the explanations, I already used HBase and I agree > it’s hard to have a counter for the table rows (especially if the > tombstones for deleted rows are still there – ie. not compacted yet). > > > > Constantin > > > > > > > > *From:* Mujtaba Chohan [mailto:mujt...@apache.org] > *Sent:* Wednesday, February 11, 2015 8:54 PM > *To:* user@phoenix.apache.org > *Subject:* Re: Update statistics made query 2-3x slower > > > > To compare performance without stats, try deleting related rows from > SYSTEM.STATS or an easier way, just truncate SYSTEM.STATS table from HBase > shell and restart your region servers. > > //mujtaba > > > > On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S < > ramkrishna.s.vasude...@intel.com> wrote: > > Hi Constantin > > > > Before I could explain on the slowness part let me answer your 2nd > question, > > > > Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data > that is residing inside logical entities called regions are spread across > different nodes (region servers). There is nothing like a table that is in > one location where you can keep updating the count of rows that is getting > inserted. > > > > Which means that when you need count(*) you may have to aggregate the > count from every region distributed across region servers. So in other > words a table is not a single entity it is a collection of regions. > > > > Coming to your slowness in query, the update statistics query allows you > to parallelize the query into logical chunks on a single region. Suppose > there are 100K rows in a region the statistics collected would allow you to > run a query parallely for eg say execute parallely on 10 equal chunks of > 10000 rows within that region. > > > > Have you modified any of the parameters related to statistics like this > one ‘phoenix.stats.guidepost.width’. > > > > > > Regards > > Ram > > *From:* Ciureanu, Constantin (GfK) [mailto:constantin.ciure...@gfk.com] > *Sent:* Wednesday, February 11, 2015 2:51 PM > *To:* user@phoenix.apache.org > *Subject:* Update statistics made query 2-3x slower > > > > Hello all, > > > > 1. Is there a good explanation why updating the statistics: > > *update statistics tableX;* > > > > made this query 2x times slower? (it was 27 seconds before, now it’s > somewhere between 60 – 90 seconds) > > *select count(*) from tableX;* > > +------------------------------------------+ > > | COUNT(1) | > > +------------------------------------------+ > > | 5786227 | > > +------------------------------------------+ > > 1 row selected (62.718 seconds) > > > > (If possible J ) how can I “drop” those statistics? > > > > 2. Why there is nothing (like a counter / attribute for the table) to > obtain the number of rows in one table fast? > > > > Thank you, > > Constantin > > > > >