Hi Constantin,

This in useful info. Just to clarify slowdown that you see after update
statistics case, was the *update statistics* executed after initial data
load or after you upserted it again?

2. How many regions/region servers (RS) did the data end up on?

3. 30/60 seconds for count(*) seems really high. Do you see lots of disk
I/O during your count query? How wide are your rows and how much memory is
available on your RS/HBase heap?

3. Can you also send output of *explain select count(*) from tablex* for
this case?

Thanks,
Mujtaba

On Fri, Feb 13, 2015 at 12:34 AM, Ciureanu, Constantin (GfK) <
constantin.ciure...@gfk.com> wrote:

> Hello Mujtaba,
>
>
>
> Don’t worry – it was just the *select count(*) from tableX* that was
> slowed down in a more than visible way.
>
> I presume all the regular queries do actually benefit from using the STATS.
>
>
>
> Some other cases where I saw slowdown for “*select count(*) from tableX*”:
>
> -       First time after loading 6 M records – the time to obtain the
> count was ~30 sec
>
> -       After loading the *same* 6 M records again – the time almost
> doubled L I imagine the data is doubled, not yet compacted in HBase
>
> -       After deleting the 6M rows (delete from …. , not truncate) and
> loading the 6M rows again – the same double time – same comment as above
>
> -       After update statistics tableX – the time was around 2x the
> original time (~60 seconds) – this I couldn’t really explain (perhaps the
> fleet I use is undersized)
>
>
>
> I need to mention that I’m using 4.2.2 but I can’t wait for 4.3 to be
> released (as it will fix some issues I have. Eg. one with SKIP SCAN:
>
> [1st part of PK between A and B] or [first part of PK between C and D] or
> [….] was understood as a full table scan = painfully slow -> but this
> worked today after I used a hint in the SELECT /*+ SKIP_SCAN */ which
> shouldn’t be mandatory in my opinion).
>
>
>
> Regards,
>
>   Constantin
>
>
>
> *From:* Mujtaba Chohan [mailto:mujt...@apache.org]
> *Sent:* Thursday, February 12, 2015 9:20 PM
>
> *To:* user@phoenix.apache.org
> *Subject:* Re: Update statistics made query 2-3x slower
>
>
>
> Constantin - If possible can you please share your schema, approx.
> row/columns width, number of region servers in your cluster plus their heap
> size, HBase/Phoenix version and any default property overrides so we can
> identify why stats are slowing things down in your case.
>
>
>
> Thanks,
>
> Mujtaba
>
>
>
> On Thu, Feb 12, 2015 at 12:56 AM, Ciureanu, Constantin (GfK) <
> constantin.ciure...@gfk.com> wrote:
>
> It worked!
>
> Without stats it’s again faster (2-3x times) – but I do understand that
> all other normal queries might benefit from the stats.
>
>
>
> Thank you Mujtaba for the info,
>
> Thank you Vasudevan for the explanations, I already used HBase and I agree
> it’s hard to have a counter for the table rows (especially if the
> tombstones for deleted rows are still there – ie. not compacted yet).
>
>
>
> Constantin
>
>
>
>
>
>
>
> *From:* Mujtaba Chohan [mailto:mujt...@apache.org]
> *Sent:* Wednesday, February 11, 2015 8:54 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Update statistics made query 2-3x slower
>
>
>
> To compare performance without stats, try deleting related rows from
> SYSTEM.STATS or an easier way, just truncate SYSTEM.STATS table from HBase
> shell and restart your region servers.
>
> //mujtaba
>
>
>
> On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S <
> ramkrishna.s.vasude...@intel.com> wrote:
>
> Hi Constantin
>
>
>
> Before I could explain on the slowness part let me answer your 2nd
> question,
>
>
>
> Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data
> that is residing inside logical entities called regions are spread across
> different nodes (region servers).  There is nothing like a table that is in
> one location where you can keep updating the count of rows that is getting
> inserted.
>
>
>
> Which means that when you need  count(*) you may have to aggregate the
> count from every region distributed across region servers. So in other
> words a table is not a single entity it is a collection of regions.
>
>
>
> Coming to your slowness in query, the update statistics query allows you
> to parallelize the query into logical chunks on a single region.  Suppose
> there are 100K rows in a region the statistics collected would allow you to
> run a query parallely for eg say execute parallely on 10 equal chunks of
> 10000 rows within that region.
>
>
>
> Have you modified any of the parameters related to statistics like this
> one ‘phoenix.stats.guidepost.width’.
>
>
>
>
>
> Regards
>
> Ram
>
> *From:* Ciureanu, Constantin (GfK) [mailto:constantin.ciure...@gfk.com]
> *Sent:* Wednesday, February 11, 2015 2:51 PM
> *To:* user@phoenix.apache.org
> *Subject:* Update statistics made query 2-3x slower
>
>
>
> Hello all,
>
>
>
> 1.     Is there a good explanation why updating the statistics:
>
> *update statistics tableX;*
>
>
>
> made this query 2x times slower?   (it was 27 seconds before, now it’s
> somewhere between 60 – 90 seconds)
>
> *select count(*) from tableX;*
>
> +------------------------------------------+
>
> |                 COUNT(1)                 |
>
> +------------------------------------------+
>
> | 5786227                                  |
>
> +------------------------------------------+
>
> 1 row selected (62.718 seconds)
>
>
>
> (If possible J ) how can I “drop” those statistics?
>
>
>
> 2. Why there is nothing (like a counter / attribute for the table) to
> obtain the number of rows in one table fast?
>
>
>
> Thank you,
>
>    Constantin
>
>
>
>
>

Reply via email to