Re: Debugging slowness in Gets

2023-12-12 Thread Lars Francke
Hi,

Quick update:
We did run with my patch for better 2.x support and got some profiles.
It does indeed look like slowness in HDFS connections.
It's still on my list to see if I can add some "slow logging" in the
relevant paths.
Happy to update this thread.

Cheers,
Lars

On Thu, Dec 7, 2023 at 9:11 AM Lars Francke  wrote:
>
> Hi Nick,
>
> yes, thank you. We have it running in containers now but privileged
> containers are not allowed.
> I created this https://github.com/apache/hbase/pull/5566 to allow the
> "itimer" thing and upgrade to async-profiler in general.
> I hope to have some results next week, happy to update this thread.
>
> I also still hope to be able to add some "slow reads" logging once I
> identify the relevant bits of code (haven't started looking yet and it
> might not be feasible).
>
> Cheers,
> Lars
>
>
> On Wed, Dec 6, 2023 at 12:25 PM Nick Dimiduk  wrote:
> >
> > Hi Lars,
> >
> > I took a look through my k8s PRs against hbase-operator-tools and see no
> > mention at all of async-profiler; I have nothing to point you towards. I
> > also no longer have access to those systems, so I'm afraid I'm not of much
> > further help.
> >
> > I do remember there was some coordination required to enable capture of
> > kernel call stacks [0], but I don't recall the details of deploying this in
> > the kubernetes environment. Further down the doc [1], there is some mention
> > about running within a container, fiddling with the security profile.
> >
> > [0]: https://github.com/async-profiler/async-profiler#basic-usage
> > [1]:
> > https://github.com/async-profiler/async-profiler#profiling-java-in-a-container
> >
> > On Wed, Dec 6, 2023 at 10:03 AM Lars Francke  wrote:
> >
> > > Thanks Nick,
> > > I'll take a look at that.
> > >
> > > I did add async-profiler to our image[1] but haven't had a chance to
> > > test it yet.
> > > Do you remember if you had to run the container with extra privileges?
> > >
> > > I just opened this issue before I saw your mail as it turns out that
> > > not all args are exposed after all :)
> > > https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this
> > > ticket)
> > >
> > > I'll continue this journey and will report back with any issues and
> > > will see if I can improve anything.
> > >
> > > [1] <
> > > https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d
> > > >
> > >
> > > On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
> > > >
> > > > For what it’s worth, we deployed async-profiler into the regionserver
> > > > container image and it all worked as expected. But it’s not a sidecar
> > > > container, it’s on the same image as the region server.
> > > >
> > > > If you can get the async profiler into your container image, installed
> > > > where the RS can find it (as described in the online book; double-check you
> > > > have a version of AP that’s compatible with your version of the profiler
> > > > servlet), you should be able to use the profiling http endpoint on the RS.
> > > > It’ll run async-profiler with the arguments you specify (read the servlet
> > > > code, all args are exposed). You can then download the flamegraph via HTTP
> > > > as well …
> > > >
> > > > Well, most of the time. I have run into issues where the file wasn’t served
> > > > correctly and I had to download it from the region server file system
> > > > (annoying to do from a container). There’s probably a closed Jira where I
> > > > scratch my head in public.
> > > >
> > > > On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:
> > > >
> > > > > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > > > > time in our very latency-sensitive production. It has no noticeable
> > > > > > > overhead in my experience and doesn't need any special dependencies.
> > > > > >
> > > > > > I have to admit, I have never used async-profiler. Shame on me.
> > > > > > That is a fabulous hint and I'll read up on it immediately.
> > > > >
> > > > > I've now read up on it, tried it locally, stumbled over
> > > > > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > > > > 2.4 fails weirdly on Java 21, only to find out (I should have read
> > > > > the whole docs earlier) that it's hard to run async-profiler in a
> > > > > container.
> > > > > For us, this is all running on Kubernetes, so we'll test that today.
> > > > >
> > > > > Testing it locally, it looked very promising.
> > > > >
> > > > > > >
> > > > > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > > > > cannot run a profiler but I _can_ run modified code.

Re: Debugging slowness in Gets

2023-12-07 Thread Lars Francke
Hi Nick,

yes, thank you. We have it running in containers now but privileged
containers are not allowed.
I created this https://github.com/apache/hbase/pull/5566 to allow the
"itimer" thing and upgrade to async-profiler in general.
I hope to have some results next week, happy to update this thread.
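
(Background on "itimer", for the archives: async-profiler's itimer event
samples via setitimer/SIGPROF instead of perf_events, so it needs no
perf_event_open and usually works in unprivileged containers, at the cost
of kernel stacks. A hedged example of a standalone run, assuming
async-profiler 2.x's profiler.sh and a placeholder region server PID:

  ./profiler.sh -e itimer -d 30 -f /tmp/regionserver-cpu.html <regionserver-pid>
)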

I also still hope to be able to add some "slow reads" logging once I
identify the relevant bits of code (haven't started looking yet and it
might not be feasible).

Cheers,
Lars


On Wed, Dec 6, 2023 at 12:25 PM Nick Dimiduk  wrote:
>
> Hi Lars,
>
> I took a look through my k8s PRs against hbase-operator-tools and see no
> mention at all of async-profiler; I have nothing to point you towards. I
> also no longer have access to those systems, so I'm afraid I'm not of much
> further help.
>
> I do remember there was some coordination required to enable capture of
> kernel call stacks [0], but I don't recall the details of deploying this in
> the kubernetes environment. Further down the doc [1], there is some mention
> about running within a container, fiddling with the security profile.
>
> [0]: https://github.com/async-profiler/async-profiler#basic-usage
> [1]:
> https://github.com/async-profiler/async-profiler#profiling-java-in-a-container
>
> On Wed, Dec 6, 2023 at 10:03 AM Lars Francke  wrote:
>
> > Thanks Nick,
> > I'll take a look at that.
> >
> > I did add async-profiler to our image[1] but haven't had a chance to
> > test it yet.
> > Do you remember if you had to run the container with extra privileges?
> >
> > I just opened this issue before I saw your mail as it turns out that
> > not all args are exposed after all :)
> > https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this
> > ticket)
> >
> > I'll continue this journey and will report back with any issues and
> > will see if I can improve anything.
> >
> > [1] <
> > https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d
> > >
> >
> > On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
> > >
> > > For what it’s worth, we deployed async-profiler into the regionserver
> > > container image and it all worked as expected. But it’s not a sidecar
> > > container, it’s on the same image as the region server.
> > >
> > > If you can get the async profiler into your container image, installed
> > > where the RS can find it (as described in the online book; double-check you
> > > have a version of AP that’s compatible with your version of the profiler
> > > servlet), you should be able to use the profiling http endpoint on the RS.
> > > It’ll run async-profiler with the arguments you specify (read the servlet
> > > code, all args are exposed). You can then download the flamegraph via HTTP
> > > as well …
> > >
> > > Well, most of the time. I have run into issues where the file wasn’t served
> > > correctly and I had to download it from the region server file system
> > > (annoying to do from a container). There’s probably a closed Jira where I
> > > scratch my head in public.
> > >
> > > On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:
> > >
> > > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > > time in our very latency-sensitive production. It has no noticeable
> > > > > overhead in my experience and doesn't need any special dependencies.
> > > > >
> > > > > I have to admit, I have never used async-profiler. Shame on me.
> > > > > That is a fabulous hint and I'll read up on it immediately.
> > > >
> > > > I've now read up on it, tried it locally, stumbled over
> > > > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > > > 2.4 fails weirdly on Java 21, only to find out (I should have read
> > > > the whole docs earlier) that it's hard to run async-profiler in a
> > > > container.
> > > > For us, this is all running on Kubernetes, so we'll test that today.
> > > >
> > > > Testing it locally, it looked very promising.
> > > >
> > > > > >
> > > > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > > > cannot run a profiler but I _can_ run modified code.
> > > > > > >
> > > > > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > > > > long certain operations take at various points in the code (e.g. [2]).
> > > > > > > I've been doing this a few rounds and have now arrived at the
> > > > > > > StoreScanner.
> > > > > > >
> > > > > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > > > > I am an HBase committer but I haven't been able to keep up with the
> > > > > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > > > > workings anymore and would appreciate a hint.

Re: Debugging slowness in Gets

2023-12-06 Thread Nick Dimiduk
Hi Lars,

I took a look through my k8s PRs against hbase-operator-tools and see no
mention at all of async-profiler; I have nothing to point you towards. I
also no longer have access to those systems, so I'm afraid I'm not of much
further help.

I do remember there was some coordination required to enable capture of
kernel call stacks [0], but I don't recall the details of deploying this in
the kubernetes environment. Further down the doc [1], there is some mention
about running within a container, fiddling with the security profile.

[0]: https://github.com/async-profiler/async-profiler#basic-usage
[1]:
https://github.com/async-profiler/async-profiler#profiling-java-in-a-container
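
(For reference, the host-side coordination in [0] amounts to two sysctls,
straight from the async-profiler README:

  sysctl kernel.perf_event_paranoid=1
  sysctl kernel.kptr_restrict=0

On Kubernetes, perf_events-based profiling additionally requires that the
container be allowed to call perf_event_open; where privileged containers
are not an option, as in Lars's case, the itimer event is the usual
fallback: no kernel stacks, but no special permissions either.)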

On Wed, Dec 6, 2023 at 10:03 AM Lars Francke  wrote:

> Thanks Nick,
> I'll take a look at that.
>
> I did add async-profiler to our image[1] but haven't had a chance to
> test it yet.
> Do you remember if you had to run the container with extra privileges?
>
> I just opened this issue before I saw your mail as it turns out that
> not all args are exposed after all :)
> https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this
> ticket)
>
> I'll continue this journey and will report back with any issues and
> will see if I can improve anything.
>
> [1] <
> https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d
> >
>
> On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
> >
> > For what it’s worth, we deployed async-profiler into the regionserver
> > container image and it all worked as expected. But it’s not a sidecar
> > container, it’s on the same image as the region server.
> >
> > If you can get the async profiler into your container image, installed
> > where the RS can find it (as described in the online book; double-check you
> > have a version of AP that’s compatible with your version of the profiler
> > servlet), you should be able to use the profiling http endpoint on the RS.
> > It’ll run async-profiler with the arguments you specify (read the servlet
> > code, all args are exposed). You can then download the flamegraph via HTTP
> > as well …
> >
> > Well, most of the time. I have run into issues where the file wasn’t served
> > correctly and I had to download it from the region server file system
> > (annoying to do from a container). There’s probably a closed Jira where I
> > scratch my head in public.
> >
> > On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:
> >
> > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > time in our very latency-sensitive production. It has no noticeable
> > > > overhead in my experience and doesn't need any special dependencies.
> > > >
> > > > I have to admit, I have never used async-profiler. Shame on me.
> > > > That is a fabulous hint and I'll read up on it immediately.
> > >
> > > I've now read up on it, tried it locally, stumbled over
> > > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > > 2.4 fails weirdly on Java 21, only to find out (I should have read
> > > the whole docs earlier) that it's hard to run async-profiler in a
> > > container.
> > > For us, this is all running on Kubernetes, so we'll test that today.
> > >
> > > Testing it locally, it looked very promising.
> > >
> > > > >
> > > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > > cannot run a profiler but I _can_ run modified code.
> > > > > >
> > > > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > > > long certain operations take at various points in the code (e.g. [2]).
> > > > > > I've been doing this a few rounds and have now arrived at the
> > > > > > StoreScanner.
> > > > > >
> > > > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > > > I am an HBase committer but I haven't been able to keep up with the
> > > > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > > > workings anymore and would appreciate a hint.
> > > > > >
> > > > > > I suspect it is slowness related to storage access.
> > > > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > > > access, does such a thing exist?
> > > > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > > > end? There are so many layers.
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > Cheers,
> > > > > > Lars
> > > > > >
> > > > > >
> > > > > > [1] <
> > > > > > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > > > > > >

Re: Debugging slowness in Gets

2023-12-06 Thread Lars Francke
Thanks Nick,
I'll take a look at that.

I did add async-profiler to our image[1] but haven't had a chance to
test it yet.
Do you remember if you had to run the container with extra privileges?

I just opened this issue before I saw your mail as it turns out that
not all args are exposed after all :)
https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this ticket)

I'll continue this journey and will report back with any issues and
will see if I can improve anything.

[1] <https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d>


On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
>
> For what it’s worth, we deployed async-profiler into the regionserver
> container image and it all worked as expected. But it’s not a sidecar
> container, it’s on the same image as the region server.
>
> If you can get the async profiler into your container image, installed
> where the RS can find it (as described in the online book; double-check you
> have a version of AP that’s compatible with your version of the profiler
> servlet), you should be able to use the profiling http endpoint on the RS.
> It’ll run async-profiler with the arguments you specify (read the servlet
> code, all args are exposed). You can then download the flamegraph via HTTP
> as well …
>
> Well, most of the time. I have run into issues where the file wasn’t served
> correctly and I had to download it from the region server file system
> (annoying to do from a container). There’s probably a closed Jira where I
> scratch my head in public.
>
> On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:
>
> > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > time in our very latency-sensitive production. It has no noticeable
> > > > overhead in my experience and doesn't need any special dependencies.
> > >
> > > I have to admit, I have never used async-profiler. Shame on me.
> > > That is a fabulous hint and I'll read up on it immediately.
> >
> > I've now read up on it, tried it locally, stumbled over
> > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > 2.4 fails weirdly on Java 21, only to find out (I should have read
> > the whole docs earlier) that it's hard to run async-profiler in a
> > container.
> > For us, this is all running on Kubernetes, so we'll test that today.
> >
> > Testing it locally, it looked very promising.
> >
> > > >
> > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > cannot run a profiler but I _can_ run modified code.
> > > > >
> > > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > > long certain operations take at various points in the code (e.g. [2]).
> > > > > I've been doing this a few rounds and have now arrived at the
> > > > > StoreScanner.
> > > > >
> > > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > > I am an HBase committer but I haven't been able to keep up with the
> > > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > > workings anymore and would appreciate a hint.
> > > > >
> > > > > I suspect it is slowness related to storage access.
> > > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > > access, does such a thing exist?
> > > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > > end? There are so many layers.
> > > > >
> > > > > Thank you!
> > > > >
> > > > > Cheers,
> > > > > Lars
> > > > >
> > > > >
> > > > > [1] <
> > > > >
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > > > > >
> > > > > [2] <
> > > > >
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> > > > > >
> > > > >
> >


Re: Debugging slowness in Gets

2023-12-06 Thread Nick Dimiduk
For what it’s worth, we deployed async-profiler into the regionserver
container image and it all worked as expected. But it’s not a sidecar
container, it’s on the same image as the region server.

If you can get the async profiler into your container image, installed
where the RS can find it (as described in the online book; double-check you
have a version of AP that’s compatible with your version of the profiler
servlet), you should be able to use the profiling http endpoint on the RS.
It’ll run async-profiler with the arguments you specify (read the servlet
code, all args are exposed). You can then download the flamegraph via HTTP
as well …

Well, most of the time. I have run into issues where the file wasn’t served
correctly and I had to download it from the region server file system
(annoying to do from a container). There’s probably a closed Jira where I
scratch my head in public.
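
(A hedged sketch of driving that endpoint, assuming the default info port
16030 and the stock ProfileServlet; parameter names and the output location
differ between HBase versions, so check the servlet source for yours:

  # kick off a 30s CPU profile; the response says where the output lands
  curl 'http://<regionserver>:16030/prof?event=cpu&duration=30'
  # once done, the flamegraph is served back via HTTP (e.g. under /prof-output)
)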

On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:

> > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > time in our very latency-sensitive production. It has no noticeable
> > > overhead in my experience and doesn't need any special dependencies.
> >
> > I have to admit, I have never used async-profiler. Shame on me.
> > That is a fabulous hint and I'll read up on it immediately.
>
> I've now read up on it, tried it locally, stumbled over
> https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> 2.4 fails weirdly on Java 21, only to find out (I should have read
> the whole docs earlier) that it's hard to run async-profiler in a
> container.
> For us, this is all running on Kubernetes, so we'll test that today.
>
> Testing it locally, it looked very promising.
>
> > >
> > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > cannot run a profiler but I _can_ run modified code.
> > > >
> > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > long certain operations take at various points in the code (e.g. [2]).
> > > > I've been doing this a few rounds and have now arrived at the
> > > > StoreScanner.
> > > >
> > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > I am an HBase committer but I haven't been able to keep up with the
> > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > workings anymore and would appreciate a hint.
> > > >
> > > > I suspect it is slowness related to storage access.
> > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > access, does such a thing exist?
> > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > end? There are so many layers.
> > > >
> > > > Thank you!
> > > >
> > > > Cheers,
> > > > Lars
> > > >
> > > >
> > > > [1] <
> > > >
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > > > >
> > > > [2] <
> > > >
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> > > > >
> > > >
>


Re: Debugging slowness in Gets

2023-12-05 Thread Lars Francke
> > Also, are you sure you couldn't use async-profiler? We use this all the
> > time in our very latency-sensitive production. It has no noticeable
> > overhead in my experience and doesn't need any special dependencies.
>
> I have to admit, I have never used async-profiler. Shame on me.
> That is a fabulous hint and I'll read up on it immediately.

I've now read up on it, tried it locally, stumbled over
https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
2.4 fails weirdly on Java 21, only to find out (I should have read
the whole docs earlier) that it's hard to run async-profiler in a
container.
For us, this is all running on Kubernetes, so we'll test that today.

Testing it locally, it looked very promising.

> >
> > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke  wrote:
> >
> > > Hi,
> > >
> > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > cannot run a profiler but I _can_ run modified code.
> > >
> > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > long certain operations take at various points in the code (e.g. [2]).
> > > I've been doing this a few rounds and have now arrived at the 
> > > StoreScanner.
> > >
> > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > I am an HBase committer but I haven't been able to keep up with the
> > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > workings anymore and would appreciate a hint.
> > >
> > > I suspect it is slowness related to storage access.
> > > I was not able to find any logs or tweaks to log "slow storage"
> > > access, does such a thing exist?
> > > And something else that'd help me: Can anyone point me (if it exists)
> > > at the (vicinity of the) code that actually reads from HDFS at the
> > > end? There are so many layers.
> > >
> > > Thank you!
> > >
> > > Cheers,
> > > Lars
> > >
> > >
> > > [1] <
> > > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > > >
> > > [2] <
> > > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> > > >
> > >


Re: Debugging slowness in Gets

2023-12-05 Thread Lars Francke
Holy shit, the two of you were fast. Thank you!

On Tue, Dec 5, 2023 at 10:15 PM Bryan Beaudreault  wrote:
>
> Where exactly in StoreScanner are you able to pinpoint the slowness to?

I'm still at the very beginning. I started at RpcServer and have made
my way to RegionScannerImpl.
It is always about a day between me adding more debug code and getting
results back so it's slow going.
That's why I reached out now to see if others have good ideas on
speeding this up a little.
Turns out that was a good idea. Thank you!
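
(A note for anyone reading along in the archives: the "MethodTimer" in
question is essentially a stopwatch that logs when a call site exceeds a
threshold. A minimal sketch, with hypothetical names and a hypothetical
50 ms threshold; the patch linked as [1] in the quoted mail below is the
real implementation:

  public final class MethodTimer implements AutoCloseable {
    private final String label;
    private final long startNanos = System.nanoTime();

    public MethodTimer(String label) {
      this.label = label;
    }

    @Override
    public void close() {
      // Log only the slow cases; 50 ms is a hypothetical threshold.
      long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
      if (elapsedMs > 50) {
        System.err.println("[timing] " + label + " took " + elapsedMs + " ms");
      }
    }
  }

  // At a suspect call site:
  // try (MethodTimer t = new MethodTimer("StoreScanner.next")) {
  //   ... existing code ...
  // }
)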

> You might try going deeper into StoreFileScanner -> HFileReaderImpl ->
> HFileBlock. In my experience, the code path follows through there as you get
> closer to the FS, but the complexity of the code increases as well. The
> ultimate method just before HDFS is typically around FSReaderImpl.readAtOffset,
> in HFileBlock.java.
>
> I'm unsure if it's available in your version, but histograms exist for fs
> access. Search JMX for FsReadTime.

Will do.
Sorry, should have added. This is on 2.4 (I know, we had issues
upgrading to 2.5 and haven't had a chance to investigate).

>
> You could try seeing how many blocks are required to read the given rows.
> For large rows or rows that span multiple storefiles, this can affect
> performance a lot. An example of how to do that is here:
> https://github.com/apache/hbase/pull/4967

Thanks! Will check but I'm pretty sure all rows are tiny. I'll double
check though.

> Also, are you sure you couldn't use async-profiler? We use this all the
> time in our very latency-sensitive production. It has no noticeable
> overhead in my experience and doesn't need any special dependencies.

I have to admit, I have never used async-profiler. Shame on me.
That is a fabulous hint and I'll read up on it immediately.

Thank you again!

>
> On Tue, Dec 5, 2023 at 3:46 PM Lars Francke  wrote:
>
> > Hi,
> >
> > I am debugging an issue where we see some Get requests taking 2-5s.
> > We do see "responseTooSlow" etc. and this is in an environment where I
> > cannot run a profiler but I _can_ run modified code.
> >
> > So what I did was I added a stupid "MethodTimer"[1] which records how
> > long certain operations take at various points in the code (e.g. [2]).
> > I've been doing this a few rounds and have now arrived at the StoreScanner.
> >
> > I'm wondering if anyone has better ideas on how to diagnose this?
> > I am an HBase committer but I haven't been able to keep up with the
> > changes in the last 5-6 years so I'm not too familiar with the inner
> > workings anymore and would appreciate a hint.
> >
> > I suspect it is slowness related to storage access.
> > I was not able to find any logs or tweaks to log "slow storage"
> > access, does such a thing exist?
> > And something else that'd help me: Can anyone point me (if it exists)
> > at the (vicinity of the) code that actually reads from HDFS at the
> > end? There are so many layers.
> >
> > Thank you!
> >
> > Cheers,
> > Lars
> >
> >
> > [1] <
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > >
> > [2] <
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> > >
> >


Re: Debugging slowness in Gets

2023-12-05 Thread Lars Francke
On Tue, Dec 5, 2023 at 10:15 PM Wellington Chevreuil  wrote:
>
> >
> > I suspect it is slowness related to storage access.
> > I was not able to find any logs or tweaks to log "slow storage"
> > access, does such a thing exist?
> >
> I believe I have seen warnings for slow syncs from the hdfs layer, but
> don't remember any for the read path.

Likewise, yeah.
I'll see if I can add something going forward, that might be useful.
If I can find the right spots for it.
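
(A minimal sketch of what such slow-read logging could look like, with a
hypothetical helper and threshold; this is not actual HBase code, and the
real spot would be near the block read that Wellington points at below:

  import java.util.concurrent.Callable;
  import java.util.concurrent.TimeUnit;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public final class SlowReadLog {
    private static final Logger LOG = LoggerFactory.getLogger(SlowReadLog.class);
    private static final long THRESHOLD_MS = 100; // hypothetical threshold

    // Runs the given read and warns when it exceeded the threshold.
    public static <T> T timed(String what, Callable<T> read) throws Exception {
      long start = System.nanoTime();
      try {
        return read.call();
      } finally {
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (ms >= THRESHOLD_MS) {
          LOG.warn("Slow read: {} took {} ms", what, ms);
        }
      }
    }
  }
)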

> And something else that'd help me: Can anyone point me (if it exists)
> at the (vicinity of the) code that actually reads from HDFS at the
> end? There are so many layers.
>
> I think you might be interested in HFileBlock.readBlockData(). I can
> see we open an FSDataInputStream there, then pass it to some other methods,
> using byte buffers and Hadoop IOUtils to read from HDFS.

Excellent, thank you!
I'm sometimes wondering if it would make sense to document the
architecture somewhere but the audience is small and it would need to
be kept in sync.
That's kept me from doing that so far.


>
> On Tue, Dec 5, 2023 at 20:46, Lars Francke  wrote:
>
> > Hi,
> >
> > I am debugging an issue where we see some Get requests taking 2-5s.
> > We do see "responseTooSlow" etc. and this is in an environment where I
> > cannot run a profiler but I _can_ run modified code.
> >
> > So what I did was I added a stupid "MethodTimer"[1] which records how
> > long certain operations take at various points in the code (e.g. [2]).
> > I've been doing this a few rounds and have now arrived at the StoreScanner.
> >
> > I'm wondering if anyone has better ideas on how to diagnose this?
> > I am an HBase committer but I haven't been able to keep up with the
> > changes in the last 5-6 years so I'm not too familiar with the inner
> > workings anymore and would appreciate a hint.
> >
> > I suspect it is slowness related to storage access.
> > I was not able to find any logs or tweaks to log "slow storage"
> > access, does such a thing exist?
> > And something else that'd help me: Can anyone point me (if it exists)
> > at the (vicinity of the) code that actually reads from HDFS at the
> > end? There are so many layers.
> >
> > Thank you!
> >
> > Cheers,
> > Lars
> >
> >
> > [1] <
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> > >
> > [2] <
> > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> > >
> >


Re: Debugging slowness in Gets

2023-12-05 Thread Bryan Beaudreault
Where exactly in StoreScanner are you able to pinpoint the slowness to? You
might try going deeper into StoreFileScanner -> HFileReaderImpl ->
HFileBlock. In my experience, the code path follows through there as you get closer
to the FS, but the complexity of the code increases as well. The ultimate
method just before HDFS is typically around FSReaderImpl.readAtOffset, in
HFileBlock.java.

I'm unsure if it's available in your version, but histograms exist for fs
access. Search JMX for FsReadTime.
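
(A hedged way to look for them, using the info server's /jmx endpoint on
the default port; bean and metric names vary by version:

  curl -s 'http://<regionserver>:16030/jmx' | grep -i fsreadtime
)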

You could try seeing how many blocks are required to read the given rows.
For large rows or rows that span multiple storefiles, this can affect
performance a lot. An example of how to do that is here:
https://github.com/apache/hbase/pull/4967

Also, are you sure you couldn't use async-profiler? We use this all the
time in our very latency-sensitive production. It has no noticeable
overhead in my experience and doesn't need any special dependencies.

On Tue, Dec 5, 2023 at 3:46 PM Lars Francke  wrote:

> Hi,
>
> I am debugging an issue where we see some Get requests taking 2-5s.
> We do see "responseTooSlow" etc. and this is in an environment where I
> cannot run a profiler but I _can_ run modified code.
>
> So what I did was I added a stupid "MethodTimer"[1] which records how
> long certain operations take at various points in the code (e.g. [2]).
> I've been doing this a few rounds and have now arrived at the StoreScanner.
>
> I'm wondering if anyone has better ideas on how to diagnose this?
> I am an HBase committer but I haven't been able to keep up with the
> changes in the last 5-6 years so I'm not too familiar with the inner
> workings anymore and would appreciate a hint.
>
> I suspect it is slowness related to storage access.
> I was not able to find any logs or tweaks to log "slow storage"
> access, does such a thing exist?
> And something else that'd help me: Can anyone point me (if it exists)
> at the (vicinity of the) code that actually reads from HDFS at the
> end? There are so many layers.
>
> Thank you!
>
> Cheers,
> Lars
>
>
> [1] <
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> >
> [2] <
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> >
>


Re: Debugging slowness in Gets

2023-12-05 Thread Wellington Chevreuil
>
> I suspect it is slowness related to storage access.
> I was not able to find any logs or tweaks to log "slow storage"
> access, does such a thing exist?
>
I believe I have seen warnings for slow syncs from the hdfs layer, but
don't remember any for the read path.

> And something else that'd help me: Can anyone point me (if it exists)
> at the (vicinity of the) code that actually reads from HDFS at the
> end? There are so many layers.

I think you might be interested in HFileBlock.readBlockData(). I can
see we open an FSDataInputStream there, then pass it to some other methods,
using byte buffers and Hadoop IOUtils to read from HDFS.

On Tue, Dec 5, 2023 at 20:46, Lars Francke  wrote:

> Hi,
>
> I am debugging an issue where we see some Get requests taking 2-5s.
> We do see "responseTooSlow" etc. and this is in an environment where I
> cannot run a profiler but I _can_ run modified code.
>
> So what I did was I added a stupid "MethodTimer"[1] which records how
> long certain operations take at various points in the code (e.g. [2]).
> I've been doing this a few rounds and have now arrived at the StoreScanner.
>
> I'm wondering if anyone has better ideas on how to diagnose this?
> I am an HBase committer but I haven't been able to keep up with the
> changes in the last 5-6 years so I'm not too familiar with the inner
> workings anymore and would appreciate a hint.
>
> I suspect it is slowness related to storage access.
> I was not able to find any logs or tweaks to log "slow storage"
> access, does such a thing exist?
> And something else that'd help me: Can anyone point me (if it exists)
> at the (vicinity of the) code that actually reads from HDFS at the
> end? There are so many layers.
>
> Thank you!
>
> Cheers,
> Lars
>
>
> [1] <
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5
> >
> [2] <
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297
> >
>