[jira] [Commented] (HBASE-26519) StoreFileScanner parallel seek -- productionize or drop?
[ https://issues.apache.org/jira/browse/HBASE-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451902#comment-17451902 ]

Andrew Kyle Purtell commented on HBASE-26519:

Unless the potential payoff is significant (yes, this might be hard to guess) I would vote for dropping a complex and incomplete (IMHO) disabled-by-default 'feature' that is, I would estimate, rarely used if at all, probably not at all.

> StoreFileScanner parallel seek -- productionize or drop?
>
> Key: HBASE-26519
> URL: https://issues.apache.org/jira/browse/HBASE-26519
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Priority: Minor
>
> hbase.storescanner.parallel.seek.enable was added a few years ago in
> https://issues.apache.org/jira/browse/HBASE-7495, but still defaults to
> disabled. The description of that says "Enables StoreFileScanner
> parallel-seeking in StoreScanner, a feature which can reduce response latency
> under special conditions".
>
> It's not very clear what "special conditions" means. Reading through the
> entire comment history on that issue seems to indicate it can help when you
> have "high random read, low cache hit rate, many store files".
>
> We have a bunch of clusters with this shape, and in fact we use SSDs for all
> storage so I figured this might help a lot. I tried setting this to true on
> one RegionServer of one of our highest QPS clusters hoping I'd see some clear
> improvement. This very simple test was pretty much a wash, so I need to do
> more methodical testing.
>
> In the test one thing became clear though – is the default thread pool size
> of 10 good enough for my use-case? I have no way of knowing, as there is no
> logging or metrics that I can find around thread pool saturation. What I
> ended up doing was spamming refresh of the /dump endpoint of the RS, and
> noticed that there were sometimes 1-5 tasks queued for the RS_PARALLEL_SEEK
> executor. This indicates maybe I should scale the thread pool, but use-cases
> change over time so this seems like not a great way to determine that.
>
> Task queuing seems not great for a feature which is aimed at reducing
> latencies. I wonder if we should consider some changes to make this more easy
> to deploy in production. Here are some ideas I had:
> * Can we generate a better default value for the thread pool size, maybe
> based on number of RS handler threads or some other heuristic?
> * Should we consider eliminating queuing for this feature? Instead, if the
> threadpool is saturated run the seek in-line in the current thread (i.e.
> revert to normal). This would be more similar to how hedged reads work in
> HDFS.
> * Can we expose a metric or logging to help operators know when to scale up
> the thread pool? If we implemented the 2nd option above we could expose
> "seeksInCurrentThread" counter to track this, again similar to how hedged
> reads report on saturation.
>
> But with all of this said, I wonder if anyone is running this in production
> and has any updated guidance on when to use this? Does it still make sense
> given the last 8 years of development in HBase? Would it ever make sense to
> make it enabled by default?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
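
The second and third ideas in the quoted description (no queuing, fall back to the calling thread, count the fallbacks) could be sketched roughly as below. This is only an illustration under stated assumptions, not HBase code: the class `InlineFallbackSeeker` and its methods are hypothetical, each seek is modelled as a plain `Runnable`, and only the counter name `seeksInCurrentThread` is taken from the issue text. It relies on standard `java.util.concurrent` behaviour: a `ThreadPoolExecutor` backed by a `SynchronousQueue` rejects a task immediately when every worker is busy.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: parallel seeks with no task queue and an inline fallback.
final class InlineFallbackSeeker {
    private final ThreadPoolExecutor pool;
    // Counter name borrowed from the issue; counts seeks run by the caller.
    private final LongAdder seeksInCurrentThread = new LongAdder();

    InlineFallbackSeeker(int threads) {
        // SynchronousQueue holds no tasks, so a saturated pool rejects
        // new work right away instead of queuing it.
        this.pool = new ThreadPoolExecutor(
            threads, threads, 60L, TimeUnit.SECONDS,
            new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
    }

    // Runs every seek, in the pool when a worker is free, otherwise inline,
    // and returns once all of them have finished.
    void seekAll(List<Runnable> seeks) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(seeks.size());
        for (Runnable seek : seeks) {
            Runnable task = () -> {
                try {
                    seek.run();
                } finally {
                    done.countDown();
                }
            };
            try {
                pool.execute(task);
            } catch (RejectedExecutionException e) {
                // Pool saturated: record it and seek in the current thread.
                seeksInCurrentThread.increment();
                task.run();
            }
        }
        done.await();
    }

    long getSeeksInCurrentThread() {
        return seeksInCurrentThread.sum();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

With this shape, a counter that keeps growing would suggest the pool is too small for the workload, which is the saturation signal the issue asks for. The JDK's built-in `ThreadPoolExecutor.CallerRunsPolicy` gives the same inline fallback without the counter.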
[jira] [Commented] (HBASE-26519) StoreFileScanner parallel seek -- productionize or drop?
[ https://issues.apache.org/jira/browse/HBASE-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451801#comment-17451801 ]

Bryan Beaudreault commented on HBASE-26519:

Good call. Done. I'll leave this open for now and relay any decisions or close it once discussion has finished.
[jira] [Commented] (HBASE-26519) StoreFileScanner parallel seek -- productionize or drop?
[ https://issues.apache.org/jira/browse/HBASE-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451691#comment-17451691 ]

Reid Chan commented on HBASE-26519:

Sounds more like a discussion topic. Would you mind posting it to dev@hbase email.