On 2018/11/08 00:13:39, "Matthias J. Sax" <matth...@confluent.io> wrote:
> That is what I try to figure out. I went over the 0.10.2.2 to 0.11.0.3
> Jiras but found nothing I could point out. There are couple of
> SessionStore related tickets, but none of them should have an effect
> like this.
>
> To narrow it down, it would be helpful to test with other versions, too.
> Maybe 0.10.2.2 and 0.11.0.0 to see when the issue was introduced.

Done. So far here's what my tests have shown:

With 0.10.2.1 (the current version we're running) and 0.10.2.2, the local cache works properly and we see thread profiles similar to what I posted earlier, where the majority of time is spent in RocksDB and there's no lag.

Testing with 0.11.0.0, 0.11.0.3, 1.1.1, 2.0.0 and 2.0.1 all show us spending the majority of time in the local cache, and we lag considerably: https://imgur.com/l5VEsC2

> Can you also profile v0.10.2.1 so we can compare?

Here's a recent profile for 0.10.2.1: https://imgur.com/a/Sto636s

> > What would you recommend for our next steps?
>
> Not sure. If you could help us to track down the issue, that would be
> most helpful so get a fix (and you could run from a SNAPSHOT version to
> get the fix -- not sure if this would be an option for you).

Another developer took a look at the code and he had some thoughts:

"It appears we're scanning an order of magnitude more keys for every call to `findSessions`. You can see this manifest in the flush logs where version 0.11.0.3 and later will have a billion hits on the cache in 10 minutes, even though the number of events consumed is only 1M. It seems like when they made some fixes to make sure all possible windows for a session merge are found that resulted in having to scan every entry in the cache."

Is there a way for us to refine the cache search so we're not searching the entire key space?
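
To make the question concrete, here is a toy model of the difference we have in mind. It is only a sketch and not Kafka Streams code: the `SessionKey` layout (record key, then session end, then session start), the `TreeMap` standing in for the cache, and every name in it are assumptions made up for illustration. It compares a lookup that visits every cached entry with one that restricts the scan to the range for a single record key.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: this is NOT Kafka Streams' cache or SessionStore code.
class SessionScanSketch {

    /** Hypothetical composite key: record key plus session end/start timestamps. */
    static final class SessionKey {
        final String key;
        final long end;
        final long start;

        SessionKey(String key, long end, long start) {
            this.key = key;
            this.end = end;
            this.start = start;
        }
    }

    // Assumed ordering: record key first, then session end, then session start.
    static final Comparator<SessionKey> ORDER =
            Comparator.comparing((SessionKey k) -> k.key)
                    .thenComparingLong(k -> k.end)
                    .thenComparingLong(k -> k.start);

    /** How many entries a lookup visited and how many sessions matched. */
    static final class Result {
        final int visited;
        final int matches;

        Result(int visited, int matches) {
            this.visited = visited;
            this.matches = matches;
        }
    }

    /** Visits every entry in the store and filters afterwards. */
    static Result scanAll(NavigableMap<SessionKey, Long> store,
                          String key, long earliestEnd, long latestStart) {
        int visited = 0;
        int matches = 0;
        for (SessionKey k : store.keySet()) {
            visited++;
            if (k.key.equals(key) && k.end >= earliestEnd && k.start <= latestStart) {
                matches++;
            }
        }
        return new Result(visited, matches);
    }

    /** Visits only entries for this key whose session end is >= earliestEnd. */
    static Result scanBounded(NavigableMap<SessionKey, Long> store,
                              String key, long earliestEnd, long latestStart) {
        SessionKey from = new SessionKey(key, earliestEnd, Long.MIN_VALUE);
        SessionKey to = new SessionKey(key, Long.MAX_VALUE, Long.MAX_VALUE);
        int visited = 0;
        int matches = 0;
        for (SessionKey k : store.subMap(from, true, to, true).keySet()) {
            visited++;
            if (k.start <= latestStart) {
                matches++;
            }
        }
        return new Result(visited, matches);
    }

    /** Builds a store with non-overlapping sessions [s*100, s*100 + 50] per key. */
    static NavigableMap<SessionKey, Long> sampleStore(int keys, int sessionsPerKey) {
        NavigableMap<SessionKey, Long> store = new TreeMap<>(ORDER);
        for (int i = 0; i < keys; i++) {
            for (int s = 0; s < sessionsPerKey; s++) {
                long start = s * 100L;
                store.put(new SessionKey("user-" + i, start + 50, start), 1L);
            }
        }
        return store;
    }

    public static void main(String[] args) {
        NavigableMap<SessionKey, Long> store = sampleStore(1000, 10);
        Result all = scanAll(store, "user-42", 420, 480);
        Result bounded = scanBounded(store, "user-42", 420, 480);
        System.out.println("full scan:    visited=" + all.visited + " matches=" + all.matches);
        System.out.println("bounded scan: visited=" + bounded.visited + " matches=" + bounded.matches);
    }
}
```

Both lookups return the same session in this model, but the full scan touches all 10,000 entries while the bounded one touches only the handful stored for that key. Whether the real cache can bound its search this way depends on how it actually lays out and orders session keys, which we have not verified.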