Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-25 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2077577726 Of course! Thank you for contributing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-24 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2075607127 You just made my week!   Thank you so much for taking the time to work through my buggy iterations of this — I really appreciate it! -- This is an automated message from the

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-23 Thread via GitHub
gianm merged PR #16153: URL: https://github.com/apache/druid/pull/16153 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-23 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2074084352 I just had a look at the latest patch. The list-of-patterns approach makes more intuitive sense to me: the LIKE pattern is split on `%` into fixed-length subpatterns, then each

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-16 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2060362644 @twilliamson thanks! I will have another look at the latest patch. Hopefully should have some time for that tomorrow. Updating this PR works for me. -- This is an automated message from

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-16 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2059712591 The overnight dedicated-box perf tests finished, and they look good. I've gone ahead an updated this PR with the list-of-`java.util.regex.Pattern` approach. Hopefully that's a good

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-15 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2057525485 I got the custom implementation to work, but it's pretty complicated, and I'm bit worried about there being more bugs. Similarly, switching the regex library to `joni` or `re2`

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-11 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2050082812 You're right: I'm making the same mistake as with the suffix — and now worrying I left the same bug in the Facebook stream processing system…  Sequences of literal characters and

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-11 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2049007865 This test has an issue on the latest patch: ``` assertMatch("%1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.TRUE); ``` The matcher strips off the `6` to get `1 2 3 1 4 5

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-09 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2046538087 OK, thanks, I'll keep an eye out for the update. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-09 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2046252316 Backtracking isn't needed – the bug in my current PR is in the suffix-handling logic. ☹️ At a high level, there are three parts to the match: - Must start with (the prefix)

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-09 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2046220895 Just looked more deeply, I think there's a bug in the logic around the greediness of the matching. This test fails on the patch but passes on master: ``` assertMatch("%ba_", "foo

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-09 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2046146984 > @abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress? Oh, I don't think the discussion about regex libraries

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-08 Thread via GitHub
abhishekagarwal87 commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2044166247 @imply-cheddar - Would be good to know what author thinks of adding support for other regex libraries that do not suffer from catastrophic backtracking in this PR. If not, we

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-08 Thread via GitHub
imply-cheddar commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2044150218 @abhishekagarwal87 @gianm I was just looking at this PR, not sure what you are suggesting for how to make forward progress? -- This is an automated message from the Apache Git

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-05 Thread via GitHub
twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2040716015 This info is from 5 or 6 years back while working on stream processing systems at Facebook, but my recollection is that `re2j` had issues with UTF-8 multi-byte sequences. Not sure if

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-05 Thread via GitHub
abhishekagarwal87 commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2039313940 @gianm - https://github.com/google/re2j seems to fit the bill. A cluster-wide setting that switches the regex-based filters to use this library instead of the standard Java

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-04-01 Thread via GitHub
gianm commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2030550437 For `LIKE` it seems reasonable to have an optimized matcher specifically geared towards the needs of `LIKE`. I am disappointed that the JDK regexp library doesn't handle these relatively

Re: [PR] Improve worst-case performance of LIKE filters by 20x (druid)

2024-03-20 Thread via GitHub
abhishekagarwal87 commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2009025245 https://github.com/spring-projects/spring-framework/blob/main/spring-expression/src/main/java/org/springframework/expression/spel/ast/OperatorMatches.java#L128 seems like a