Re: [HACKERS] WIP: lookbehind constraints for our regexp engine
On Friday, October 16, 2015, Tom Lane wrote: > > Anyway, I'm not planning to do much more work on this right now, but > I thought I'd throw it out there just to let people know that this seems > within reach. I'm curious to know how many people care, and how much. > > +1 It's hard to quantify how much but look behind is a tool that I find occasion to use. My most recent need involved ensuring that the current position after processing two variable length matches is exactly n characters from the start of the string. I don't have any immediate projects that would be refactored into a more pure SQL variation for having this feature so on that the strength is fairly low. But if you are going to do it then now, while it is fresh, would be the best time. David J.
Re: [HACKERS] WIP: lookbehind constraints for our regexp engine
On Oct 17, 2015 12:53 AM, "Tom Lane" wrote: > > I've occasionally heard complaints that our regex engine only has > lookahead constraints not lookbehind constraints, while Perl's for example > does have those. While I was fooling around with the other issues in that > code, I learned enough to realize that it would not be that hard to > implement such things, and the attached proof-of-concept patch does so > (using Perl-compatible syntax, in case you're wondering). > > However, I don't feel like this is done to the point of being committable, > because its performance leaves something to be desired in a lot of cases. > The difficulty is that because the engine can only match in a forward > direction, in the worst case you have to check a lookbehind constraint by > trying to match starting from the string start to see if a match exists > ending at the current-location where you're testing the constraint. That > makes it, worst case, O(N^2) to search a string of length N. Now, you can > improve on that greatly if you can determine that possible matches don't > extend all the way back to the string start. In the attached patch I use > the "cold start pointer" returned by shortest() to decide that characters > before the coldstart point no longer have to be rechecked. That helps > some, but it's not good enough because there are too many cases where the > engine doesn't move the coldstart point forward very aggressively. > > It might be that that behavior itself can be improved on, which would > be nice because there are also non-lookbehind-related scenarios where > smarter coldstart detection would help performance. But I have no very > good ideas about how to do it. > > Another idea I've been toying with is to add logic that tries to determine > the maximum possible match length for a regex. If we can determine that > for the lookbehind sub-regex, then we'd know we have to back up at most > that far before applying the match check. This seems promising because > a lot of other engines don't even support variable-length lookbehind > patterns at all, and so we'd be assured of good performance in all the > cases that competitors can handle at all. > > Anyway, I'm not planning to do much more work on this right now, but > I thought I'd throw it out there just to let people know that this seems > within reach. I'm curious to know how many people care, and how much. +1 I'm one of those who wanted lookbehind, and it would give us complete lookaround so very much welcome. Thom
[HACKERS] WIP: lookbehind constraints for our regexp engine
I've occasionally heard complaints that our regex engine only has lookahead constraints not lookbehind constraints, while Perl's for example does have those. While I was fooling around with the other issues in that code, I learned enough to realize that it would not be that hard to implement such things, and the attached proof-of-concept patch does so (using Perl-compatible syntax, in case you're wondering). However, I don't feel like this is done to the point of being committable, because its performance leaves something to be desired in a lot of cases. The difficulty is that because the engine can only match in a forward direction, in the worst case you have to check a lookbehind constraint by trying to match starting from the string start to see if a match exists ending at the current-location where you're testing the constraint. That makes it, worst case, O(N^2) to search a string of length N. Now, you can improve on that greatly if you can determine that possible matches don't extend all the way back to the string start. In the attached patch I use the "cold start pointer" returned by shortest() to decide that characters before the coldstart point no longer have to be rechecked. That helps some, but it's not good enough because there are too many cases where the engine doesn't move the coldstart point forward very aggressively. It might be that that behavior itself can be improved on, which would be nice because there are also non-lookbehind-related scenarios where smarter coldstart detection would help performance. But I have no very good ideas about how to do it. Another idea I've been toying with is to add logic that tries to determine the maximum possible match length for a regex. If we can determine that for the lookbehind sub-regex, then we'd know we have to back up at most that far before applying the match check. This seems promising because a lot of other engines don't even support variable-length lookbehind patterns at all, and so we'd be assured of good performance in all the cases that competitors can handle at all. Anyway, I'm not planning to do much more work on this right now, but I thought I'd throw it out there just to let people know that this seems within reach. I'm curious to know how many people care, and how much. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2946122..4d482ec 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *** SELECT foo FROM regexp_split_to_table('t *** 4477,4489 where no substring matching re begins (AREs only) ! Lookahead constraints cannot contain back references ! (see ), and all parentheses within them are considered non-capturing. --- 4477,4503 where no substring matching re begins (AREs only) + + + (?<=re) + positive lookbehind matches at any point +where a substring matching re ends +(AREs only) + + + + (?to_state" or "$1->to_state" for end-of-string and end-of-line constraints respectively. ! LACON constraints, which represent "(?=re)" and "(?!re)" constraints, ! i.e. the input starting at this point must match (or not match) a ! given sub-RE, but the matching input is not consumed. These are ! dumped as ":subtree_number:->to_state". If you see anything else (especially any question marks) in the display of an arc, it's dumpnfa() trying to tell you that there's somethin