Re: [HACKERS] WIP: lookbehind constraints for our regexp engine

2015-10-17 Thread David G. Johnston
On Friday, October 16, 2015, Tom Lane  wrote:

>
> Anyway, I'm not planning to do much more work on this right now, but
> I thought I'd throw it out there just to let people know that this seems
> within reach.  I'm curious to know how many people care, and how much.
>
>
+1

It's hard to quantify how much but look behind is a tool that I find
occasion to use.  My most recent need involved ensuring that the current
position after processing two variable length matches is exactly n
characters from the start of the string.

I don't have any immediate projects that would be refactored into a more
pure SQL variation for having this feature so on that the strength is
fairly low.  But if you are going to do it then now, while it is fresh,
would be the best time.

David J.


Re: [HACKERS] WIP: lookbehind constraints for our regexp engine

2015-10-17 Thread Thom Brown
On Oct 17, 2015 12:53 AM, "Tom Lane"  wrote:
>
> I've occasionally heard complaints that our regex engine only has
> lookahead constraints not lookbehind constraints, while Perl's for example
> does have those.  While I was fooling around with the other issues in that
> code, I learned enough to realize that it would not be that hard to
> implement such things, and the attached proof-of-concept patch does so
> (using Perl-compatible syntax, in case you're wondering).
>
> However, I don't feel like this is done to the point of being committable,
> because its performance leaves something to be desired in a lot of cases.
> The difficulty is that because the engine can only match in a forward
> direction, in the worst case you have to check a lookbehind constraint by
> trying to match starting from the string start to see if a match exists
> ending at the current-location where you're testing the constraint.  That
> makes it, worst case, O(N^2) to search a string of length N.  Now, you can
> improve on that greatly if you can determine that possible matches don't
> extend all the way back to the string start.  In the attached patch I use
> the "cold start pointer" returned by shortest() to decide that characters
> before the coldstart point no longer have to be rechecked.  That helps
> some, but it's not good enough because there are too many cases where the
> engine doesn't move the coldstart point forward very aggressively.
>
> It might be that that behavior itself can be improved on, which would
> be nice because there are also non-lookbehind-related scenarios where
> smarter coldstart detection would help performance.  But I have no very
> good ideas about how to do it.
>
> Another idea I've been toying with is to add logic that tries to determine
> the maximum possible match length for a regex.  If we can determine that
> for the lookbehind sub-regex, then we'd know we have to back up at most
> that far before applying the match check.  This seems promising because
> a lot of other engines don't even support variable-length lookbehind
> patterns at all, and so we'd be assured of good performance in all the
> cases that competitors can handle at all.
>
> Anyway, I'm not planning to do much more work on this right now, but
> I thought I'd throw it out there just to let people know that this seems
> within reach.  I'm curious to know how many people care, and how much.

+1

I'm one of those who wanted lookbehind, and it would give us complete
lookaround so very much welcome.

Thom


[HACKERS] WIP: lookbehind constraints for our regexp engine

2015-10-16 Thread Tom Lane
I've occasionally heard complaints that our regex engine only has
lookahead constraints not lookbehind constraints, while Perl's for example
does have those.  While I was fooling around with the other issues in that
code, I learned enough to realize that it would not be that hard to
implement such things, and the attached proof-of-concept patch does so
(using Perl-compatible syntax, in case you're wondering).

However, I don't feel like this is done to the point of being committable,
because its performance leaves something to be desired in a lot of cases.
The difficulty is that because the engine can only match in a forward
direction, in the worst case you have to check a lookbehind constraint by
trying to match starting from the string start to see if a match exists
ending at the current-location where you're testing the constraint.  That
makes it, worst case, O(N^2) to search a string of length N.  Now, you can
improve on that greatly if you can determine that possible matches don't
extend all the way back to the string start.  In the attached patch I use
the "cold start pointer" returned by shortest() to decide that characters
before the coldstart point no longer have to be rechecked.  That helps
some, but it's not good enough because there are too many cases where the
engine doesn't move the coldstart point forward very aggressively.

It might be that that behavior itself can be improved on, which would
be nice because there are also non-lookbehind-related scenarios where
smarter coldstart detection would help performance.  But I have no very
good ideas about how to do it.

Another idea I've been toying with is to add logic that tries to determine
the maximum possible match length for a regex.  If we can determine that
for the lookbehind sub-regex, then we'd know we have to back up at most
that far before applying the match check.  This seems promising because
a lot of other engines don't even support variable-length lookbehind
patterns at all, and so we'd be assured of good performance in all the
cases that competitors can handle at all.

Anyway, I'm not planning to do much more work on this right now, but
I thought I'd throw it out there just to let people know that this seems
within reach.  I'm curious to know how many people care, and how much.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2946122..4d482ec 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT foo FROM regexp_split_to_table('t
*** 4477,4489 
 where no substring matching re begins
 (AREs only) 
 

   
  
  
 
! Lookahead constraints cannot contain back references
! (see ),
  and all parentheses within them are considered non-capturing.
 
 
--- 4477,4503 
 where no substring matching re begins
 (AREs only) 
 
+ 
+
+ (?<=re) 
+ positive lookbehind matches at any point
+where a substring matching re ends
+(AREs only) 
+
+ 
+
+ (?to_state" or "$1->to_state" for end-of-string and end-of-line
  constraints respectively.
  
! LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
! i.e. the input starting at this point must match (or not match) a
! given sub-RE, but the matching input is not consumed.  These are
! dumped as ":subtree_number:->to_state".
  
  If you see anything else (especially any question marks) in the display of
  an arc, it's dumpnfa() trying to tell you that there's somethin