Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-12-25 Thread Joel Jacobson
On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson"  writes:
> > [ 0005-regexp-positions.patch ]
>
> I took a quick look at this patch.  I am on board now with the general
> idea of returning match positions, but I think there are still
> definitional issues to be discussed.
>
> 1. The main idea we might perhaps want to adopt from the Oracle-ish
> regexp functions is a search-start-position argument.  I'm not sure
> that this is exciting --- if we intend this function to be a close
> analog of regexp_matches(), it'd be better to leave that off.  But
> it deserves explicit consideration.

I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.

> 2. It looks like you modeled this on regexp_matches() to the extent
> of returning a set and using the 'g' flag to control whether the
> set can actually contain more than one row.  That's pretty ancient
> precedent ... but we have revisited that design more than once
> because regexp_matches() is just too much of a pain to use.  I think
> if we're going to do this, we should learn from history, and provide an
> analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() 
function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

> 3. The API convention you chose (separate start and length arrays)
> is perhaps still confusing.  When I first looked at the test case
>
> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
> + regexp_positions  
> +---
> + ("{4,7}","{3,5}")
> +(1 row)
>
> I thought it was all wrong because it seemed to be identifying
> the substrings 'barbequ' and 'obarb'.  If there'd been a different
> number of matches than capture groups, maybe I'd not have been
> confused, but still...  I wonder if we'd be better advised to make
> N capture groups produce N two-element arrays, or else mash it all
> into one array of N*2 elements.  But this probably depends on which
> way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we 
stick to it.

> 4. Not sure about the handling of sub-matches.
> There are various plausible definitions we could use:
>
> * We return the position/length of the overall match, never mind
> about any parenthesized subexpressions.  This is simple but I think
> it loses significant functionality.  As an example, you might have
> a pattern like 'ab*(c*)d+' where what you actually want to know
> is where the 'c's are, but they have to be in the context described
> by the rest of the regexp.  Without subexpression-match capability
> that's painful to do.
>
> * If there's a capturing subexpression, return the position/length
> of the first such subexpression, else return the overall match.
> This matches the behavior of substring().
>
> * If there are capturing subexpression(s), return the
> positions/lengths of those, otherwise return the overall match.
> This agrees with the behavior of regexp_match(es), so I'd tend
> to lean to this option, but perhaps it's the hardest to use.
>
> * Return the position/length of the overall match *and* those of
> each capturing subexpression.  This is the most flexible choice,
> but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new 
users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as 
regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.

> As for comments on the patch itself:
>
> * The documentation includes an extraneous entry for regexp_replace,
> and it fails to add the promised paragraph to functions-posix-regexp.
>
> * This bit is evidently copied from regexp_matches:
>
> +/* be sure to copy the input string into the multi-call ctx */
> +matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
>
> regexp_matches needs to save the input string so that
> build_regexp_match_result can copy parts of that.  But
> regexp_positions has no such need AFAICS, so I think we
> don't need a long-lived copy of the string.
>
> * I wouldn't use these names in build_regexp_positions_result:
>
> +ArrayType  *so_ary;
> +ArrayType  *eo_ary;
>
> "so_ary" isn't awful, but it invites confusion with regex's "so"
> field, which hasn't got the same semantics (off by one).
> "eo_ary" is pretty bad because it isn't an "end offset" at all, but
> a length.  I'd go 

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-12-03 Thread Daniel Gustafsson
This review has gone unanswered for two months, so I'm marking this patch
Returned with Feedback.  Please feel free to resubmit when a new version of the
patch is available.

--
Daniel Gustafsson   https://vmware.com/





Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-09-21 Thread Tom Lane
"Joel Jacobson"  writes:
> [ 0005-regexp-positions.patch ]

I took a quick look at this patch.  I am on board now with the general
idea of returning match positions, but I think there are still
definitional issues to be discussed.

1. The main idea we might perhaps want to adopt from the Oracle-ish
regexp functions is a search-start-position argument.  I'm not sure
that this is exciting --- if we intend this function to be a close
analog of regexp_matches(), it'd be better to leave that off.  But
it deserves explicit consideration.

2. It looks like you modeled this on regexp_matches() to the extent
of returning a set and using the 'g' flag to control whether the
set can actually contain more than one row.  That's pretty ancient
precedent ... but we have revisited that design more than once
because regexp_matches() is just too much of a pain to use.  I think
if we're going to do this, we should learn from history, and provide an
analog to regexp_match() as well as regexp_matches() right off the bat.

3. The API convention you chose (separate start and length arrays)
is perhaps still confusing.  When I first looked at the test case

+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+---
+ ("{4,7}","{3,5}")
+(1 row)

I thought it was all wrong because it seemed to be identifying
the substrings 'barbequ' and 'obarb'.  If there'd been a different
number of matches than capture groups, maybe I'd not have been
confused, but still...  I wonder if we'd be better advised to make
N capture groups produce N two-element arrays, or else mash it all
into one array of N*2 elements.  But this probably depends on which
way is the easiest to work with in SQL.

4. Not sure about the handling of sub-matches.
There are various plausible definitions we could use:

* We return the position/length of the overall match, never mind
about any parenthesized subexpressions.  This is simple but I think
it loses significant functionality.  As an example, you might have
a pattern like 'ab*(c*)d+' where what you actually want to know
is where the 'c's are, but they have to be in the context described
by the rest of the regexp.  Without subexpression-match capability
that's painful to do.

* If there's a capturing subexpression, return the position/length
of the first such subexpression, else return the overall match.
This matches the behavior of substring().

* If there are capturing subexpression(s), return the
positions/lengths of those, otherwise return the overall match.
This agrees with the behavior of regexp_match(es), so I'd tend
to lean to this option, but perhaps it's the hardest to use.

* Return the position/length of the overall match *and* those of
each capturing subexpression.  This is the most flexible choice,
but I give it low marks since it matches no existing behaviors.


As for comments on the patch itself:

* The documentation includes an extraneous entry for regexp_replace,
and it fails to add the promised paragraph to functions-posix-regexp.

* This bit is evidently copied from regexp_matches:

+/* be sure to copy the input string into the multi-call ctx */
+matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,

regexp_matches needs to save the input string so that
build_regexp_match_result can copy parts of that.  But
regexp_positions has no such need AFAICS, so I think we
don't need a long-lived copy of the string.

* I wouldn't use these names in build_regexp_positions_result:

+ArrayType  *so_ary;
+ArrayType  *eo_ary;

"so_ary" isn't awful, but it invites confusion with regex's "so"
field, which hasn't got the same semantics (off by one).
"eo_ary" is pretty bad because it isn't an "end offset" at all, but
a length.  I'd go for "start_ary" and "length_ary" or some such.

* Test cases seem a bit thin, notably there's no coverage of the
null-subexpression code path.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-09-12 Thread Joel Jacobson
On Thu, Sep 2, 2021, at 00:03, Daniel Gustafsson wrote:
> I can see value in a function like this one, and the API is AFAICT fairly
> aligned with what I as a user would expect it to be given what we already 
> have.

Good to hear and thanks for looking at this patch.

I've fixed the problem due to the recent change in setup_regexp_matches(),
which added a new int parameter "start_search".
I pass 0 as start_search, which I think should give the same behaviour as 
before.

I also changed the assigned oid values in pg_proc.dat for the two new 
regexp_positions() catalog functions.

$ make check

===
All 209 tests passed.
===

/Joel





0005-regexp-positions.patch
Description: Binary data


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-09-01 Thread Daniel Gustafsson
> On 1 Sep 2021, at 16:02, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 9 Mar 2021, at 20:30, Joel Jacobson  wrote:
>>> Attached is a patch implementing it this way.
> 
>> This patch no longer applies, can you please submit a rebased version?

On a brief skim, this patch includes the doc stanza for regexp_replace which I
assume is a copy/pasteo.

+   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starts”,
While “start_positions” is awfully verbose, just “starts” doesn’t really roll
off the tongue.  Perhaps “positions” would be more explanatory?

> Also, since 642433707 ("This patch adds new functions regexp_count(),
> regexp_instr(), regexp_like(), and regexp_substr(), and extends
> regexp_replace() with some new optional arguments") is already in,
> we need to think about how this interacts with that.  Do we even
> still need any more functionality in this area?  Should we try to
> align the APIs?

I can see value in a function like this one, and the API is AFAICT fairly
aligned with what I as a user would expect it to be given what we already have.

--
Daniel Gustafsson   https://vmware.com/





Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-09-01 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 9 Mar 2021, at 20:30, Joel Jacobson  wrote:
>> Attached is a patch implementing it this way.

> This patch no longer applies, can you please submit a rebased version?

Also, since 642433707 ("This patch adds new functions regexp_count(),
regexp_instr(), regexp_like(), and regexp_substr(), and extends
regexp_replace() with some new optional arguments") is already in,
we need to think about how this interacts with that.  Do we even
still need any more functionality in this area?  Should we try to
align the APIs?

Those new function APIs have some Oracle-isms that I don't especially
care for, like use of int for what should be a boolean.  Still, users
aren't going to give us a pass for wildly inconsistent APIs just because
some functions came from Oracle and some didn't.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-09-01 Thread Daniel Gustafsson
> On 9 Mar 2021, at 20:30, Joel Jacobson  wrote:

> Attached is a patch implementing it this way.

This patch no longer applies, can you please submit a rebased version?

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:
> "Joel Jacobson"  writes:
> > Tom - can you please give details on your unpleasant experiences with 
> > parallel arrays?
> 
> The problems I can recall running into were basically down to not having
> an easy way to iterate through parallel arrays.  There are ways to do
> that in SQL, certainly, but they all constrain how you write the query,
> and usually force ugly stuff like splitting it into sub-selects.

I see now what you mean, many thanks for explaining.

> 
> As an example, presuming that regexp_positions is defined along the
> lines of
> 
> regexp_positions(str text, pat text, out starts int[], out lengths int[])
> returns setof record

+1

I think this is the most feasible best option so far.

Attached is a patch implementing it this way.

I changed the start to begin at 1, since this is how position ( substring text 
IN string text ) → integer works.

SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
starts | lengths
+-
{1}| {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
starts | lengths
+-
{4}| {3}
{12}   | {3}
(2 rows)

Mark's examples:

SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
starts | lengths
+-
{7}| {0}
(1 row)


SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
starts | lengths
+-
{15}   | {0}
(1 row)

I've also tested your template queries:

> 
> then to actually get the identified substrings you'd have to do something
> like
> 
> select
>   substring([input string] from starts[i] for lengths[i])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> generate_series(1, array_length(starts, 1)) i;

select
  substring('foobarbequebaz' from starts[i] for lengths[i])
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
generate_series(1, array_length(starts, 1)) i;

substring
---
bar
baz
(2 rows)

> I think the last time I confronted this, we didn't have multi-array
> UNNEST.  Now that we do, we can get rid of the generate_series(),
> but it's still not beautiful:
> 
> select
>   substring([input string] from s for l)
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> unnest(starts, lengths) u(s,l);

select
  substring('foobarbequebaz' from s for l)
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
unnest(starts, lengths) u(s,l);

substring
---
bar
baz
(2 rows)

> Having said that, the other alternative with a 2-D array:
> 
> regexp_positions(str text, pat text) returns setof int[]
> 
> seems to still need UNNEST, though now it's not the magic multi-array
> UNNEST but this slicing version:
> 
> select
>   substring([input string] from u[1] for u[2])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> unnest_slice(r, 1) u;

Unable to test this one since there is no unnest_slice() (yet)

> 
> Anyway, I'd counsel trying to write out SQL implementations
> of regexp_matches() and other useful things based on any
> particular regexp_positions() API you might be thinking about.
> Can we do anything useful without a LATERAL UNNEST thingie?
> Are some of them more legible than others?

Hmm, I cannot think of a way.


/Joel

0004-regexp-positions.patch
Description: Binary data


Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Tom Lane
"Joel Jacobson"  writes:
> Tom - can you please give details on your unpleasant experiences with 
> parallel arrays?

The problems I can recall running into were basically down to not having
an easy way to iterate through parallel arrays.  There are ways to do
that in SQL, certainly, but they all constrain how you write the query,
and usually force ugly stuff like splitting it into sub-selects.

As an example, presuming that regexp_positions is defined along the
lines of

regexp_positions(str text, pat text, out starts int[], out lengths int[])
returns setof record

then to actually get the identified substrings you'd have to do something
like

select
  substring([input string] from starts[i] for lengths[i])
from
  regexp_positions([input string], [pattern]) r,
  lateral
generate_series(1, array_length(starts, 1)) i;

I think the last time I confronted this, we didn't have multi-array
UNNEST.  Now that we do, we can get rid of the generate_series(),
but it's still not beautiful:

select
  substring([input string] from s for l)
from
  regexp_positions([input string], [pattern]) r,
  lateral
unnest(starts, lengths) u(s,l);

Having said that, the other alternative with a 2-D array:

regexp_positions(str text, pat text) returns setof int[]

seems to still need UNNEST, though now it's not the magic multi-array
UNNEST but this slicing version:

select
  substring([input string] from u[1] for u[2])
from
  regexp_positions([input string], [pattern]) r,
  lateral
unnest_slice(r, 1) u;

Anyway, I'd counsel trying to write out SQL implementations
of regexp_matches() and other useful things based on any
particular regexp_positions() API you might be thinking about.
Can we do anything useful without a LATERAL UNNEST thingie?
Are some of them more legible than others?

regards, tom lane




Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Tom Lane
"Joel Jacobson"  writes:
> On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:
>> you can do unnest(array1, array2, ...)

> Right, I had forgotten about that variant.
> But isn't this a bit surprising then:
> ...
> Should there be an entry there showing the VARIADIC anyelement version as 
> well?

No, because there's no such pg_proc entry.  Multi-argument UNNEST is
special-cased by the parser, cf transformRangeFunction().

(Which is something I'd momentarily forgotten.  Forget my suggestion
that we could define unnest(anyarray, int) ... it has to be another
name.)

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 13:16, Pavel Stehule wrote:
> út 9. 3. 2021 v 11:32 odesílatel Joel Jacobson  napsal:
>> __On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
>>> My experience with working with parallel arrays in SQL has been unpleasant.
>> 
>> Could you please give an example on such an unpleasant experience?
> 
> it was more complex application with 3D data of some points in 2D array. 
> Everywhere was a[d, 0], a[d, 1], a[d, 2], instead a[d] or instead a[d].x, ...

Not sure I understand, but my question was directed to Tom, who wrote about his 
experiences in a previous message up thread.

Tom - can you please give details on your unpleasant experiences with parallel 
arrays?

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Pavel Stehule
út 9. 3. 2021 v 11:32 odesílatel Joel Jacobson  napsal:

> On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
>
> My experience with working with parallel arrays in SQL has been unpleasant.
>
>
> Could you please give an example on such an unpleasant experience?
>

it was more complex application with 3D data of some points in 2D array.
Everywhere was a[d, 0], a[d, 1], a[d, 2], instead a[d] or instead a[d].x,
...



> I can see a problem if the arrays could possibly have difference
> dimensionality/cardinality,
> but regexp_positions() could guarantee they won't, so I don't see a
> problem here,
> but there is probably something I'm missing here?
>

I think so the functions based on arrays can work, why not. But the
semantic is lost.


> /Joel
>


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
> My experience with working with parallel arrays in SQL has been unpleasant.

Could you please give an example on such an unpleasant experience?

I can see a problem if the arrays could possibly have difference 
dimensionality/cardinality,
but regexp_positions() could guarantee they won't, so I don't see a problem 
here,
but there is probably something I'm missing here?

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:
> What do you mean?
>> More than one unnest() in the same query, e.g. SELECT unnest(..), unnest(..)?
> 
> you can do unnest(array1, array2, ...)

Right, I had forgotten about that variant.

But isn't this a bit surprising then:

\df unnest
List of functions
   Schema   |  Name  | Result data type |   
Argument data types| Type
++--+--+--
pg_catalog | unnest | SETOF anyelement | anyarray   
  | func
pg_catalog | unnest | SETOF record | tsvector tsvector, OUT lexeme text, 
OUT positions smallint[], OUT weights text[] | func
(2 rows)

Should there be an entry there showing the VARIADIC anyelement version as well?

I know it's a documented feature, but \df seems out-of-sync with the docs.

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Pavel Stehule
út 9. 3. 2021 v 10:01 odesílatel Joel Jacobson  napsal:

> On Tue, Mar 9, 2021, at 09:29, Pavel Stehule wrote:
>
>
>
> https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY
>
> but the SLICE argument should be constant. But this limit is artificial,
> just for implementation simplicity.  Important is behaviour.
>
>
> I see now what you mean. Yes, being able to specify the SLICE argument as
> a variable instead of a constant would be a good improvement. Maybe the
> SLICE implementation from PL/pgSQL could be modified and used for both
> cases? (Both in the C-version unnest() and in PL/pgSQL to allow variables
> and not just constants to SLICE)
>

probably


>
>
>
> 2. use unnest_slice as name - the function "unnest" is relatively rich
> today and using other overloading doesn't look too practical.
>
>
> Hm, rich in what way? There is currently only one version for arrays, and
> a different one for tsvector.
>
>
> no, there is possible to unnest more arrays once
>
>
> What do you mean?
> More than one unnest() in the same query, e.g. SELECT unnest(..),
> unnest(..)?
>

you can do unnest(array1, array2, ...)


> I cannot find any function in Postgres that returns a 2D array now.
>
> For me - using 2D arrays is not a win. It is not a bad solution, but I
> cannot say, so I like it, because it is not a good solution. For example,
> you cannot enhance this functionality about returning searched substring.
> So you need to repeat searching. I have bad experience with using arrays in
> this style. Sometimes it is necessary, because external interfaces cannot
> work with composites, but the result is unreadable. So this is the reason
> for my opinion.
>
>
> Even if it would return arrays of a range record with "start" and "stop"
> field, I don't see how we could enhance it to later return searched
> substring without changing the return type? Doing so would break any code
> using the function anyway.
>

you can have composite (position int, value text) or (position int,
offset_bytes int, size_char int, size_bytes int), ... just there are more
possibilities


> Repeating searching if you want something else than positions, seems like
> the most SQL-idiomatic solution.
>

:)

yes, but usually you can use index on bigger data. Substring searching is
slower, and we use mostly UTF8, so if start is not in bytes, then you have
to iterate from start of string to find start of substring.



> /Joel
>
>


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 09:29, Pavel Stehule wrote:
> 
> https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY
> 
> but the SLICE argument should be constant. But this limit is artificial, just 
> for implementation simplicity.  Important is behaviour.

I see now what you mean. Yes, being able to specify the SLICE argument as a 
variable instead of a constant would be a good improvement. Maybe the SLICE 
implementation from PL/pgSQL could be modified and used for both cases? (Both 
in the C-version unnest() and in PL/pgSQL to allow variables and not just 
constants to SLICE)

> 
>> 
>> 
>>> 2. use unnest_slice as name - the function "unnest" is relatively rich 
>>> today and using other overloading doesn't look too practical.
>> 
>> Hm, rich in what way? There is currently only one version for arrays, and a 
>> different one for tsvector.
> 
> no, there is possible to unnest more arrays once

What do you mean?
More than one unnest() in the same query, e.g. SELECT unnest(..), unnest(..)?

> I cannot find any function in Postgres that returns a 2D array now.
> 
> For me - using 2D arrays is not a win. It is not a bad solution, but I cannot 
> say, so I like it, because it is not a good solution. For example, you cannot 
> enhance this functionality about returning searched substring. So you need to 
> repeat searching. I have bad experience with using arrays in this style. 
> Sometimes it is necessary, because external interfaces cannot work with 
> composites, but the result is unreadable. So this is the reason for my 
> opinion.

Even if it would return arrays of a range record with "start" and "stop" field, 
I don't see how we could enhance it to later return searched substring without 
changing the return type? Doing so would break any code using the function 
anyway.

Repeating searching if you want something else than positions, seems like the 
most SQL-idiomatic solution.

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Pavel Stehule
út 9. 3. 2021 v 9:01 odesílatel Joel Jacobson  napsal:

> On Tue, Mar 9, 2021, at 08:26, Pavel Stehule wrote:
>
> there are two ideas:
>
> 1. the behaviour can be same like SLICE clause of FOREACH statement
>
>
> Hm, I'm sorry I don't understand, is there an existing SLICE clause?
> I get syntax error in HEAD:
>
> ERROR:  syntax error at or near "$2"
> LINE 5:   FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be consta...
>
> Or do you mean you suggest adding such a clause?
>

https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY

but the SLICE argument should be constant. But this limit is artificial,
just for implementation simplicity.  Important is behaviour.


> 2. use unnest_slice as name - the function "unnest" is relatively rich
> today and using other overloading doesn't look too practical.
>
>
> Hm, rich in what way? There is currently only one version for arrays, and
> a different one for tsvector.
>

no, there is possible to unnest more arrays once


> But this is just an idea. I can imagine more forms of slicing or
> unnesting, so it can be practical to use different names than just "unnest".
>
> Personally I don't like too much using 2D arrays for this purpose. The
> queries over this functionality will be harder to read (it is like fortran
> 77). I understand so now, there is no other possibility, because pg cannot
> build array type from function signature. So it is harder to build an array
> of record types.
>
> We can make an easy tuple store of records - like FUNCTION fx(OUT a int,
> OUT b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work,
> the simple expression evaluation is significantly  faster than SQL
> evaluation. So using any SRF function has performance impact. What I miss
> is the possibility to write functions like FUNCTION fx(OUT a int, OUT b
> int) RETURNS ARRAY. With this possibility is easy to write functions that
> you need, and is not necessary to use 2d arrays. If the result of regexp
> functions will be arrays of records, then a new unnest function is not
> necessary. So this is not a good direction. Instead of fixing core issues,
> we design workarounds. There can be more wide usage of arrays of composites.
>
>
> Hm, I struggle to understand what your point is.
> 2D arrays already exist, and when having to deal with them, I think
> unnest(anyarray,int) would improve the situation.
>

I cannot find any function in Postgres that returns a 2D array now.

For me - using 2D arrays is not a win. It is not a bad solution, but I
cannot say, so I like it, because it is not a good solution. For example,
you cannot enhance this functionality about returning searched substring.
So you need to repeat searching. I have bad experience with using arrays in
this style. Sometimes it is necessary, because external interfaces cannot
work with composites, but the result is unreadable. So this is the reason
for my opinion.

about unnest_2d .. probably it can be used for some cases when users cannot
use composites on the client side. But now, because they can use FOREACH
SLICE is not problem to write any custom function like they exactly need.
And in this case there is very low overhead of plpgsql. But it is true, so
this function can be used for some vector unnesting.

Now, there might be other situations like you describe where something else
> than 2D arrays are preferred.
> But this doesn't change the fact you sometimes have to deal with 2D
> arrays, in which case the proposed unnest(anyarray,int) would improve the
> user-experience a lot, when wanting to unnest just one level (or N levels).
>

> Sounds like you are suggesting some other improvements, in addition to the
> proposed unnest(anyarray,int)? Correct?
>
> A regexp_positions() returning setof 2-D array[] would not be a
> workaround, in my opinion,
> it would be what I actually want, but only if I also get
> unnest(anyarray,int), then I'm perfectly happy.
>

We are talking about design, not about usage. try to write some examples of
usage, please?

Regards

Pavel


> /Joel
>


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 08:26, Pavel Stehule wrote:
> there are two ideas:
> 
> 1. the behaviour can be same like SLICE clause of FOREACH statement

Hm, I'm sorry I don't understand, is there an existing SLICE clause?
I get syntax error in HEAD:

ERROR:  syntax error at or near "$2"
LINE 5:   FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be consta...

Or do you mean you suggest adding such a clause?

> 2. use unnest_slice as name - the function "unnest" is relatively rich today 
> and using other overloading doesn't look too practical.

Hm, rich in what way? There is currently only one version for arrays, and a 
different one for tsvector.

> But this is just an idea. I can imagine more forms of slicing or unnesting, 
> so it can be practical to use different names than just "unnest".
> 
> Personally I don't like too much using 2D arrays for this purpose. The 
> queries over this functionality will be harder to read (it is like fortran 
> 77). I understand so now, there is no other possibility, because pg cannot 
> build array type from function signature. So it is harder to build an array 
> of record types. 
> 
> We can make an easy tuple store of records - like FUNCTION fx(OUT a int, OUT 
> b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work, the 
> simple expression evaluation is significantly  faster than SQL evaluation. So 
> using any SRF function has performance impact. What I miss is the possibility 
> to write functions like FUNCTION fx(OUT a int, OUT b int) RETURNS ARRAY. With 
> this possibility is easy to write functions that you need, and is not 
> necessary to use 2d arrays. If the result of regexp functions will be arrays 
> of records, then a new unnest function is not necessary. So this is not a 
> good direction. Instead of fixing core issues, we design workarounds. There 
> can be more wide usage of arrays of composites.

Hm, I struggle to understand what your point is.
2D arrays already exist, and when having to deal with them, I think 
unnest(anyarray,int) would improve the situation.
Now, there might be other situations like you describe where something else 
than 2D arrays are preferred.
But this doesn't change the fact you sometimes have to deal with 2D arrays, in 
which case the proposed unnest(anyarray,int) would improve the user-experience 
a lot, when wanting to unnest just one level (or N levels).

Sounds like you are suggesting some other improvements, in addition to the 
proposed unnest(anyarray,int)? Correct?

A regexp_positions() returning setof 2-D array[] would not be a workaround, in 
my opinion,
it would be what I actually want, but only if I also get unnest(anyarray,int), 
then I'm perfectly happy.

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Pavel Stehule
út 9. 3. 2021 v 7:57 odesílatel Joel Jacobson  napsal:

> On Mon, Mar 8, 2021, at 21:46, Pavel Stehule wrote:
>
> so what about?
>
> CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
> RETURNS SETOF anyarray AS $$
> DECLARE r $1%type;
> BEGIN
>   FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
>   LOOP
> RETURN NEXT r;
>   END LOOP;
> END;
> $$ LANGUAGE plpgsql;
>
>
> Not sure I understand. Is the suggestion to add "SLICE" as syntactic sugar
> in PL/pgSQL to invoke the proposed two-argument C-version of unnest()?
>

there are two ideas:

1. the behaviour can be same like SLICE clause of FOREACH statement

2. use unnest_slice as name - the function "unnest" is relatively rich
today and using other overloading doesn't look too practical. But this is
just an idea. I can imagine more forms of slicing or unnesting, so it can
be practical to use different names than just "unnest".

Personally I don't like too much using 2D arrays for this purpose. The
queries over this functionality will be harder to read (it is like fortran
77). I understand so now, there is no other possibility, because pg cannot
build array type from function signature. So it is harder to build an array
of record types.

We can make an easy tuple store of records - like FUNCTION fx(OUT a int,
OUT b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work,
the simple expression evaluation is significantly  faster than SQL
evaluation. So using any SRF function has performance impact. What I miss
is the possibility to write functions like FUNCTION fx(OUT a int, OUT b
int) RETURNS ARRAY. With this possibility is easy to write functions that
you need, and is not necessary to use 2d arrays. If the result of regexp
functions will be arrays of records, then a new unnest function is not
necessary. So this is not a good direction. Instead of fixing core issues,
we design workarounds. There can be more wide usage of arrays of composites.

Regards

Pavel






> /Joel
>
>
>


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Joel Jacobson
On Mon, Mar 8, 2021, at 21:46, Pavel Stehule wrote:
> so what about?
> 
> CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
> RETURNS SETOF anyarray AS $$
> DECLARE r $1%type;
> BEGIN
>   FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
>   LOOP
> RETURN NEXT r;
>   END LOOP;
> END;
> $$ LANGUAGE plpgsql;

Not sure I understand. Is the suggestion to add "SLICE" as syntactic sugar in 
PL/pgSQL to invoke the proposed two-argument C-version of unnest()?

/Joel



Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Pavel Stehule
po 8. 3. 2021 v 21:12 odesílatel Tom Lane  napsal:

> "Joel Jacobson"  writes:
> > If I understand it correctly, we don't need to run genbki.pl to compile
> PostgreSQL,
> > so someone wanting to compile PostgreSQL without having a running
> PostgreSQL-instance
> > could do so without problems.
> > A dependency on having a PostgreSQL instance running,
> > is perhaps acceptable for hackers developing PostgreSQL?
> > But of course not for normal users just wanting to compile PostgreSQL.
> > If we think there is at least a 1% chance this is a feasible idea,
> > I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.
>
> No, I think this is a non-starter.  Bootstrapping from just the
> contents of the git repo is something developers do all the time
> (and indeed the buildfarm does it in every run).  We do not want to
> need a running PG instance in advance of doing that.
>
> Yeah, we could make it work if we started treating all the genbki
> output files as things to include in the git repo, but I don't think
> anybody wants to go there.
>
> I understand some folks' distaste for Perl, and indeed I don't like it
> that much myself.  If we were starting over from scratch I'm sure
> we'd choose a different language for our build/test infrastructure.
> But that's where we are, and I would not be in favor of having more
> than one scripting language as build requirements.  So Perl is going
> to be it unless somebody gets ambitious enough to replace all the Perl
> scripts at once, which seems unlikely to happen.
>
> > I agree, I like the 2-D array version, but only if a we could provide a
> C-function
> > to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are
> there
> > reasons why it cannot be done easily? I could give it a try, if we think
> it's a good idea.
>
> +1, I think this need has come up before.  My guess is that the
> hardest part of that will be choosing a function name that will
> satisfy everybody ;-).
>
> Could there be any value in allowing unnesting a variable number
> of levels?  If so, we could dodge the naming issue by inventing
> "unnest(anyarray, int) returns anyarray" where the second argument
> specifies the number of subscript levels to remove, or perhaps
> the number to keep.
>

so what about?

CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
RETURNS SETOF anyarray AS $$
DECLARE r $1%type;
BEGIN
  FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
  LOOP
RETURN NEXT r;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

Regards

Pavel

regards, tom lane
>
>
>


Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Tom Lane
"Joel Jacobson"  writes:
> If I understand it correctly, we don't need to run genbki.pl to compile 
> PostgreSQL,
> so someone wanting to compile PostgreSQL without having a running 
> PostgreSQL-instance
> could do so without problems.
> A dependency on having a PostgreSQL instance running,
> is perhaps acceptable for hackers developing PostgreSQL?
> But of course not for normal users just wanting to compile PostgreSQL.
> If we think there is at least a 1% chance this is a feasible idea,
> I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.

No, I think this is a non-starter.  Bootstrapping from just the
contents of the git repo is something developers do all the time
(and indeed the buildfarm does it in every run).  We do not want to
need a running PG instance in advance of doing that.

Yeah, we could make it work if we started treating all the genbki
output files as things to include in the git repo, but I don't think
anybody wants to go there.

I understand some folks' distaste for Perl, and indeed I don't like it
that much myself.  If we were starting over from scratch I'm sure
we'd choose a different language for our build/test infrastructure.
But that's where we are, and I would not be in favor of having more
than one scripting language as build requirements.  So Perl is going
to be it unless somebody gets ambitious enough to replace all the Perl
scripts at once, which seems unlikely to happen.

> I agree, I like the 2-D array version, but only if a we could provide a 
> C-function
> to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are there
> reasons why it cannot be done easily? I could give it a try, if we think it's 
> a good idea.

+1, I think this need has come up before.  My guess is that the
hardest part of that will be choosing a function name that will
satisfy everybody ;-).

Could there be any value in allowing unnesting a variable number
of levels?  If so, we could dodge the naming issue by inventing
"unnest(anyarray, int) returns anyarray" where the second argument
specifies the number of subscript levels to remove, or perhaps
the number to keep.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Joel Jacobson
On Mon, Mar 8, 2021, at 19:46, Joel Jacobson wrote:
> However, for certain tasks, when a high-level language is preferred,
> and when the raw performance of C isn't necessary, then maybe SQL/PLpgSQL
> could be a serious alternative to Perl? 

Before we had jsonb, this would have been totally unrealistic.

But with jsonb, I think we actually have complete coverage of Perl's data types:

Perl array <=> jsonb array
Perl hash <=> jsonb object
Perl scalar <=> jsonb string/boolean/number

I've been using jsonb with great success for code generation.
ASTs are nicely represented as nested jsonb arrays.

/Joel



Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Joel Jacobson
On Mon, Mar 8, 2021, at 18:30, Tom Lane wrote:
> FWIW, I personally think that returning a start position and a length
> would be the most understandable way to operate. 

Very good point. I agree. (And then ranges cannot be used, regardless of 
canonical form.)

> Yeah: it's hard.  The amount of catalog infrastructure needed by a
> composite type is dauntingly large, and genbki.pl doesn't offer any
> support for building composite types that aren't tied to catalogs.
> (I suppose if you don't mind hacking Perl, you could try to refactor
> it to improve that.) 

I haven't studied genbki.pl in detail, but seen its name on the list many times,
maybe I should go through it to understand it in detail.

On the topic of Perl.
I've written a lot of Perl code over the years.
Trustly was initially a Perl+PostgreSQL microservice project, with different 
components
written in Perl run as daemons, communicating with each other over TCP/IP,
via JSON-RPC. We had lots of strange problems difficult to debug.
In the end, we moved all the business logics from Perl into database functions 
in PostgreSQL,
and all problems went away. The biggest win was the nice UTF-8 support,
which was really awkward in Perl. It's kind of UTF-8, but not really and not 
always.

Most programming languages/compilers are obsessed
with the concept of "bootstrapping"/"dogfooding".

Thinking of PostgreSQL as a language/compiler, that would mean we should be 
obsessed with the idea
of implementing PostgreSQL in SQL or PL/pgSQL. That would be quite a challenge 
of course.

However, for certain tasks, when a high-level language is preferred,
and when the raw performance of C isn't necessary, then maybe SQL/PLpgSQL
could be a serious alternative to Perl? 

If I understand it correctly, we don't need to run genbki.pl to compile 
PostgreSQL,
so someone wanting to compile PostgreSQL without having a running 
PostgreSQL-instance
could do so without problems.

A dependency on having a PostgreSQL instance running,
is perhaps acceptable for hackers developing PostgreSQL?
But of course not for normal users just wanting to compile PostgreSQL.

If we think there is at least a 1% chance this is a feasible idea,
I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.
Would be a fun hack, but not if it's guaranteed time-waste.

>  It seems like you need it to return setof array(s), so the choices are
> array of composite, 2-D array, or two parallel arrays.  I'm not sure
> the first of those is so much better than the others that it's worth
> the pain involved to set up the initial catalog data that way.

I agree, I like the 2-D array version, but only if a we could provide a 
C-function
to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are there
reasons why it cannot be done easily? I could give it a try, if we think it's a 
good idea.

> 
> BTW, I don't know if you know the history here, but regexp_matches()
> is way older than regexp_match(); we eventually invented the latter
> because the former was just too hard to use for easy non-'g' cases.
> I'm inclined to think we should learn from that and provide equivalent
> variants regexp_position[s] right off the bat.

I remember! regexp_match() was a very welcomed addition.
I agree both regexp_position[s] variants would be good for same reasons.

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Mark Dilger



> On Mar 8, 2021, at 9:20 AM, Joel Jacobson  wrote:
> 
> On Mon, Mar 8, 2021, at 18:11, Mark Dilger wrote:
>> > On Mar 8, 2021, at 9:05 AM, Joel Jacobson  wrote:
>> > 
>> > If a N+1 dimension array could easily be unnested to a N dimension array,
>> > I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and 
>> > not controversial.
>> 
>> How about proposing some array functions to go along with the 
>> regexp_positions, and then do it that way?
> 
> Sounds like a nice solution. That would be a huge win when dealing with 
> multidimensional arrays in general.
> 
> Do we have strong support on the list for such a function? If so, I can make 
> an attempt implementing it, unless some more experienced hacker wants to do 
> it.

That's a hard question to answer in advance.  Typically, you need to propose a 
solution, and then get feedback.  You wouldn't need to post a patch, but 
perhaps some examples of how you would expect it to work, like

+SELECT slice('{{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}}'::integer[][], 
'[2,4)'::int4range);
+   slice   
+---
+ {{2,3}}
+ {{6,7}}
+ {{10,11}}
+ {{14,15}}
+(4 rows)
+
+SELECT 
slice('{{{1,2,3},{4,5,6},{7,8,9}},{{10,11,12},{13,14,15},{16,17,18}}}'::integer[][][],
 '[2,4)'::int4range);
+   slice   
+---
+ {{{4,5,6},{7,8,9}}}
+ {{{13,14,15},{16,17,18}}}
+(2 rows)
+

and then people can tell you why they hate that choice of interface.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Chapman Flack
On 03/08/21 13:29, Chapman Flack wrote:
> I think the s-free version is exactly the regexp_instr included in
> the other concurrent proposal [1]

sorry.

[1]
https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Chapman Flack
On 03/08/21 12:30, Tom Lane wrote:
> I'm inclined to think we should learn from that and provide equivalent
> variants regexp_position[s] right off the bat.

I think the s-free version is exactly the regexp_instr included in
the other concurrent proposal [1], which closely corresponds to the
ISO position_regex() except for the ISO one using XQuery regex syntax.

I gather from [1] that the name regexp_instr is chosen in solidarity
with other DBMSs that de facto have it. Would it be weirder to have the
singular form be regexp_instr and the plural be regexp_positions?
Or to diverge from the other systems' de facto convention and name
the singular form regexp_position? (Or the plural form regexp_instrs?
That sounds to me like a disassembler for regexps. Or regexps_instr,
like attorneys general? Never mind.)

Regards,
-Chap




Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Tom Lane
"Joel Jacobson"  writes:
> I prefer to think of a match as two points. If the points are at the same 
> position, it's a zero length match.

FWIW, I personally think that returning a start position and a length
would be the most understandable way to operate.  If you report start
position and end position then there is always going to be confusion
over whether the end position is inclusive or exclusive (that is,
some code including our regex library thinks of the "end" as being
"first character after the match").  This is indeed the same
definitional issue you're contending with vis-a-vis range endpoints,
only now you lack any pre-existing definition that people might rely on
to know what you meant.

> Since there are currently zero composite type returning catalog functions, I 
> can see why the idea of returning a "range" with two "start" and "stop" 
> fields is controversial. There are probably good reasons that I fail to see 
> why there are no composite type returning functions in the catalogs. Ideas on 
> why this is the case, anyone?

Yeah: it's hard.  The amount of catalog infrastructure needed by a
composite type is dauntingly large, and genbki.pl doesn't offer any
support for building composite types that aren't tied to catalogs.
(I suppose if you don't mind hacking Perl, you could try to refactor
it to improve that.)  Up to now we've avoided the need for that,
since a function can be declared to return an anonymous record type
by giving it some OUT parameters.  However, if I'm understanding
things correctly "regexp_positions(IN ..., OUT match_start integer,
OUT match_length integer) RETURNS SETOF record" wouldn't be enough
for you, because you really need a 2-D tableau of match data to
handle the case of multiple capturing parens plus 'g' mode.  It
seems like you need it to return setof array(s), so the choices are
array of composite, 2-D array, or two parallel arrays.  I'm not sure
the first of those is so much better than the others that it's worth
the pain involved to set up the initial catalog data that way.

BTW, I don't know if you know the history here, but regexp_matches()
is way older than regexp_match(); we eventually invented the latter
because the former was just too hard to use for easy non-'g' cases.
I'm inclined to think we should learn from that and provide equivalent
variants regexp_position[s] right off the bat.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Joel Jacobson
On Mon, Mar 8, 2021, at 18:11, Mark Dilger wrote:
> > On Mar 8, 2021, at 9:05 AM, Joel Jacobson  wrote:
> > 
> > If a N+1 dimension array could easily be unnested to a N dimension array,
> > I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and 
> > not controversial.
> 
> How about proposing some array functions to go along with the 
> regexp_positions, and then do it that way?

Sounds like a nice solution. That would be a huge win when dealing with 
multidimensional arrays in general.

Do we have strong support on the list for such a function? If so, I can make an 
attempt implementing it, unless some more experienced hacker wants to do it.

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Mark Dilger



> On Mar 8, 2021, at 9:05 AM, Joel Jacobson  wrote:
> 
> If a N+1 dimension array could easily be unnested to a N dimension array,
> I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and 
> not controversial.

How about proposing some array functions to go along with the regexp_positions, 
and then do it that way?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Joel Jacobson
On Mon, Mar 8, 2021, at 17:20, Mark Dilger wrote:
> > On Mar 5, 2021, at 11:46 AM, Joel Jacobson  wrote:
> > <0003-regexp-positions.patch>
> 
> I did a bit more testing:
> 
> +SELECT regexp_positions('foobarbequebaz', 'b', 'g');
> + regexp_positions 
> +--
> + {"[3,5)"}
> + {"[6,8)"}
> + {"[11,13)"}
> +(3 rows)
> +
> 
> I understand that these ranges are intended to be read as one character long 
> matches starting at positions 3, 6, and 11, but they look like they match 
> either two or three characters, depending on how you read them, and users 
> will likely be confused by that.
> 
> +SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
> + regexp_positions 
> +--
> + {"[6,7)"}
> +(1 row)
> +
> 
> This is a zero length match.  As above, it might be confusing that a zero 
> length match reads this way.
> 
> +SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
> + regexp_positions 
> +--
> + {"[14,15)"}
> +(1 row)
> +
> 
> Same here, except this time position 15 is referenced, which is beyond the 
> end of the string.
> 
> I think a zero length match at the end of this string should read as 
> {"[14,14)"}, and you have been forced to add one to avoid that collapsing 
> down to "empty", but I'd rather you found a different datatype rather than 
> abuse the definition of int4range.
> 
> It seems that you may have reached a similar conclusion down-thread?

This is due to the, in my opinion, unfortunate decision of using 
inclusive/exclusive as the canonical form for discrete types.
Probably not much we can do about that, but that's what we have, so I think 
it's fine.

[6,7) is exactly the same thing as [6,6] for discrete types, it simply means 
the startpos and endpos both are 6.

I prefer to think of a match as two points. If the points are at the same 
position, it's a zero length match.

In the example, the startpos and endpos are both at 6, so it's a zero length 
match.,

This was very confusing to me at first. I wrongly thought I needed an empty 
int4range and had the perverse idea of hacking the range type to allow setting 
lower and upper even though it was empty. This was a really really bad idea 
which I feel stupid of even considering. It was before I understood a zero 
length match should actually *not* be represented as an empty int4range, but as 
an int4range covering exactly one single integer, since startpos=endpos. This 
was due to my off-by-one error. With that fixed, the only problem is the (in my 
opinion) unnatural canonical form for discrete types, since in this context 
it's just silly to talk about inclusive/exclusive. I think inclusive/inclusive 
would have been much more SQL idiomatic, since that's the semantics for BETWEEN 
in SQL, it's inclusive/inclusive. So is most other programming environments 
I've seen.

However, not much we can do about that for int4range/int8range,
but maybe multirange could change the canonical form going forward.

Even if not changed, I think int4range works just fine. It just requires a bit 
more mental effort to understand what the values mean. Probably an annoyance 
for users at first, but I think they easily will understand they should just do 
"-1" on the upper() value (but only if upper_inc() is FALSE, but you know that 
for sure for int4ranges, so it is really necessary, one might wonder).

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not 
controversial.

Since there are currently zero composite type returning catalog functions, I 
can see why the idea of returning a "range" with two "start" and "stop" fields 
is controversial. There are probably good reasons that I fail to see why there 
are no composite type returning functions in the catalogs. Ideas on why this is 
the case, anyone?

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-08 Thread Mark Dilger



> On Mar 5, 2021, at 11:46 AM, Joel Jacobson  wrote:
> 
> 
> /Joel
> <0003-regexp-positions.patch>

I did a bit more testing:

+SELECT regexp_positions('foobarbequebaz', 'b', 'g');
+ regexp_positions 
+--
+ {"[3,5)"}
+ {"[6,8)"}
+ {"[11,13)"}
+(3 rows)
+

I understand that these ranges are intended to be read as one character long 
matches starting at positions 3, 6, and 11, but they look like they match 
either two or three characters, depending on how you read them, and users will 
likely be confused by that.

+SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
+ regexp_positions 
+--
+ {"[6,7)"}
+(1 row)
+

This is a zero length match.  As above, it might be confusing that a zero 
length match reads this way.

+SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
+ regexp_positions 
+--
+ {"[14,15)"}
+(1 row)
+

Same here, except this time position 15 is referenced, which is beyond the end 
of the string.

I think a zero length match at the end of this string should read as 
{"[14,14)"}, and you have been forced to add one to avoid that collapsing down 
to "empty", but I'd rather you found a different datatype rather than abuse the 
definition of int4range.

It seems that you may have reached a similar conclusion down-thread?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Joel Jacobson
On Fri, Mar 5, 2021, at 20:46, Joel Jacobson wrote:
> My conclusion is that we should use setof int4range[] as the return value for 
> regexp_positions().

If acceptable by the project, it be even nicer if we could just return the 
suggested composite type.

I don't see any existing catalog functions returning composite types though?
Is this due to some policy of not wanting composite types as return values for 
built-ins or just a coincidence?

Example on regexp_positions -> setof range[] 
where range is:

CREATE TYPE range AS (start int8, stop int8);

SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
regexp_positions
--
{"(6,6)"}
(1 row)

SELECT r FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
   r
---
{"(3,6)","(6,11)"}
{"(11,16)","(16,20)"}
(2 rows)

SELECT r[1].*, r[2].* FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
start | stop | start | stop
---+--+---+--
 3 |6 | 6 |   11
11 |   16 |16 |   20
(2 rows)

SELECT r[1].* FROM regexp_positions('','^','g') AS r;
start | stop
---+--
 0 |0
(1 row)

Thoughts?

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:
> I like the idea so I did a bit of testing.  I think the following should not 
> error, but does:
> 
> +SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
> +ERROR:  range lower bound must be less than or equal to range upper bound

Doh! How stupid of me. I realize now I had a off-by-one thinko in my 0001 patch 
using int4range.

I didn't use the raw "so" and "eo" values in regexp.c like I should have,
instead, I incorrectly used (so + 1) as the startpos,
and just eo as the endpos.

This is what caused all the problems.

The fix is simple:
-  lower.val = Int32GetDatum(so + 1);
+ lower.val = Int32GetDatum(so);

The example that gave the error now works properly:

SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
regexp_positions
--
{"[6,7)"}
(1 row)

I've also created a SQL PoC of the composite range type idea,
and convenience wrapper functions for int4range and int8range.

CREATE TYPE range AS (start int8, stop int8);

Helper functions:
range(start int8, stop int8) -> range
range(int8range) -> range
range(int4range) -> range
range(int8range[]) -> range[]
range(int4range[]) -> range[]

Demo:

regexp_positions() returns setof int4range[]:

SELECT r FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
   r
---
{"[3,7)","[6,12)"}
{"[11,17)","[16,21)"}
(2 rows)

Convert int4range[] -> range[]:

SELECT range(r) FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
 range
---
{"(3,6)","(6,11)"}
{"(11,16)","(16,20)"}
(2 rows)

"start" and "stop" fields:

SELECT (range(r[1])).* FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
start | stop
---+--
 3 |6
11 |   16
(2 rows)

zero-length match at beginning:

SELECT r FROM regexp_positions('','^','g') AS r;
 r
---
{"[0,1)"}
(1 row)

SELECT (range(r[1])).* FROM regexp_positions('','^','g') AS r;
start | stop
---+--
 0 |0
(1 row)

My conclusion is that we should use setof int4range[] as the return value for 
regexp_positions().

New patch attached.

The composite range type and helper functions are of course not at all 
necessary,
but I think they would be a nice addition, to make it easier to work with ranges
for composite types. I intentionally didn't create anyrange versions of them,
since they can only support composite types,
since they don't require the inclusive/exclusive semantics.

/Joel

range.sql
Description: Binary data


0003-regexp-positions.patch
Description: Binary data


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Andreas Karlsson

On 3/4/21 4:40 PM, Tom Lane wrote:

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}.  My experience
with working with parallel arrays in SQL has been unpleasant.


Hm, I can see your point but on the other hand I can't say my 
experiences working with 2-D arrays have been that pleasant either. The 
main issue being how there is no simple way to unnest just one dimension 
of the array. Maybe it would be worth considering implementing a 
function for that.


As far as I know to unnest just one dimension you would need to use 
generate_series() or something like the query below. Please correct me 
if I am wrong and there is some more ergonomic way to do it.


WITH d (a) AS (SELECT '{{2,3},{4,5}}'::int[])
SELECT array_agg(unnest) FROM d, unnest(a) WITH ORDINALITY GROUP BY 
(ordinality - 1) / array_length(a, 2);


Andreas




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Pavel Stehule
Hi

pá 5. 3. 2021 v 13:44 odesílatel Joel Jacobson  napsal:

> Idea #5:
>
> Allow disabling canonicalization via optional parameter to range
> constructor functions.
>

I think so rules describing ranges and multirages are long enough, so
increasing functionality doesn't look like a practical idea.

I prefere special simple composite type like you described in the previous
email (start, stop) or (start, length). It can be used more times when
using range or multi range is not practical.

The composite types are more natural for this purpose than 2D arrays.

Regards

Pavel


> This would then allow using the range type,
> to create inclusive/inclusive integer ranges,
> where lower() and upper() would return what you expect.
>
> /Joel
>


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Joel Jacobson
Idea #5:

Allow disabling canonicalization via optional parameter to range constructor 
functions.

This would then allow using the range type,
to create inclusive/inclusive integer ranges,
where lower() and upper() would return what you expect.

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-05 Thread Joel Jacobson
On Thu, Mar 4, 2021, at 17:55, Gilles Darold wrote:
> I also think that it should return a setof 2-D integer array, an other 
> solution is to return all start/end positions of an occurrence chained 
> in an integer array {start1,end1,start2,end2,..}.

Hmm. Seems like we've in total managed to come up with three flawed ideas.

Pros/cons I see:

Idea #1: setof 2-D integer array
+ Packs the values into one single value.
- Difficult to work with 2-D arrays, doesn't work well with unnest(), has to 
inspect the dims and use for loops to extract values.
- Looking at a 2-D value, it's not obvious what the integer values means in it 
means. Which one is "startpos" and do we have "length" or "endpos" values?

Idea #2: setof (start_pos integer[], end_pos integer[])
+ It's obvious to the user what type of integers "start_pos" and "end_pos" 
contain.
- Decouples the values into two separate values.
- Tom mentioned some bad experiences with separate array values. (Details on 
this would be interesting.)

Idea #3: chained integer array {start1,end1,start2,end2,..}
- Mixes different values into the same value
- Requires maths (although simple calculations) to extract values

I think all three ideas (including mine) are ugly. None of them is wart free.

Idea #4: add a new composite built-in type.

A simple composite type with two int8 fields.

The field names seems to vary a lot between languages:

Rust: "start", "end" [1]
C++: "begin", "end" [2]
Python: "start", "stop" [3]

Such a simple composite type, could then always be used,
when you want to represent simple integer ranges,
between two exact values, arguably a very common need.

Such type could be converted to/from int8range,
but would have easily accessible field names,
which is simpler than using lower() and upper(),
since upper() always returns the canonical
exclusive upper bound for discrete types,
which is not usually what you want when
dealing with "start" and "end" integer ranges.

Since there is no type named just "range", why not just use this name?

Since "end" is a keyword, I suggest the "stop" name:

PoC:

CREATE TYPE range AS (start int8, stop int8);

A real implementation would of course also verify CHECK (start <= stop),
and would add conversions to/from int8range.

I realise this is probably a controversial idea.
But, I think this is a general common problem that deserves a clean general 
solution.

Thoughts? More ideas?

[1] https://doc.rust-lang.org/std/ops/struct.Range.html
[2] https://en.cppreference.com/w/cpp/ranges
[3] https://www.w3schools.com/python/ref_func_range.asp


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-04 Thread Gilles Darold

Le 04/03/2021 à 16:40, Tom Lane a écrit :

"Joel Jacobson"  writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}.  My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net

Seems like there may be some overlap in these proposals.



The object of regexp_position() is to return all start+end of captured 
substrings, it overlaps a little with regexp_instr() in the way that 
this function returns the start or end position of a specific captured 
substring. I think it is a good idea to have a function that returns all 
positions instead of a single one like regexp_instr(), this is not the 
same usage. Actually regexp_position() is exactly the same as 
regexp_matches() except that it return positions instead of substrings.



I also think that it should return a setof 2-D integer array, an other 
solution is to return all start/end positions of an occurrence chained 
in an integer array {start1,end1,start2,end2,..}.



Regards,

--
Gilles Darold





Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-04 Thread Joel Jacobson
On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
> "Joel Jacobson"  writes:
> > Having abandoned the cute idea that didn't work,
> > here comes a new patch with a regexp_positions() instead returning
> > setof record (start_pos integer[], end_pos integer[]).
> 
> I wonder if a 2-D integer array wouldn't be a better idea,
> ie {{startpos1,length1},{startpos2,length2},...}.  My experience
> with working with parallel arrays in SQL has been unpleasant.

I considered it, but I prefer two separate simple arrays for two reasons:

a) more pedagogic, it's at least then obvious what values are start and end 
positions,
then you only have to understand what the values means.

b) 2-D doesn't arrays don't work well with unnest().
If you would unnest() the 2-D array you couldn't separate the start positions 
from the end positions,
whereas with two separate, you could do:

SELECT unnest(start_pos) AS start_pos, unnest(end_pos) AS end_pos FROM 
regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
---+-
 3 |   6
 6 |  11
11 |  16
16 |  20
(4 rows)

Can give some details on your unpleasant experiences of parallel arrays?


> 
> Also, did you see
> 
> https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net
> 
> 
> Seems like there may be some overlap in these proposals.

Yes, I saw it, it was sent shortly after my proposal, so I couldn't take it 
into account.
Seems useful, except regexp_instr() seems redundant, I would rather have 
regexp_positions(),
but maybe regexp_instr() should also be added for compatibility reasons.

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-04 Thread Chapman Flack
On 03/04/21 10:40, Tom Lane wrote:
> Also, did you see
> 
> https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net
> 
> Seems like there may be some overlap in these proposals.

Not only that, the functions in that other proposal are very similar
to the standard's own functions that are specified to use XML Query
regular expression syntax (sample implementations in [1]).

These differently-named (which is good) functions seem to be a de facto
standard where the regexp syntax and semantics are those native to the
DBMS, the correspondence being

de facto  ISO XQuery-based
  -- --
  regexp_likelike_regex
  regexp_count   occurrences_regex
  regexp_instr   position_regex
  regexp_substr  substring_regex
  regexp_replace translate_regex


The regexp_positions proposal highlights an interesting apparent gap in
both the de facto and the ISO specs: the provided functions allow you
to specify which occurrence you're talking about, and get the corresponding
positions or the corresponding substring, but neither set of functions
includes one to just give you all the matching positions at once as
a SETOF something.

What the proposed regexp_positions() returns is pretty much exactly
the notional "list of match vectors" that appears internally throughout
the specs of the ISO functions, but is never directly exposed.

In the LOMV as described in the standard, the position/length arrays
are indexed from zero, and the start and length at index 0 are those
for the overall match as a whole.

Right now, if you have a query that involves, say,

  substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also
  substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2),

a naïve implementation like [1] will of course compile and evaluate
the regexp twice and return one group each time. It makes me wonder
whether the standards committee was picturing a clever parse analyzer
and planner that would say "aha! you want group 1 and group 2 from
a single evaluation of this regex!", and that might even explain the
curious rule in the standard that the regex must be an actual literal,
not any other expression. (Still, that strikes me as an awkward way to
have to write it, spelling the regex out as a literal, twice.)

It has also made my idly wonder how close we could get to behaving
that way, perhaps with planner support functions and other available
parse analysis/planning hooks. Would any of those mechanisms get a
sufficiently global view of the query to do that kind of rewriting?

Regards,
-Chap


[1]
https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary




Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-04 Thread Tom Lane
"Joel Jacobson"  writes:
> Having abandoned the cute idea that didn't work,
> here comes a new patch with a regexp_positions() instead returning
> setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}.  My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net

Seems like there may be some overlap in these proposals.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-04 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:
> "Joel Jacobson"  writes:
> > Unless fixed, then the way I see it, I don't think we can use int4range[] 
> > for regexp_positions(),
> 
> Yeah.  It's a cute idea, but the semantics aren't quite right.

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

Example:

SELECT * FROM regexp_positions('foobarbequebazilbarfbonk', 
$re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
---+-
{3,6} | {6,11}
{11,16}   | {16,20}
(2 rows)

Based on HEAD (040af779382e8e4797242c49b93a5a8f9b79c370).

I've updated docs and tests.

/Joel

0002-regexp-positions.patch
Description: Binary data


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Mark Dilger



> On Mar 2, 2021, at 5:34 AM, Isaac Morland  wrote:
> 
> Returning to the RE result issue, I wonder how much it actually matters where 
> any empty matches are. Certainly the actual contents of the match don’t 
> matter; you don’t need to be able to index into the string to extract the 
> substring. The only scenario I can see where it could matter is if the RE is 
> using lookahead or look back to find occurrences before or after something 
> else. If we stipulate that the result array will be in order, then you still 
> don’t have the exact location of empty matches but you do at least have where 
> they are relative to non-empty matches.

I agree the contents of the match don't matter, because they are always empty.  
But the position matters.  You could intend to split a string in multiple 
places using lookaheads and lookbehinds to determine the split points.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 15:05, Isaac Morland wrote:
> Suppose the match results are:
> 
> [4,8)
> [10,10)
> [13,16)
> [20,20)
> [24,24)
> 
> Then this gets turned into:
> 
> [4,8)
> empty
> [13,16)
> empty
> empty
> 
> So you know that there are non-empty matches from 4-8 and 13-16, plus an 
> empty match between them and two empty matches at the end. Given that all 
> empty strings are identical, I think it's only in pretty rare circumstances 
> where you need to know exactly where the empty matches are; it would have to 
> be a matter of identifying empty matches immediately before or after a 
> specific pattern; in which case I suspect it would usually be just as easy to 
> match the pattern itself directly.
> 
> Does this help?

Thanks, I see what you mean now.

I agree it's probably a corner-case,
but I think I would still prefer a complete solution by just returning setof 
two integer[] values,
instead of the cuter-but-only-partial solution of using the existing 
int4range[].

Even better would be if we could fix the range type so it could actually be 
used in this and other similar situations.

If so, then we could do:

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)a+','g') AS r;
 r
---
{"[6,9)"}
(1 row)

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
r
-
{empty}
{empty}
(2 rows)

SELECT lower(r[1]), upper(r[1]) FROM 
regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
lower | upper
---+---
 5 | 5
 9 | 9
(2 rows)

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Isaac Morland
On Tue, 2 Mar 2021 at 08:58, Joel Jacobson  wrote:

> If we stipulate that the result array will be in order, then you still
> don’t have the exact location of empty matches but you do at least have
> where they are relative to non-empty matches.
>
>
> This part I didn't fully understand. Can you please provide some example
> on this?
>

Suppose the match results are:

[4,8)
[10,10)
[13,16)
[20,20)
[24,24)

Then this gets turned into:

[4,8)
empty
[13,16)
empty
empty

So you know that there are non-empty matches from 4-8 and 13-16, plus an
empty match between them and two empty matches at the end. Given that all
empty strings are identical, I think it's only in pretty rare circumstances
where you need to know exactly where the empty matches are; it would have
to be a matter of identifying empty matches immediately before or after a
specific pattern; in which case I suspect it would usually be just as easy
to match the pattern itself directly.

Does this help?


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Joel Jacobson
Hi Isaac,

Many thanks for the comments.

On Tue, Mar 2, 2021, at 14:34, Isaac Morland wrote:
> One question I would have is whether empty ranges are all equal to each 
> other. If they are, you have an equality that isn’t really equality; if they 
> aren’t then you would have ranges that are unequal even though they have 
> exactly the same membership. Although I suppose this is already true for some 
> types where ends can be specified as open or closed but end up with the same 
> end element; many range types canonicalize to avoid this but I don’t think 
> they all do.

I thought about this problem too. I don't think there is a perfect solution.
Leaving things as they are is problematic too since it makes the range type 
useless for some use-cases.
I've sent a patch in a separate thread with the least invasive idea I could 
come up with.
 
> Returning to the RE result issue, I wonder how much it actually matters where 
> any empty matches are. Certainly the actual contents of the match don’t 
> matter; you don’t need to be able to index into the string to extract the 
> substring. The only scenario I can see where it could matter is if the RE is 
> using lookahead or look back to find occurrences before or after something 
> else.

Hmm, I think it would be ugly to have corner-cases handled differently than the 
rest.

> If we stipulate that the result array will be in order, then you still don’t 
> have the exact location of empty matches but you do at least have where they 
> are relative to non-empty matches.

This part I didn't fully understand. Can you please provide some example on 
this?

/Joel


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Isaac Morland
On Tue, 2 Mar 2021 at 00:52, Joel Jacobson  wrote:

> Ranges are treated as sets. As such equality is defined by membership.
>
> That being said, I agree that there may be situations in which it would be
> convenient to have empty ranges at specific locations. Doing this would
> introduce numerous questions which would have to be resolved. For example,
> where/when is the empty range resulting from an intersection operation?
>
>
> Hmm, I think I would assume the intersection of two non-overlapping ranges
> to be isempty()=TRUE,
> and for lower() and upper() to continue to return NULL.
>
> But I think a zero-length range created with actual bounds should
> return the lower() and upper() values during creation, instead of NULL.
>
> I tried to find some other programming environments with range types.
>
> The first one I found was Ada.
>

Interesting!

Array indices are a bit different than general ranges however.

One question I would have is whether empty ranges are all equal to each
other. If they are, you have an equality that isn’t really equality; if
they aren’t then you would have ranges that are unequal even though they
have exactly the same membership. Although I suppose this is already true
for some types where ends can be specified as open or closed but end up
with the same end element; many range types canonicalize to avoid this but
I don’t think they all do.

Returning to the RE result issue, I wonder how much it actually matters
where any empty matches are. Certainly the actual contents of the match
don’t matter; you don’t need to be able to index into the string to extract
the substring. The only scenario I can see where it could matter is if the
RE is using lookahead or look back to find occurrences before or after
something else. If we stipulate that the result array will be in order,
then you still don’t have the exact location of empty matches but you do at
least have where they are relative to non-empty matches.


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-02 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:
>"Joel Jacobson"  writes:
>> Unless fixed, then the way I see it, I don't think we can use int4range[] 
>> for regexp_positions(),
>
>Yeah.  It's a cute idea, but the semantics aren't quite right.

I think there is a case to allow creating empty ranges *with* bounds 
information, e.g. '[6,7)'::int4range,
as well as the current only possibility to create empty ranges *without* bounds 
information, e.g. 'empty'::int4range

I've had a look at how ranges are implemented,
and I think I've found a way to support this is a simple non-invasive way.

I've outlined the idea in a patch, which I will send separately,
as it's a different feature, possibly useful for purposes other than 
regexp_positions().

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 06:22, Isaac Morland wrote:
> On Tue, 2 Mar 2021 at 00:06, Joel Jacobson  wrote:
>> I find it strange two ranges of zero-length with different bounds are 
>> considered equal:
>> 
>> SELECT '[7,7)'::int4range = '[8,8)'::int4range;
>> ?column?
>> --
>> t
>> (1 row)
>> 
>> This seems like a bug to me. What am I missing here?
>> 
>> Unless fixed, then the way I see it, I don't think we can use int4range[] 
>> for regexp_positions(),
>> if we want to allow returning the positions for zero-length matches, which 
>> would be nice.
> 
> Ranges are treated as sets. As such equality is defined by membership.
> 
> That being said, I agree that there may be situations in which it would be 
> convenient to have empty ranges at specific locations. Doing this would 
> introduce numerous questions which would have to be resolved. For example, 
> where/when is the empty range resulting from an intersection operation?

Hmm, I think I would assume the intersection of two non-overlapping ranges to 
be isempty()=TRUE,
and for lower() and upper() to continue to return NULL.

But I think a zero-length range created with actual bounds should
return the lower() and upper() values during creation, instead of NULL.

I tried to find some other programming environments with range types.

The first one I found was Ada.

The below example is similar to int4range(7,6,'[]') which is invalid in 
PostgreSQL:

with Ada.Text_IO; use Ada.Text_IO;
procedure Hello is
   type Foo is range 7 .. 6;
begin
   Put_Line ( Foo'Image(Foo'First) );
   Put_Line ( Foo'Image(Foo'Last) );
end Hello;

$ ./gnatmake hello.adb
$ ./hello
7
6

I Ada, the 'Range of the Empty_String is 1 .. 0
https://en.wikibooks.org/wiki/Ada_Programming/Types/array#Array_Attributes

I think there is a case for allowing access to the the lower/upper vals instead 
of returning NULL,
since we can do so without changing what isempty() would return for the same 
values,.

/Joel

Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Tom Lane
"Joel Jacobson"  writes:
> Unless fixed, then the way I see it, I don't think we can use int4range[] for 
> regexp_positions(),

Yeah.  It's a cute idea, but the semantics aren't quite right.

regards, tom lane




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Isaac Morland
On Tue, 2 Mar 2021 at 00:06, Joel Jacobson  wrote:

> I find it strange two ranges of zero-length with different bounds are
> considered equal:
>
> SELECT '[7,7)'::int4range = '[8,8)'::int4range;
> ?column?
> --
> t
> (1 row)
>
> This seems like a bug to me. What am I missing here?
>
> Unless fixed, then the way I see it, I don't think we can use int4range[]
> for regexp_positions(),
> if we want to allow returning the positions for zero-length matches, which
> would be nice.
>

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be
convenient to have empty ranges at specific locations. Doing this would
introduce numerous questions which would have to be resolved. For example,
where/when is the empty range resulting from an intersection operation?


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Joel Jacobson
On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:
> I like the idea so I did a bit of testing.  I think the following should not 
> error, but does:
>
> +SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
> +ERROR:  range lower bound must be less than or equal to range upper bound

Thanks for testing!

The bug is due to using an (inclusive,inclusive) range,
so for the example the code tried to construct a (7,6,'[]') range.

When trying to fix, I think I've found a general problem with ranges:

I'll use int4range() to demonstrate the problem:

First the expected error for what the patch tries to do internally using 
make_range().
This is all good:

# SELECT int4range(7,6,'[]');
ERROR:  range lower bound must be less than or equal to range upper bound

I tried to fix this like this:

@ src/backend/utils/adt/regexp.c
-   lower.val = Int32GetDatum(so + 1);
+   lower.val = Int32GetDatum(so);
lower.infinite = false;
-   lower.inclusive = true;
+   lower.inclusive = false;
lower.lower = true;

Which would give the same result as doing:

SELECT int4range(6,6,'(]');
int4range
---
empty
(1 row)

Hmm. This "empty" value what surprise to me.
I would instead have assumed the canonical form "[7,7)".

If I wanted to know if the range is empty or not,
I would have guessed I should use the isempty() function, like this:

SELECT isempty(int4range(6,6,'(]'));
isempty
-
t
(1 row)

Since we have this isempty() function, I don't see the point of discarding
the lower/upper vals, since they contain possibly interesting information
on where the empty range exists.

I find it strange two ranges of zero-length with different bounds are 
considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
--
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[] for 
regexp_positions(),
if we want to allow returning the positions for zero-length matches, which 
would be nice.

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Mark Dilger



> On Mar 1, 2021, at 9:38 AM, Joel Jacobson  wrote:
> 
> Forgive me for just sending a patch without much discussion on the list,
> but it was so easy to implement, so I thought an implementation can
> help the discussion on if this is something we want or not.

I like the idea so I did a bit of testing.  I think the following should not 
error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-01 Thread Joel Jacobson
Hi,

I suggest adding a new function, regexp_positions(),
which works exactly like regexp_matches(),
except it returns int4range[] start/end positions for *where* the matches 
occurs.

I first thought I could live without this function,
and just get the positions using strpos(),
but as Andreas Karlsson kindly helped me understand,
that naive idea doesn't always work.

Andreas provided this pedagogic example
to demonstrate the problem:

SELECT regexp_matches('caaabaaa', '(?<=b)(a+)', 'g');
regexp_matches

{aaa}
(1 row)

If we would try to use strpos() to find the position,
based on the returned matched substring,
we would get the wrong answer:

SELECT strpos('caaabaaa','aaa');
strpos

  2
(1 row)

Sure, there is "aaa" at position 2,
but that's not where the match occurred,
since the (?<=b) means "positive lookbehind of the character b",
so the match actually occurred at position 6,
where there is a "b" before the "aaa".

Using regexp_positions(), we can now get the correct answer:

SELECT regexp_positions('caaabaaa', '(?<=b)(a+)', 'g');
regexp_positions
--
{"[6,9)"}
(1 row)

Some more examples from the regress/sql/strings.sql,
showing both regexp_matches() and regexp_positions()
for the same examples, as they both return the same structure,
but with different types:

SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 
'g');
regexp_matches

{bar,beque}
{bazil,barf}
(2 rows)

SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 
'g');
   regexp_positions
---
{"[4,7)","[7,12)"}
{"[12,17)","[17,21)"}
(2 rows)

I've added documentation and tests.

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

A few words on the implementation:
I copied build_regexp_match_result() to a new function 
build_regexp_positions_result(),
and removed the string parts, replacing it with code to make ranges instead.
Maybe there are common parts that could be put into some helper-function,
but I think not, since the functions are two small for it to be worth it.

Thanks to David Fetter for the idea on using ranges.

Based on HEAD (f5a5773a9dc4185414fe538525e20d8512c2ba35).

/Joel

0001-regexp-positions.patch
Description: Binary data