Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-17 Thread Tom Lane
Alexander Lakhin  writes:
> select * from test_regex(repeat('(x)', 32), 'a', 'c');
> leads to
> ==00:00:00:05.736 2605072== Invalid write of size 4

Indeed, silly oversight on my part.  Fixed, thanks!

regards, tom lane




Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-16 Thread Alexander Lakhin
Hello Tom,
04.01.2021 08:47, Tom Lane wrote:
> Hm.  There isn't anything about test_regex()'s API that I would want
> to expose to end users ;-) ... it's just designed to replicate the
> test API that Henry Spencer designed a couple decades ago, which IMO
> was not too clean even by the standards of the time.
As test_regex() is not meant to be public, maybe this is of little
importance, but I've found another bug when exploiting the new test module:
select * from test_regex(repeat('(x)', 32), 'a', 'c');
leads to
==00:00:00:05.736 2605072== Invalid write of size 4
==00:00:00:05.736 2605072==    at 0x4866D09: setup_test_matches
(test_regex.c:564)
==00:00:00:05.736 2605072==    by 0x4867276: test_regex (test_regex.c:105)
==00:00:00:05.736 2605072==    by 0x37B10C: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:00:05.736 2605072==    by 0x38AAA4: FunctionNext
(nodeFunctionscan.c:95)
==00:00:00:05.736 2605072==    by 0x37BA68: ExecScanFetch (execScan.c:133)
==00:00:00:05.736 2605072==    by 0x37BB05: ExecScan (execScan.c:182)
==00:00:00:05.736 2605072==    by 0x38A9CE: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:00:05.736 2605072==    by 0x378E5D: ExecProcNodeFirst
(execProcnode.c:450)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecProcNode (executor.h:247)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecutePlan (execMain.c:1542)
==00:00:00:05.736 2605072==    by 0x372E22: standard_ExecutorRun
(execMain.c:364)
==00:00:00:05.736 2605072==    by 0x372EF2: ExecutorRun (execMain.c:308)
==00:00:00:05.736 2605072==    by 0x4E2B77: PortalRunSelect (pquery.c:912)
==00:00:00:05.736 2605072==  Address 0xee4d5ec is 908 bytes inside a
block of size 1,024 alloc'd
==00:00:00:05.736 2605072==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:05.736 2605072==    by 0x62239C: AllocSetAlloc (aset.c:919)
==00:00:00:05.736 2605072==    by 0x629656: palloc0 (mcxt.c:995)
==00:00:00:05.736 2605072==    by 0x4866A06: setup_test_matches
(test_regex.c:450)
==00:00:00:05.736 2605072==    by 0x4867276: test_regex (test_regex.c:105)
==00:00:00:05.736 2605072==    by 0x37B10C: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:00:05.736 2605072==    by 0x38AAA4: FunctionNext
(nodeFunctionscan.c:95)
==00:00:00:05.736 2605072==    by 0x37BA68: ExecScanFetch (execScan.c:133)
==00:00:00:05.736 2605072==    by 0x37BB05: ExecScan (execScan.c:182)
==00:00:00:05.736 2605072==    by 0x38A9CE: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:00:05.736 2605072==    by 0x378E5D: ExecProcNodeFirst
(execProcnode.c:450)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecProcNode (executor.h:247)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecutePlan (execMain.c:1542)
==00:00:00:05.736 2605072==

Best regards,
Alexander




Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-04 Thread Joel Jacobson
On Mon, Jan 4, 2021, at 04:49, Tom Lane wrote:
>Over the holiday break I've been fooling with some regex performance
>improvements.

Cool! I've also been fooling with regex performance over the years myself, not 
in the PostgreSQL code, but in general.

More specifically, to first DFA-minimize the regex,
and then to generate LLVMIR directly from the graph.

Perhaps some of the ideas could be interesting to look at.

Here is a live demo:  https://compiler.org/reason-re-nfa/src/index.html

One idea that I came up with myself is the "merge_linear" step,
where when possible, multiple characters are read in the same operation.
Not sure if other regex JIT engines does this, but it makes quite a difference
for large regexes where you have long strings.

Note: There is no support for capture groups, back-references, etc, but | + * 
() [] [^] works.

/Joel

Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-03 Thread Tom Lane
Mark Dilger  writes:
> Have you thought about whether If it weren't in test/modules, it might be 
> nice to expose test_regex from SQL with a slightly different interface that 
> doesn't throw on regex compilation error?  Maybe something in contrib?  It 
> might be useful to some users to validate regular expressions.  I'm just 
> asking... I don't have any problem with how you have it here.

Hm.  There isn't anything about test_regex()'s API that I would want
to expose to end users ;-) ... it's just designed to replicate the
test API that Henry Spencer designed a couple decades ago, which IMO
was not too clean even by the standards of the time.

However, I can see the argument for a function along the lines of
"validate_regex(text) returns bool".  If you or somebody else wants to
write that, independently of this patch, I think it'd be a fine idea.

regards, tom lane




Re: Test harness for regex code (to allow importing Tcl's test suite)

2021-01-03 Thread Mark Dilger



> On Jan 3, 2021, at 7:49 PM, Tom Lane  wrote:
> 
> Over the holiday break I've been fooling with some regex performance
> improvements.  I don't have anything ready to show yet in that line,
> but I was feeling the need for more-thorough test coverage, so I set
> to work on something that's been in the back of my mind for a long
> time: we need to absorb the test cases that Henry Spencer wrote way-
> back-when for that code, which up to now existed only as a script
> in the Tcl test suite.  That state of affairs seemed OK to me in
> the beginning, when we thought of Tcl as the upstream for that code,
> and figured they'd vet any nontrivial changes.  They've pretty
> thoroughly dropped the ball on that though, and indeed I think that
> they now believe *we're* the upstream.  So we need to have a test
> suite that reflects that status, at least to the extent of running
> all the test cases that exist for that code.
> 
> Accordingly, here's a new src/test/modules package that creates a
> function modeled on regexp_matches(), but using a set of flags that
> matches Spencer's design for the Tcl test suite, allowing parts
> of src/backend/regex/ to be tested that can't be reached with our
> existing SQL-exposed functions.  The test scripts in the module
> reproduce all the tests in Tcl's "tests/reg.test" script as of
> Tcl 8.6.10, plus a few others that I felt advisable, such as tests
> for the lookbehind constraints we added a few years ago.  (Note:
> Tcl also has regexp.test and regexpComp.test, but those seem to be
> oriented towards testing their language-specific wrappers not the
> regex engine itself.)
> 
> According to my testing, this increases our test code coverage for
> src/backend/regex/ from 71.1% to 86.7%, which is not too shabby,
> especially seeing that a lot of the remainder is not-deterministically-
> reachable code for malloc failure handling.
> 
> Thoughts?  Is anyone interested in reviewing this?  Since it's only
> test code, I'd be okay with pushing it without review, but I'd be
> happy if someone else wants to look at it.

I've quickly read this over and generally like it.  Thanks for working on this!

Have you thought about whether If it weren't in test/modules, it might be nice 
to expose test_regex from SQL with a slightly different interface that doesn't 
throw on regex compilation error?  Maybe something in contrib?  It might be 
useful to some users to validate regular expressions.  I'm just asking... I 
don't have any problem with how you have it here.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company