Julian Hyde created CALCITE-5914:
------------------------------------

             Summary: Cache compiled regular expressions in SQL function runtime
                 Key: CALCITE-5914
                 URL: https://issues.apache.org/jira/browse/CALCITE-5914
             Project: Calcite
          Issue Type: Bug
            Reporter: Julian Hyde


Cache compiled regular expressions (and other amortized work) in SQL function 
runtime.

Consider the following query:
{code}
SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
FROM emp
{code}

The first regular expression, {{A.*}}, is constant and can be compiled at 
prepare time or at the start of execution; the second regular expression '{{job 
|| '.*'}}' might vary from one row to the next. However if the {{job}} column 
has a small number of values it still might be beneficial to cache the compiled 
regular expression.

If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
{{java.util.regex.Pattern}}) then it would achieve benefits in both the 
constant and non-constant cases.

The cache needs to:
 * be thread-safe (in case queries are executing using multiple threads),
 * return thread-safe objects (as is {{Pattern}}),
 * have bounded space (so that a query doesn't blow memory with 1 million 
distinct regular expressions),
 * disposed after the query has terminated,
 * (ideally) share with regexes of the same language in the same query,
 * not conflict with regexes of different languages in the same query.

One possibility is to add an {{interface FunctionState}}, with subclasses 
including {{class RegexpCache}}, and if argument 1 of a function is a subclass 
of {{FunctionState}} the compiler would initialize the state in the generated 
code. The function can rely on the state argument being initialized, and being 
the same object from one call to the next. Example:

{code}
interface FunctionState {
}

class RegexpCache implements FunctionState {
  final Cache<String, Pattern> cache = ...;
}
{code}

This change should install the cache for all applicable functions, including 
LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_CONTAINS, REGEXP_REPLACE, 
other REGEXP_ functions, PARSE_URL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to