Krinkle added a comment.

[..] why do we need to query an external service to execute a regular expression?

Because checking arbitrary regexes on arbitrary input opens us up to DoS attacks via malicious regexes [..]. This came up during the security review of the extension (T101467), and we eventually decided to check regexes on an external service instead, where queries are subject to a timeout (T102752).
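For context, the failure mode behind such a DoS is catastrophic backtracking. A quick illustration in Python (purely to show the mechanism; this is not how the extension runs its checks):

```python
import re
import time

# Nested quantifiers over the same character: on non-matching input,
# a backtracking engine tries roughly 2^n ways to split the run of "a"s
# before concluding there is no match.
PATHOLOGICAL = r'(a+)+$'

def timed_search(pattern, text):
    start = time.perf_counter()
    result = re.search(pattern, text)
    return result, time.perf_counter() - start

# Harmless at this size (~2^15 backtracking paths)...
result, elapsed = timed_search(PATHOLOGICAL, 'a' * 15 + 'b')
assert result is None

# ...but every extra "a" roughly doubles the work, so an input like
# 'a' * 40 + 'b' would tie up the worker for hours. Hence: no arbitrary
# user regexes in-process without a hard timeout.
```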

Interesting. And this uses the same SPARQL endpoint at https://query.wikidata.org/ as public queries do? I knew this was called from within Wikidata for MediaWiki API queries, but I didn't know it was also used e.g. when saving edits (assuming validation happens there).

I'm bringing this up because the reason we want caching is performance, right? ConstraintCheck/SparqlHelper in Wikidata currently has a timeout of 5 seconds, presumably in order to support some of the more complex regular expressions and larger input texts. I imagine that with a more performant backend, caching might not be as needed.

(off-topic: )

Response times look fairly good in Graphite (source). The minutely p99 over the last 5 days ranges from 100-500ms, although with regular spikes of 1-5 seconds. From manual testing I find it's usually ~350ms with a short regex/text, and ~600-1200ms for larger input. The only responses under 300ms are those that were a cache hit in Varnish (yay), in which case it's typically 100-130ms. That's okay.

On the other hand, considering it's for internal execution of a single regex, ~300ms is a lot. I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, inside a tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps LuaSandbox? (below, run via eval.php):

$sandbox = new LuaSandbox;
$sandbox->setCPULimit( 0.25 );
$sandbox->setMemoryLimit( 512 * 1024 ); // 512 KiB
$lua = 'function match_text(text, pattern) return string.match(text, pattern) ~= nil end';
$sandbox->loadString( $lua )->call();
$result = $sandbox->callFunction( 'match_text', 'foo', 'f..' );
return is_array( $result ) && isset( $result[0] ) && $result[0] === true;

These lines take < 1ms. With large input (e.g. a wikitext article) and a more complex pattern, between 5 and 10ms. However, I quickly realised Lua doesn't implement PCRE regular expressions by default (T52454, Lua Patterns). I wonder if it's worth revisiting the security implications of adding PCRE bindings to LuaSandbox (perhaps limited to a subclass for Wikibase).
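For what it's worth, the subprocess idea above could be sketched like this, in Python for brevity (the function name and the 0.5s budget are made up; a real version would additionally wrap the child in firejail/cgroups as described above):

```python
import subprocess
import sys

# Child program: exit 0 on match, 1 on no match. Running the regex in a
# separate process lets the parent enforce a hard wall-clock timeout
# even if the regex engine gets stuck backtracking.
MATCHER = """
import re, sys
sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)
"""

def match_in_subprocess(pattern, text, timeout=0.5):
    # Returns True/False for match/no-match, or None if the child
    # exceeded the timeout (subprocess.run kills it before raising).
    try:
        proc = subprocess.run(
            [sys.executable, '-c', MATCHER, pattern, text],
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode == 0
```

With this, a benign call like match_in_subprocess('f..', 'foo') returns True, while a catastrophically-backtracking pattern over long input returns None after the timeout instead of hanging the request.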

Anyway, just a few off topic thoughts :)


TASK DETAIL
https://phabricator.wikimedia.org/T173696
