Krinkle added a comment.
In T173696#3584607, @Krinkle wrote:
> [..] why do we need to query an external service to execute a regular expression?
In T173696#3584821, @Lucas_Werkmeister_WMDE wrote:
> Because checking arbitrary regexes on arbitrary input opens us up to DoS attacks via malicious regexes [..]. This came up during the security review of the extension (T101467), and we eventually decided to check regexes on an external service instead, where queries are subject to a timeout (T102752).
Interesting. And this uses the same SPARQL endpoint at https://query.wikidata.org/ as public queries? I knew this was called from within Wikidata for MediaWiki API queries, but I didn't know it was also used e.g. when saving edits (assuming validation happens there).
I'm bringing this up because the reason we want caching is performance, right? ConstraintCheck/SparqlHelper in Wikidata currently has a timeout of 5 seconds, presumably in order to support some of the more complex regular expressions and larger input texts. I imagine that with a more performant backend, caching might not be as needed.
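(As a hedged aside, not from the task itself: the kind of "malicious regex" that makes such a timeout necessary is typically one with nested quantifiers, where a near-matching input triggers exponential backtracking. A quick Python sketch, since Python's `re` backtracks the same way PCRE does:)

```python
import re
import time

# Classic catastrophically backtracking pattern: nested quantifiers over
# the same character, anchored so the overall match is forced to fail.
pattern = re.compile(r"(a+)+$")

def time_failing_match(n):
    """Time a failing match on n 'a's followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    assert pattern.fullmatch(text) is None
    return time.perf_counter() - start

# Each extra character roughly doubles the work.
for n in (14, 18, 22):
    print(n, f"{time_failing_match(n):.3f}s")
```

Adding a handful of characters to the input multiplies the runtime, which is why arbitrary user-supplied regexes need either a hard timeout or a non-backtracking engine.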
(off-topic: )
Response times look fairly good in Graphite (source). The minutely p99 over the last 5 days ranges from 100-500ms, although with regular spikes of 1-5 seconds. From manual testing I find it's usually ~350ms with a short regex/text, and ~600-1200ms for larger input. The only responses under 300ms are those that were a cache hit in Varnish (yay), in which case it's typically 100-130ms. This is okay.
On the other hand, considering it's for the internal execution of a single regex, ~300ms is a lot. I wonder if something like a "simple" PHP or Python subprocess would work (something that runs preg_match or re.match, using a tight firejail with cgroup restrictions, like other MediaWiki subprocesses). Or perhaps using LuaSandbox? (below, via eval.php):
$sandbox = new LuaSandbox;
$sandbox->setCPULimit(0.25);
$sandbox->setMemoryLimit(512000); // ~512KB
$lua = "function match_text(text, pattern) return string.match(text, pattern) ~= nil end";
$sandbox->loadString($lua)->call();
$result = $sandbox->callFunction('match_text', 'foo', 'f..');
return is_array($result) && isset($result[0]) && $result[0] === true;
These lines take < 1ms. With large input (e.g. a wikitext article) and a more complex pattern, between 5 and 10ms. However, I quickly realised Lua doesn't implement PCRE regular expressions by default (T52454, Lua Patterns). I wonder if it's worth revisiting the security implications of adding PCRE bindings to LuaSandbox (perhaps restricted to a subclass for Wikibase).
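(To make the subprocess idea above concrete: a minimal Python sketch, assuming a child interpreter per check with a wall-clock timeout in place of firejail/cgroup confinement. The function name and timeout value are hypothetical, not from the task:)

```python
import subprocess
import sys

def match_with_timeout(pattern, text, timeout=2.0):
    """Run re.search in a child Python process, killed after `timeout` seconds.

    Returns True/False for match/no-match, or None if the regex timed out.
    A production version would also confine the child (firejail, cgroups).
    """
    child_code = (
        "import re, sys\n"
        "pattern = sys.argv[1]\n"
        "text = sys.stdin.read()\n"
        "sys.exit(0 if re.search(pattern, text) else 1)\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, pattern],
            input=text, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # malicious/slow regex: child was killed
    return proc.returncode == 0

print(match_with_timeout("f..", "foo"))
```

Per-call process startup would dominate the < 1ms Lua numbers above, so a persistent worker pool would probably be needed to stay in that range.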
Anyway, just a few off topic thoughts :)
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs