[MediaWiki-commits] [Gerrit] mediawiki...WikibaseQualityConstraints[master]: Cache regex check results
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/373918 )

Change subject: Cache regex check results
..

Cache regex check results

SparqlHelper::matchesRegularExpression is changed to be a wrapper around a new function containing the previous code, caching its result. This should hopefully reduce the number of requests to the query service.

Currently, there are 1786 format constraints defined on Wikidata, for a total of 77,862,938 format constraint checks on statements. This was determined with a script that may be found in paste P5921, and does not include the statement count for the property P2093 (author name string), which cannot be counted on WDQS. According to SQID, there are 24,150,684 statements for that property [1], so the total count is approximately one hundred million format check results which might be cached. (However, most of these checks will probably never be performed since the corresponding items are rarely visited, and the set of checks that are performed frequently is likely much smaller.)

To avoid putting arbitrary user input into the cache key, both regex and text are hashed. If desired, the hash for the regex can be “reversed” with a WDQS query like the following:

  SELECT ?property ?regex WHERE {
    ?property p:P2302 [
      ps:P2302 wd:Q21502404;
      pq:P1793 ?regex
    ].
    FILTER(SHA256(?regex) = "...")
  }

When the property is known, the hash for the value can similarly be “reversed”, as long as the property doesn’t have too many statements and isn’t a “Commons link” property (those map to URIs on the query service).
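
The two ideas in this message, hashing both inputs before they enter the cache key and storing the result as an integer so that a cached "no match" is not mistaken for a miss, can be sketched in isolation. This is a minimal illustration under stated assumptions, not the merged code: ArrayCache, regexCacheKey and cachedMatch are hypothetical names, the array-backed cache only imitates a cache whose getter returns false for a missing key, and a local preg_match stands in for the query service (which uses a different regex flavor).

```php
<?php
// Hypothetical stand-in for a cache whose getter returns false on a miss,
// so a stored boolean false could not be told apart from "not found".
final class ArrayCache {
	/** @var array<string, mixed> */
	private $values = [];

	/** Returns the cached value, or false when the key is absent. */
	public function get( string $key ) {
		return $this->values[$key] ?? false;
	}

	public function set( string $key, $value ): void {
		$this->values[$key] = $value;
	}
}

/** Builds a key from fixed components plus SHA-256 digests of the user-controlled inputs. */
function regexCacheKey( string $regex, string $text ): string {
	return implode( ':', [
		'WikibaseQualityConstraints', // extension
		'regex', // action
		'WDQS-Java', // regex flavor
		hash( 'sha256', $regex ),
		hash( 'sha256', $text ),
	] );
}

/** Returns the cached match result, computing and storing it on a miss. */
function cachedMatch( ArrayCache $cache, string $regex, string $text, callable $check ): bool {
	$key = regexCacheKey( $regex, $text );
	$cached = $cache->get( $key );
	if ( $cached !== false ) {
		// 0 and 1 are both valid cached results; only false means "miss"
		return (bool)$cached;
	}
	// store as int so that a negative result is not read back as a miss
	$result = (int)$check( $text, $regex );
	$cache->set( $key, $result );
	return (bool)$result;
}

// Usage: the second call with the same inputs does not invoke the check again.
$calls = 0;
$check = function ( string $text, string $regex ) use ( &$calls ): bool {
	$calls++;
	// assumption: preg_match is only a local substitute for the SPARQL-based check
	return preg_match( '/^(?:' . $regex . ')$/', $text ) === 1;
};
$cache = new ArrayCache();
cachedMatch( $cache, '[0-9]+', 'abc', $check ); // computed, stored as 0
cachedMatch( $cache, '[0-9]+', 'abc', $check ); // served from the cache
```

Because the digests have a fixed length and alphabet, the key stays well-formed regardless of what the regex or the text contains, at the cost of no longer being readable without a lookup like the query above.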

[1]: https://tools.wmflabs.org/sqid/#/view?id=P2093

Bug: T173696
Change-Id: Iaaac950b483aaff83aa13b9f1b7d5090cd6c627f
---
M includes/ConstraintCheck/Helper/SparqlHelper.php
M includes/ConstraintReportFactory.php
M tests/phpunit/Helper/SparqlHelperTest.php
3 files changed, 103 insertions(+), 11 deletions(-)

Approvals:
  Aaron Schulz: Looks good to me, approved
  Krinkle: Looks good to me, but someone else must approve
  jenkins-bot: Verified

diff --git a/includes/ConstraintCheck/Helper/SparqlHelper.php b/includes/ConstraintCheck/Helper/SparqlHelper.php
index 13400b4..5bb6e48 100644
--- a/includes/ConstraintCheck/Helper/SparqlHelper.php
+++ b/includes/ConstraintCheck/Helper/SparqlHelper.php
@@ -6,6 +6,7 @@
 use IBufferingStatsdDataFactory;
 use MediaWiki\MediaWikiServices;
 use MWHttpRequest;
+use WANObjectCache;
 use Wikibase\DataModel\Entity\EntityIdParser;
 use Wikibase\DataModel\Entity\EntityIdParsingException;
 use Wikibase\DataModel\Statement\Statement;
@@ -43,6 +44,11 @@
 	private $entityIdParser;

 	/**
+	 * @var WANObjectCache
+	 */
+	private $cache;
+
+	/**
 	 * @var IBufferingStatsdDataFactory
 	 */
 	private $dataFactory;
@@ -50,10 +56,12 @@
 	public function __construct(
 		Config $config,
 		RdfVocabulary $rdfVocabulary,
-		EntityIdParser $entityIdParser
+		EntityIdParser $entityIdParser,
+		WANObjectCache $cache
 	) {
 		$this->config = $config;
 		$this->entityIdParser = $entityIdParser;
+		$this->cache = $cache;
 		$this->entityPrefix = $rdfVocabulary->getNamespaceUri( RdfVocabulary::NS_ENTITY );
 		$this->prefixes = <getWithSetCallback(
+			$this->cache->makeKey(
+				'WikibaseQualityConstraints', // extension
+				'regex', // action
+				'WDQS-Java', // regex flavor
+				hash( 'sha256', $regex ),
+				hash( 'sha256', $text )
+			),
+			WANObjectCache::TTL_DAY,
+			function() use ( $text, $regex ) {
+				$this->dataFactory->increment( 'wikibase.quality.constraints.regex.cachemiss' );
+				// convert to int because boolean false is interpreted as value not found
+				return (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
+			},
+			[
+				// avoid querying cache servers multiple times in a request
+				// (e. g. when checking format of a reference URL used multiple times on an entity)
+				'pcTTL' => WANObjectCache::TTL_PROC_LONG,
+			]
+		);
+	}
+
+	/**
+	 * This function is only public for testing purposes;
+	 * use matchesRegularExpression, which is equivalent but caches results.
+	 *
+	 * @param string $text
+	 * @param string $regex
+	 * @return boolean
+	 * @throws
[MediaWiki-commits] [Gerrit] mediawiki...WikibaseQualityConstraints[master]: Cache regex check results
Lucas Werkmeister (WMDE) has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/373918 )

Change subject: Cache regex check results
..

Cache regex check results

SparqlHelper::matchesRegularExpression is changed to be a wrapper around a new function containing the previous code, caching its result. This should hopefully reduce the number of requests to the query service.

Currently, there are 1786 format constraints defined on Wikidata, for a total of 77862938 format constraint checks on statements. This was determined with a script that may be found in paste P5921, and does not include the statement count for the property P2093 (author name string), which cannot be counted on WDQS. According to SQID, there are 24150684 statements for that property [1], so the total count is approximately one hundred million format check results which might be cached.

This function calls WANObjectCache::makeKey with arbitrary user input. I assume that’s safe to do, since at least one place in MediaWiki already does that:

  // ApiStashEdit::execute()
  $params = $this->extractRequestParams();
  // ...
  $textHash = $params['stashedtexthash'];
  $textKey = $cache->makeKey( 'stashedit', 'text', $textHash );

[1]: https://tools.wmflabs.org/sqid/#/view?id=P2093

Bug: T173696
Change-Id: Iaaac950b483aaff83aa13b9f1b7d5090cd6c627f
---
M includes/ConstraintCheck/Helper/SparqlHelper.php
M includes/ConstraintReportFactory.php
M tests/phpunit/Helper/SparqlHelperTest.php
3 files changed, 95 insertions(+), 11 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/WikibaseQualityConstraints refs/changes/18/373918/1

diff --git a/includes/ConstraintCheck/Helper/SparqlHelper.php b/includes/ConstraintCheck/Helper/SparqlHelper.php
index 13400b4..ff6368e 100644
--- a/includes/ConstraintCheck/Helper/SparqlHelper.php
+++ b/includes/ConstraintCheck/Helper/SparqlHelper.php
@@ -6,6 +6,7 @@
 use IBufferingStatsdDataFactory;
 use MediaWiki\MediaWikiServices;
 use MWHttpRequest;
+use WANObjectCache;
 use Wikibase\DataModel\Entity\EntityIdParser;
 use Wikibase\DataModel\Entity\EntityIdParsingException;
 use Wikibase\DataModel\Statement\Statement;
@@ -43,6 +44,11 @@
 	private $entityIdParser;

 	/**
+	 * @var WANObjectCache
+	 */
+	private $cache;
+
+	/**
 	 * @var IBufferingStatsdDataFactory
 	 */
 	private $dataFactory;
@@ -50,10 +56,12 @@
 	public function __construct(
 		Config $config,
 		RdfVocabulary $rdfVocabulary,
-		EntityIdParser $entityIdParser
+		EntityIdParser $entityIdParser,
+		WANObjectCache $cache
 	) {
 		$this->config = $config;
 		$this->entityIdParser = $entityIdParser;
+		$this->cache = $cache;
 		$this->entityPrefix = $rdfVocabulary->getNamespaceUri( RdfVocabulary::NS_ENTITY );
 		$this->prefixes = <getWithSetCallback(
+			$this->cache->makeKey(
+				'WikibaseQualityConstraints', // extension
+				'regex', // action
+				'WDQS-Java', // regex flavor
+				$regex,
+				$text
+			),
+			WANObjectCache::TTL_INDEFINITE,
+			function() use ( $text, $regex ) {
+				// convert to int because boolean false is interpreted as value not found
+				return (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
+			}
+		);
+	}
+
+	/**
+	 * This function is only public for testing purposes;
+	 * use matchesRegularExpression, which is equivalent but caches results.
+	 *
+	 * @param string $text
+	 * @param string $regex
+	 * @return boolean
+	 * @throws SparqlHelperException if the query times out or some other error occurs
+	 * @throws ConstraintParameterException if the $regex is invalid
+	 */
+	public function matchesRegularExpressionWithSparql( $text, $regex ) {
 		$textStringLiteral = $this->stringLiteral( $text );
 		$regexStringLiteral = $this->stringLiteral( '^' . $regex . '$' );
diff --git a/includes/ConstraintReportFactory.php b/includes/ConstraintReportFactory.php
index 9f0046a..322cb24 100644
--- a/includes/ConstraintReportFactory.php
+++ b/includes/ConstraintReportFactory.php
@@ -208,7 +208,8 @@
 			$sparqlHelper = new SparqlHelper(
 				$this->config,
 				$this->rdfVocabulary,
-				$this->entityIdParser
+