[MediaWiki-commits] [Gerrit] mediawiki...WikibaseQualityConstraints[master]: Cache regex check results

2017-09-25 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/373918 )

Change subject: Cache regex check results
..


Cache regex check results

SparqlHelper::matchesRegularExpression is now a caching wrapper around a
new function, matchesRegularExpressionWithSparql, which contains the
previous code. This should reduce the number of requests to the query
service.

Currently, there are 1786 format constraints defined on Wikidata, for a
total of 77,862,938 format constraint checks on statements. This was
determined with a script that may be found in paste P5921, and does not
include the statement count for the property P2093 (author name string),
which cannot be counted on WDQS. According to SQID, there are 24,150,684
statements for that property [1], which brings the total to 102,013,622,
or roughly one hundred million format check results which might be
cached.
(However, most of these checks will probably never be performed since
the corresponding items are rarely visited, and the set of checks that
are performed frequently is likely much smaller.)

To avoid putting arbitrary user input into the cache key, both regex and
text are hashed. If desired, the hash for the regex can be “reversed”
with a WDQS query like the following:

SELECT ?property ?regex WHERE {
  ?property p:P2302 [
ps:P2302 wd:Q21502404;
pq:P1793 ?regex
  ].
  FILTER(SHA256(?regex) = "...")
}

When the property is known, the hash for the value can similarly be
“reversed”, as long as the property doesn’t have too many statements and
isn’t a “Commons link” property (those map to URIs on the query
service).
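
To fill in the "..." placeholder of a query like the one above, the hash
can be recomputed locally. The following is a minimal sketch, not part
of this change: the helper name cacheKeyHashes and the sample regex and
text are hypothetical, and it assumes that the query service's SHA256()
returns lowercase hexadecimal over the same UTF-8 bytes that PHP hashes.

```php
<?php
/**
 * Hypothetical helper (illustration only): returns the two SHA-256
 * hashes that the cache key is built from, as lowercase hex strings.
 *
 * @param string $regex
 * @param string $text
 * @return string[] [ regex hash, text hash ]
 */
function cacheKeyHashes( $regex, $text ) {
	return [
		hash( 'sha256', $regex ),
		hash( 'sha256', $text ),
	];
}

// Sample values, chosen only for illustration.
list( $regexHash, $textHash ) = cacheKeyHashes( '\d{4}', '2017' );

// Each hash is 64 hexadecimal characters long.
echo "regex hash: $regexHash\n";
echo "text hash:  $textHash\n";
```

The printed regex hash is the string that would go into the FILTER
expression in place of "...".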

[1]: https://tools.wmflabs.org/sqid/#/view?id=P2093

Bug: T173696
Change-Id: Iaaac950b483aaff83aa13b9f1b7d5090cd6c627f
---
M includes/ConstraintCheck/Helper/SparqlHelper.php
M includes/ConstraintReportFactory.php
M tests/phpunit/Helper/SparqlHelperTest.php
3 files changed, 103 insertions(+), 11 deletions(-)

Approvals:
  Aaron Schulz: Looks good to me, approved
  Krinkle: Looks good to me, but someone else must approve
  jenkins-bot: Verified



diff --git a/includes/ConstraintCheck/Helper/SparqlHelper.php b/includes/ConstraintCheck/Helper/SparqlHelper.php
index 13400b4..5bb6e48 100644
--- a/includes/ConstraintCheck/Helper/SparqlHelper.php
+++ b/includes/ConstraintCheck/Helper/SparqlHelper.php
@@ -6,6 +6,7 @@
 use IBufferingStatsdDataFactory;
 use MediaWiki\MediaWikiServices;
 use MWHttpRequest;
+use WANObjectCache;
 use Wikibase\DataModel\Entity\EntityIdParser;
 use Wikibase\DataModel\Entity\EntityIdParsingException;
 use Wikibase\DataModel\Statement\Statement;
@@ -43,6 +44,11 @@
private $entityIdParser;
 
/**
+* @var WANObjectCache
+*/
+   private $cache;
+
+   /**
 * @var IBufferingStatsdDataFactory
 */
private $dataFactory;
@@ -50,10 +56,12 @@
public function __construct(
Config $config,
RdfVocabulary $rdfVocabulary,
-   EntityIdParser $entityIdParser
+   EntityIdParser $entityIdParser,
+   WANObjectCache $cache
) {
$this->config = $config;
$this->entityIdParser = $entityIdParser;
+   $this->cache = $cache;
 
$this->entityPrefix = $rdfVocabulary->getNamespaceUri( RdfVocabulary::NS_ENTITY );
$this->prefixes = <getWithSetCallback(
+   $this->cache->makeKey(
+   'WikibaseQualityConstraints', // extension
+   'regex', // action
+   'WDQS-Java', // regex flavor
+   hash( 'sha256', $regex ),
+   hash( 'sha256', $text )
+   ),
+   WANObjectCache::TTL_DAY,
+   function() use ( $text, $regex ) {
+   $this->dataFactory->increment( 'wikibase.quality.constraints.regex.cachemiss' );
+   // convert to int because boolean false is interpreted as value not found
+   return (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
+   },
+   [
+   // avoid querying cache servers multiple times in a request
+   // (e. g. when checking format of a reference URL used multiple times on an entity)
+   'pcTTL' => WANObjectCache::TTL_PROC_LONG,
+   ]
+   );
+   }
+
+   /**
+* This function is only public for testing purposes;
+* use matchesRegularExpression, which is equivalent but caches results.
+*
+* @param string $text
+* @param string $regex
+* @return boolean
+* @throws 

[MediaWiki-commits] [Gerrit] mediawiki...WikibaseQualityConstraints[master]: Cache regex check results

2017-08-25 Thread Lucas Werkmeister (WMDE) (Code Review)
Lucas Werkmeister (WMDE) has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/373918 )

Change subject: Cache regex check results
..

Cache regex check results

SparqlHelper::matchesRegularExpression is now a caching wrapper around a
new function, matchesRegularExpressionWithSparql, which contains the
previous code. This should reduce the number of requests to the query
service.

Currently, there are 1786 format constraints defined on Wikidata, for a
total of 77,862,938 format constraint checks on statements. This was
determined with a script that may be found in paste P5921, and does not
include the statement count for the property P2093 (author name string),
which cannot be counted on WDQS. According to SQID, there are 24,150,684
statements for that property [1], which brings the total to 102,013,622,
or roughly one hundred million format check results which might be
cached.

This function calls WANObjectCache::makeKey with arbitrary user input. I
assume that’s safe to do, since at least one place in MediaWiki already
does that:

// ApiStashEdit::execute()
$params = $this->extractRequestParams();
// ...
$textHash = $params['stashedtexthash'];
$textKey = $cache->makeKey( 'stashedit', 'text', $textHash );

[1]: https://tools.wmflabs.org/sqid/#/view?id=P2093

Bug: T173696
Change-Id: Iaaac950b483aaff83aa13b9f1b7d5090cd6c627f
---
M includes/ConstraintCheck/Helper/SparqlHelper.php
M includes/ConstraintReportFactory.php
M tests/phpunit/Helper/SparqlHelperTest.php
3 files changed, 95 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/WikibaseQualityConstraints refs/changes/18/373918/1

diff --git a/includes/ConstraintCheck/Helper/SparqlHelper.php b/includes/ConstraintCheck/Helper/SparqlHelper.php
index 13400b4..ff6368e 100644
--- a/includes/ConstraintCheck/Helper/SparqlHelper.php
+++ b/includes/ConstraintCheck/Helper/SparqlHelper.php
@@ -6,6 +6,7 @@
 use IBufferingStatsdDataFactory;
 use MediaWiki\MediaWikiServices;
 use MWHttpRequest;
+use WANObjectCache;
 use Wikibase\DataModel\Entity\EntityIdParser;
 use Wikibase\DataModel\Entity\EntityIdParsingException;
 use Wikibase\DataModel\Statement\Statement;
@@ -43,6 +44,11 @@
private $entityIdParser;
 
/**
+* @var WANObjectCache
+*/
+   private $cache;
+
+   /**
 * @var IBufferingStatsdDataFactory
 */
private $dataFactory;
@@ -50,10 +56,12 @@
public function __construct(
Config $config,
RdfVocabulary $rdfVocabulary,
-   EntityIdParser $entityIdParser
+   EntityIdParser $entityIdParser,
+   WANObjectCache $cache
) {
$this->config = $config;
$this->entityIdParser = $entityIdParser;
+   $this->cache = $cache;
 
$this->entityPrefix = $rdfVocabulary->getNamespaceUri( RdfVocabulary::NS_ENTITY );
$this->prefixes = <getWithSetCallback(
+   $this->cache->makeKey(
+   'WikibaseQualityConstraints', // extension
+   'regex', // action
+   'WDQS-Java', // regex flavor
+   $regex,
+   $text
+   ),
+   WANObjectCache::TTL_INDEFINITE,
+   function() use ( $text, $regex ) {
+   // convert to int because boolean false is interpreted as value not found
+   return (int)$this->matchesRegularExpressionWithSparql( $text, $regex );
+   }
+   );
+   }
+
+   /**
+* This function is only public for testing purposes;
+* use matchesRegularExpression, which is equivalent but caches results.
+*
+* @param string $text
+* @param string $regex
+* @return boolean
+* @throws SparqlHelperException if the query times out or some other error occurs
+* @throws ConstraintParameterException if the $regex is invalid
+*/
+   public function matchesRegularExpressionWithSparql( $text, $regex ) {
$textStringLiteral = $this->stringLiteral( $text );
$regexStringLiteral = $this->stringLiteral( '^' . $regex . '$' );
 
diff --git a/includes/ConstraintReportFactory.php b/includes/ConstraintReportFactory.php
index 9f0046a..322cb24 100644
--- a/includes/ConstraintReportFactory.php
+++ b/includes/ConstraintReportFactory.php
@@ -208,7 +208,8 @@
$sparqlHelper = new SparqlHelper(
$this->config,
$this->rdfVocabulary,
-   $this->entityIdParser
+