Ladsgroup created this task.
Ladsgroup added a project: Wikidata.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  Very WIP
  ========
  
  Problem statement
  =================
  
  Wikidata validates statements against the constraints from the associated 
property. The constraints are user-defined and one of the possible constraint 
types for text values is a regex pattern.
  
  Because potentially malicious regexes can have a severe impact, the MediaWiki 
PHP backend for Wikidata does not use PHP's preg_match. Instead, regex 
evaluation needs to be isolated in some way.
  
  The current workaround uses the SPARQL query service, which incurs a lot of 
overhead (network latency, TCP, HTTP, SPARQL parsing, query engine preparation) 
and makes format constraint checks slow even for benign regexes. We 
should investigate whether we can check regexes more locally. However, the 
mechanism should be tightly restricted in order to avoid denial-of-service 
attacks via malicious regexes.
  
  Previous discussions
  ====================
  
  - T176312: Don’t check format constraint via SPARQL (safely evaluating 
user-provided regular expressions) <https://phabricator.wikimedia.org/T176312>
  
  Alternative ideas
  -----------------
  
  - Evaluating the regex using Lua
    - Pros: No need for another service, easier to implement, more performant 
due to lack of network roundtrips
    - Cons: Lua doesn't implement most PCRE features: T176312#3625405 
<https://phabricator.wikimedia.org/T176312#3625405>
  - PHP program called as a subprocess inside Firejail.
    - Pros: More performant due to lack of network roundtrips
    - Cons: It looks iffy
  - Using re2:
    - Pros: It works...
    - Cons: ... partially. It still needs a PHP binding or a separate service.
  
  Proposed solution
  =================
  
  The proposed solution is a stand-alone, sandboxed service for evaluating 
user-provided regexes, accessible via gRPC from the rest of the 
infrastructure, including MediaWiki nodes.
  
  A possible client-side design would be something like this:
  
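    // GrpcRegexClient and GrpcRegexRequest stand for classes generated from
    // the service's (not yet written) .proto definition.
    $client = new GrpcRegexClient('node-server:9090', [
        'credentials' => Grpc\ChannelCredentials::createInsecure(),
    ]);
    
    $request = new GrpcRegexRequest();
    $request->setRegex( $userProvidedRegex );
    $request->setText( $userProvidedText );
    list($response, $status) = $client->Evaluate( $request )->wait();
    // On failure (e.g. a timeout) $response is null, so check the status first.
    if ( $status->code !== Grpc\STATUS_OK ) {
        throw new RuntimeException( "Regex service error: {$status->details}" );
    }
    echo $response->getResult()."\n";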
  
  On the server side, we will have a service (in Node.js, PHP, or Python; the 
language doesn't matter) that takes a regex and a string and evaluates the 
match (maybe there's some codebase for this already?)
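
  As a rough illustration of the server-side evaluation step only (no gRPC 
wiring and no sandboxing such as Firejail), the sketch below runs the match in 
a short-lived child process and kills it after a wall-clock timeout. 
Everything in it is an assumption made for illustration: the function name 
evaluate_regex, the 1-second default limit, the full-match semantics, and the 
use of Python's re module, whose syntax differs from PCRE.

```python
import json
import subprocess
import sys

# Child program: reads {"regex": ..., "text": ...} as JSON from stdin and
# prints a JSON result. It runs in its own process so it can be killed.
_CHILD_SOURCE = r"""
import json, re, sys
payload = json.load(sys.stdin)
try:
    matched = re.fullmatch(payload["regex"], payload["text"]) is not None
    print(json.dumps({"ok": True, "matched": matched}))
except re.error as exc:
    print(json.dumps({"ok": False, "error": str(exc)}))
"""


def evaluate_regex(regex: str, text: str, timeout_seconds: float = 1.0) -> dict:
    """Evaluate regex against text in a child process with a time limit.

    Hypothetical helper: returns a dict whose "status" is one of
    "ok", "invalid_regex", "timeout", or "error".
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", _CHILD_SOURCE],
            input=json.dumps({"regex": regex, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout"}
    if completed.returncode != 0:
        # Any unexpected crash in the child (e.g. recursion limit) lands here.
        return {"status": "error", "error": completed.stderr.strip()}
    result = json.loads(completed.stdout)
    if not result["ok"]:
        return {"status": "invalid_regex", "error": result["error"]}
    return {"status": "ok", "matched": result["matched"]}
```

  A call such as evaluate_regex(r"(a+)+$", "a" * 64 + "!") should come back 
with status "timeout" instead of blocking the caller. Spawning a process per 
request is simple but slow; a real service would more likely keep a pool of 
workers and add memory and CPU limits on top of the timeout.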

TASK DETAIL
  https://phabricator.wikimedia.org/T240884

To: Ladsgroup
Cc: Lucas_Werkmeister_WMDE, Addshore, Aklapper, Ladsgroup, darthmon_wmde, 
Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331