dcausse moved this task from Waiting to In Progress on the Discovery-Search 
(Current work) board.
dcausse added a comment.


  All the revisions I manually checked were created on the same day, 
2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from 
one server, but I'm having a hard time getting Blazegraph to answer this query 
without failing:
  
    select ?s ?c ?date {
      ?s wdt:P31 ?c .
      FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
      ?s schema:dateModified ?date .
    } limit XX
  
  where XX=20 is the highest limit Blazegraph can answer on `wdqs1010` without 
failing with `com.bigdata.rwstore.sector.MemoryManagerOutOfMemory`.
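  
  One way to sidestep the Blazegraph memory limit might be to scan a Turtle 
dump of the data offline with the same Sesame RIO API the updater uses, 
instead of going through SPARQL. The sketch below is untested; the class name 
is made up and the locally available Turtle file is an assumption:
  
    import java.io.FileInputStream;
    import java.io.InputStream;
    
    import org.openrdf.model.Statement;
    import org.openrdf.model.URI;
    import org.openrdf.model.Value;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.RDFParser;
    import org.openrdf.rio.Rio;
    import org.openrdf.rio.helpers.RDFHandlerBase;
    
    // Hypothetical one-off scanner, not part of the updater: prints every
    // triple whose object is a Special:EntityData URI, i.e. the same
    // incoherence the SPARQL query above looks for.
    public class FindEntityDataObjects {
        private static final String PREFIX = "https://www.wikidata.org/wiki/Special:EntityData";
    
        public static void main(String[] args) throws Exception {
            // args[0]: path to a locally available Turtle file (an assumption).
            try (InputStream in = new FileInputStream(args[0])) {
                RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
                parser.setRDFHandler(new RDFHandlerBase() {
                    @Override
                    public void handleStatement(Statement st) {
                        Value o = st.getObject();
                        if (o instanceof URI && o.stringValue().startsWith(PREFIX)) {
                            System.out.println(st.getSubject() + " " + st.getPredicate() + " " + o);
                        }
                    }
                });
                parser.parse(in, "https://www.wikidata.org/");
            }
        }
    }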
  
  Moving back to In Progress, as the journal of most servers seems corrupted 
with such data and we need to either confirm or discard T255282 
<https://phabricator.wikimedia.org/T255282> (a single occurrence of this 
incoherence for a revision not created on 2020-06-12 before 17:00 UTC would 
rule that explanation out), but more importantly we need to clean up the data.
  
  Regarding how this could happen from the wdqs-updater perspective:
  
  When parsing the item RDF data, the updater uses a URI like 
`https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106`
as the base URI for the Sesame RIO parser.
  
    
    StatementCollector collector = new StatementCollector();
    RDFParser parser = RDFParserSuppliers.defaultRdfParser().get(collector);
    String baseUri = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106";
    parser.parse(new StringReader("<uri:subject> <uri:pred> <> ."), baseUri);
    RDFWriter writer = RDFWriterRegistry.getInstance().get(RDFFormat.TURTLE).getWriter(System.out);
    writer.startRDF();
    for (Statement st : collector.getStatements()) {
        writer.handleStatement(st);
    }
    writer.endRDF();
  
  The parser will then interpret the turtle:
  
    <uri:subject> <uri:pred> <> .
  
  as
  
    <uri:subject> <uri:pred> <https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106> .
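  
  A cheap sanity check on the updater side could be to flag any statement 
whose object resolved to the base URI itself, since that is exactly what a 
relative `<>` becomes. The helper below is only a sketch of that idea, not 
existing updater code; the class and method names are made up:
  
    import java.util.ArrayList;
    import java.util.List;
    
    import org.openrdf.model.Statement;
    import org.openrdf.model.URI;
    import org.openrdf.model.Value;
    
    // Hypothetical helper: collect statements whose object is the base URI
    // itself, i.e. what a relative <> resolves to under the
    // Special:EntityData base URI.
    public final class BaseUriResolutionCheck {
        private BaseUriResolutionCheck() {}
    
        public static List<Statement> suspicious(Iterable<Statement> statements, String baseUri) {
            List<Statement> bad = new ArrayList<>();
            for (Statement st : statements) {
                Value o = st.getObject();
                if (o instanceof URI && o.stringValue().equals(baseUri)) {
                    bad.add(st);
                }
            }
            return bad;
        }
    }
  
  Applied to the `collector` from the snippet above, it would flag the 
statement produced by the relative `<>`.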
  
  On the PHP side, the notice mentioned in T255282 
<https://phabricator.wikimedia.org/T255282> indicates that `trim() expects 
parameter 1 to be string, object given`.
  `trim()` returns NULL when given an object; that NULL is then passed to 
`\Wikimedia\Purtle\RdfWriter::is( $base, $local )`, which outputs `<>` when 
both arguments are NULL:
  
    $writer = new \Wikimedia\Purtle\TurtleRdfWriter();
    $writer->start();
    $writer->about( "uri", "subject" );
    $writer->say( "uri", "predicate" )->is( null );
    $writer->finish();
    print( $writer->drain() );
  
  will output:
  
    uri:subject uri:predicate <> .
  
  It is very probable that the notices seen in T255282 
<https://phabricator.wikimedia.org/T255282> caused such triples to be written 
to the ttl output of Special:EntityData.

TASK DETAIL
  https://phabricator.wikimedia.org/T255657
