dcausse moved this task from Waiting to In Progress on the Discovery-Search (Current work) board. dcausse added a comment.
All the revisions I manually checked were created on the same day, 2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from one server, but I'm having a hard time keeping Blazegraph from failing:

  select ?s ?c ?date {
    ?s wdt:P31 ?c .
    FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
    ?s schema:dateModified ?date .
  } limit XX

where XX=20 is the largest limit Blazegraph can answer without failing with `com.bigdata.rwstore.sector.MemoryManagerOutOfMemory` on `wdqs1010`.

Moving back to in progress, as the journals of most servers seem to be corrupted with such data and we need to either confirm or rule out T255282 <https://phabricator.wikimedia.org/T255282> (a single occurrence of this incoherence for a revision not created on 2020-06-12 before 17:00 UTC would rule it out) but, more importantly, clean up the data.

Regarding how this could happen from the wdqs-updater perspective: when parsing the item RDF data, the updater uses a URI of the form `https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106` as the base URI for the Sesame Rio parser. The following snippet:

  StatementCollector collector = new StatementCollector();
  RDFParser parser = RDFParserSuppliers.defaultRdfParser().get(collector);
  String baseUri = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106";
  // The object <> is an empty relative IRI, so it is resolved against baseUri.
  parser.parse(new StringReader("<uri:subject> <uri:pred> <> ."), baseUri);
  RDFWriter writer = RDFWriterRegistry.getInstance().get(RDFFormat.TURTLE).getWriter(System.out);
  writer.startRDF();
  for (Statement st : collector.getStatements()) {
      writer.handleStatement(st);
  }
  writer.endRDF();

will interpret the Turtle:

  <uri:subject> <uri:pred> <> .

as:

  <uri:subject> <uri:pred> <https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106> .

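For the cleanup, it may help to recognise such leaked object values and recover the entity and revision they came from. The following is only a minimal sketch, assuming the leaked values always have exactly the shape of the base URI quoted above; the class `LeakedBaseUri` and its `parse` method are hypothetical names and not part of wdqs-updater.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of wdqs-updater): recognises an object value
// that is really the updater's base URI and extracts entity ID and revision.
final class LeakedBaseUri {
    // Assumed shape, copied from the base URI quoted in this comment.
    private static final Pattern LEAKED = Pattern.compile(
        "^https://www\\.wikidata\\.org/wiki/Special:EntityData/"
        + "([A-Z]\\d+)\\.ttl\\?flavor=dump&revision=(\\d{1,18})$");

    final String entityId;
    final long revision;

    private LeakedBaseUri(String entityId, long revision) {
        this.entityId = entityId;
        this.revision = revision;
    }

    // Returns the parsed entity/revision pair, or empty if the value
    // does not look like a leaked base URI.
    static Optional<LeakedBaseUri> parse(String objectUri) {
        Matcher m = LEAKED.matcher(objectUri);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new LeakedBaseUri(m.group(1), Long.parseLong(m.group(2))));
    }

    public static void main(String[] args) {
        String leaked = "https://www.wikidata.org/wiki/Special:EntityData/"
            + "Q15066632.ttl?flavor=dump&revision=1205546106";
        // A leaked value yields the entity and the revision to re-fetch.
        parse(leaked).ifPresent(l ->
            System.out.println(l.entityId + " @ revision " + l.revision));
        // A regular entity URI is not flagged.
        System.out.println(parse("http://www.wikidata.org/entity/Q5").isPresent());
    }
}
```

Run against the `?c` values returned by the query, this would give the list of entities and revisions that need to be checked against the 2020-06-12 window and re-synced.
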
On the PHP side, the notice mentioned in T255282 <https://phabricator.wikimedia.org/T255282> indicates that `trim() expects parameter 1 to be string, object given`. `trim` returns NULL when given an object, and that NULL is then passed to `\Wikimedia\Purtle\RdfWriter::is( $base, $local )`, which outputs `<>` when given NULL for both args:

  $writer = new \Wikimedia\Purtle\TurtleRdfWriter();
  $writer->start();
  $writer->about( "uri", "subject" );
  $writer->say( "uri", "predicate" )->is( null );
  $writer->finish();
  print( $writer->drain() );

will output:

  uri:subject uri:predicate <> .

It is very probable that the notices seen in T255282 <https://phabricator.wikimedia.org/T255282> have caused such triples to be written in the ttl output of Special:EntityData.

TASK DETAIL
  https://phabricator.wikimedia.org/T255657

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/

To: dcausse
Cc: VladimirAlexiev, dcausse, Nikki, Lucas_Werkmeister_WMDE, Aklapper, Epidosis, CBogen, Akuckartz, darthmon_wmde, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331