dcausse moved this task from Waiting to In Progress on the Discovery-Search
(Current work) board.
dcausse added a comment.
All the revisions I manually checked were created on the same day, 2020-06-12,
before mw1384 was depooled. I'm trying to extract a full list from one server,
but I'm having a hard time getting blazegraph to answer without failing:
select ?s ?c ?date {
  ?s wdt:P31 ?c .
  FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
  ?s schema:dateModified ?date .
} limit XX
where XX=20 is the largest limit blazegraph can answer without failing with
`com.bigdata.rwstore.sector.MemoryManagerOutOfMemory` on `wdqs1010`.
Moving back to in progress as the journals of most servers seem to be corrupted
with such data and we need to either confirm or discard T255282
<https://phabricator.wikimedia.org/T255282> (a single occurrence of this
incoherence for a revision not created on 2020-06-12 before 17:00 UTC would
discard this possibility) but, more importantly, clean up the data.
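The confirm-or-discard test above boils down to checking each affected revision's timestamp against the incident window. A minimal sketch of that check, assuming the window runs from the start of 2020-06-12 UTC up to (but not including) 17:00 UTC; the class and method names are hypothetical, and only the 17:00 UTC cut-off comes from this comment:

```java
import java.time.Instant;

final class IncidentWindow {
    // Assumption: the window opens at the start of the day in UTC.
    static final Instant START = Instant.parse("2020-06-12T00:00:00Z");
    // Cut-off mentioned in the comment above (exclusive here by assumption).
    static final Instant END = Instant.parse("2020-06-12T17:00:00Z");

    // True when the revision timestamp falls inside the incident window.
    static boolean isInside(Instant revisionCreated) {
        return !revisionCreated.isBefore(START) && revisionCreated.isBefore(END);
    }

    public static void main(String[] args) {
        // One timestamp inside the assumed window, one on the following day.
        System.out.println(isInside(Instant.parse("2020-06-12T09:30:00Z")));
        System.out.println(isInside(Instant.parse("2020-06-13T08:00:00Z")));
    }
}
```

A single affected revision for which `isInside` returns false would be enough to discard the T255282 hypothesis.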
Regarding how this could happen from the wdqs-updater perspective:
When parsing the item RDF data, the updater uses this URI construct
`https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106`
as the baseURI for the sesame RIO parser.
// Parse a triple whose object is the empty relative IRI <>, using baseUri as the base.
StatementCollector collector = new StatementCollector();
RDFParser parser = RDFParserSuppliers.defaultRdfParser().get(collector);
String baseUri = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106";
parser.parse(new StringReader("<uri:subject> <uri:pred> <> ."), baseUri);

// Write the collected statements back out as Turtle.
RDFWriter writer = RDFWriterRegistry.getInstance().get(RDFFormat.TURTLE).getWriter(System.out);
writer.startRDF();
for (Statement st : collector.getStatements()) {
    writer.handleStatement(st);
}
writer.endRDF();
This will interpret the Turtle:
<uri:subject> <uri:pred> <> .
as
<uri:subject> <uri:pred> <https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106> .
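The behaviour shown here matches the generic reference-resolution rule in RFC 3986, section 5.2.2: an empty relative reference keeps the base URI's scheme, authority, path and query, and drops only the fragment. The sketch below is a hand-written illustration of that rule using only the JDK, not the parser's actual implementation; the class and method names are hypothetical. It also shows why the resolved object is caught by the `STRSTARTS` filter in the query earlier in this comment.

```java
final class EmptyReference {
    // Prefix used by the SPARQL filter earlier in this comment.
    static final String ENTITY_DATA_PREFIX =
        "https://www.wikidata.org/wiki/Special:EntityData";

    // RFC 3986 section 5.2.2: resolving the empty reference against a base
    // yields the base itself, minus any fragment.
    static String resolveEmptyReference(String baseUri) {
        int fragmentStart = baseUri.indexOf('#');
        return fragmentStart >= 0 ? baseUri.substring(0, fragmentStart) : baseUri;
    }

    // Same string test as STRSTARTS(STR(?c), "...Special:EntityData").
    static boolean looksLikeLeakedBaseUri(String objectIri) {
        return objectIri.startsWith(ENTITY_DATA_PREFIX);
    }

    public static void main(String[] args) {
        String base = "https://www.wikidata.org/wiki/Special:EntityData/"
            + "Q15066632.ttl?flavor=dump&revision=1205546106";
        String resolved = resolveEmptyReference(base);
        System.out.println(resolved);
        System.out.println(looksLikeLeakedBaseUri(resolved));
    }
}
```

Because the query string survives resolution, the revision ID stays embedded in the stored object, which is what makes it possible to trace each bad triple back to a specific revision.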
On the PHP side, the notice mentioned in T255282
<https://phabricator.wikimedia.org/T255282> indicates that `trim() expects
parameter 1 to be string, object given`.
`trim` returns NULL when given an object, and that NULL is then passed to
`\Wikimedia\Purtle\RdfWriter::is( $base, $local )`, which outputs `<>` when
given NULL for both args:
$writer = new \Wikimedia\Purtle\TurtleRdfWriter();
$writer->start();
$writer->about( "uri", "subject" );
$writer->say( "uri", "predicate" )->is( null );
$writer->finish();
print( $writer->drain() );
will output:
uri:subject uri:predicate <> .
It is very probable that the notices seen in T255282
<https://phabricator.wikimedia.org/T255282> have caused such triples to be
written in the ttl output of Special:EntityData.
TASK DETAIL
https://phabricator.wikimedia.org/T255657