Dear Laurence,

Thank you for your prompt and insightful response!
Your answers will certainly be very helpful for our team.

Best Regards,

Elton F. de S. Soares
Advisory Software Engineer
Rio de Janeiro, RJ, Brazil
IBM Research
E-mail: [email protected]<mailto:[email protected]>


From: Laurence Parry <[email protected]>
Date: Thursday, 1 February 2024 at 20:43
To: Wikibase Community User Group <[email protected]>, 
[email protected] <[email protected]>
Cc: Renato F Maia <[email protected]>, Guilherme Lima 
<[email protected]>, Leonardo Guerreiro Azevedo <[email protected]>, Marcelo 
O C Machado <[email protected]>, Joao Marcello Bessa Rodrigues 
<[email protected]>, Raphael Melo Thiago <[email protected]>, Elton 
Figueiredo de Souza Soares <[email protected]>
Subject: [EXTERNAL] Re: Wikibase/Wikidata Database Technologies and Strategies
Dear Elton (and others),

Wikibase uses the main MediaWiki database (which is normally MySQL/MariaDB but 
may be PostgreSQL or SQLite - see Special:Version) to store data about 
entities. They are stored as JSON blobs in a custom slot type as the primary 
content of the pages in certain namespaces. Examples of a particular entity:
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json (similar to what Wikibase might store)
https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.ttl (close to the triples that WDQS might consume - it might prefer the RDF form)
https://furry.wikibase.cloud/wiki/Item:Q4 (web UI)
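
To make this concrete, here is a minimal Python sketch (assuming the requests library; the endpoint shape is the same on any Wikibase instance) that fetches the JSON form of the entity above and lists its claims:

    # Sketch: fetch an entity's JSON from the Special:EntityData endpoint
    # and walk its claims. Assumes the `requests` library is installed.
    import requests

    url = "https://furry.wikibase.cloud/wiki/Special:EntityData/Q4.json"
    data = requests.get(url, timeout=30).json()

    entity = data["entities"]["Q4"]
    print(entity["labels"].get("en", {}).get("value"))  # English label, if any
    for prop, statements in entity["claims"].items():
        print(prop, len(statements))  # property ID and statement count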

The storage system is quite complicated because MediaWiki has come to store 
many different types of content and many revisions of it, but see
https://mediawiki.org/wiki/Manual:Database_layout
https://mediawiki.org/wiki/Multi-Content_Revisions
https://mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema
https://mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data
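
As a rough illustration only - this is an internal schema, not a supported interface, and the exact tables vary by MediaWiki version - here is a Python/SQL sketch of how the current JSON blob of an item page might be located (connection credentials are hypothetical):

    # Sketch: locate the current content blob of an item page via the
    # Multi-Content Revisions tables. Table/column names follow the manual
    # linked above but are internal and version-dependent; illustrative only.
    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                           database="wikibase")  # hypothetical credentials
    sql = """
    SELECT c.content_address
    FROM page p
    JOIN revision r ON r.rev_id = p.page_latest
    JOIN slots s ON s.slot_revision_id = r.rev_id
    JOIN content c ON c.content_id = s.slot_content_id
    WHERE p.page_namespace = 120  -- Item namespace on a default Wikibase install
      AND p.page_title = 'Q4'
    """
    with conn.cursor() as cur:
        cur.execute(sql)
        print(cur.fetchone())  # e.g. ('tt:12345',) - an address into the text table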

Wikibase also uses secondary storage in the form of tables containing what might be seen as views or indexes on the JSON blobs in the pages that constitute primary storage. It is updated in a deferred manner. See the schemas at
https://mediawiki.org/wiki/Wikibase/Schema
and
https://doc.wikimedia.org/Wikibase/master/php/docs_topics_storage.html
and the explanation at
https://doc.wikimedia.org/Wikibase/master/php/docs_storage_terms.html
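
For example, labels end up normalised across the wbt_* term-store tables described there. A hedged sketch of looking up an item's English label directly - again an internal schema, shown only to illustrate the "index on the blobs" idea, with hypothetical credentials:

    # Sketch: read an item's English label straight from the secondary
    # term-store tables (wbt_*). Internal schema; illustrative only.
    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                           database="wikibase")  # hypothetical credentials
    sql = """
    SELECT wbx_text
    FROM wbt_item_terms
    JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
    JOIN wbt_type ON wby_id = wbtl_type_id
    JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
    JOIN wbt_text ON wbx_id = wbxl_text_id
    WHERE wbit_item_id = 4  -- the numeric part of Q4
      AND wby_name = 'label'
      AND wbxl_language = 'en'
    """
    with conn.cursor() as cur:
        cur.execute(sql)
        print(cur.fetchone())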

(Wikimedia Commons uses a secondary slot to store a separate Mediainfo entity 
type: 
https://mediawiki.org/wiki/Extension:WikibaseMediaInfo -
https://mediawiki.org/wiki/Extension:WikibaseMediaInfo/Data_Model/JSON )

A separate graph database called Blazegraph is used as the storage component 
for the Wikidata Query Service (WDQS), a Java-centric system which consists of 
repository, updater and proxy components, plus a web front end.
https://mediawiki.org/wiki/Wikidata_Query_Service/User_Manual
https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service
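
For example, a SPARQL query is just an HTTP request to the service's endpoint. A small Python sketch against the public Wikidata endpoint (a self-hosted Wikibase exposes its own endpoint, whose path varies by setup):

    # Sketch: run a SPARQL query against a WDQS endpoint over HTTP.
    import requests

    endpoint = "https://query.wikidata.org/sparql"
    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .   # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 5
    """
    resp = requests.get(endpoint,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "wikibase-storage-example/0.1"},
                        timeout=60)
    for row in resp.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])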

The updater either reads the MediaWiki recent changes feed to identify changed pages and then retrieves their triples from the entity data endpoints (there is a link for this in the source of the main page as well; these triple view formats are provided by the Wikibase extension), or it is fed changes through the Kafka-mediated Flink streaming updater used on Wikidata:
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
(I am not sure whether others have implemented this - the distribution's updater uses recent changes.)
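
In outline, the recent-changes-driven cycle boils down to something like the following Python sketch (illustrative only: the real updater batches changes, tracks its position in the feed, and handles deletions and redirects):

    # Sketch: poll the MediaWiki API for recently changed item pages, then
    # re-fetch each changed entity's triples from Special:EntityData.
    import requests

    wiki = "https://furry.wikibase.cloud"
    rc = requests.get(f"{wiki}/w/api.php", params={
        "action": "query", "list": "recentchanges",
        "rcnamespace": 120,  # Item namespace on a default Wikibase install
        "rclimit": 10, "format": "json",
    }, timeout=30).json()

    for change in rc["query"]["recentchanges"]:
        qid = change["title"].split(":")[-1]  # e.g. "Item:Q4" -> "Q4"
        ttl = requests.get(f"{wiki}/wiki/Special:EntityData/{qid}.ttl",
                           timeout=30).text
        # ...here the updater would replace the entity's old triples in the store
        print(qid, len(ttl), "bytes of turtle")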

Some other data may be cached within Wikibase using MediaWiki's caching infrastructure, but the most common caching is of rendered pages. Perhaps a developer who knows the details could chime in regarding your question; failing that, the source code is available, if complex.

Data may also be indexed in an attached Elasticsearch (Cirrus) index which, if provisioned, can hook into Wikibase and WDQS to enable search features that in some cases are more efficient than a Blazegraph query, or that access data not stored in triples:
https://mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
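
For instance, the haswbstatement: search keyword the extension adds can find items bearing a given statement through the ordinary search API instead of SPARQL. A small sketch against wikidata.org (works on any Wikibase with the extension enabled):

    # Sketch: use the `haswbstatement:` keyword from WikibaseCirrusSearch
    # through the standard search API, as an alternative to a SPARQL query.
    import requests

    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "query", "list": "search",
        "srsearch": "haswbstatement:P31=Q146",  # instance of: house cat
        "srlimit": 5, "format": "json",
    }, timeout=30)
    for hit in resp.json()["query"]["search"]:
        print(hit["title"])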

Otherwise, yes, WDQS is the main complex query interface (though Blazegraph might not be the only store for it in the future, since its lack of ongoing external support makes that problematic - the team was bought out by AWS to create Neptune).

There is to my knowledge no pagination of statements, in the sense that you 
access all statements at once even if they may be consumed one at a time within 
e.g. Lua. This is why accessing an entity is considered expensive.
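
To illustrate, the Action API's wbgetclaims module returns every statement of an entity in one response; as far as I know there is no continuation parameter for paging through them:

    # Sketch: wbgetclaims returns all of an entity's statements at once.
    import requests

    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetclaims", "entity": "Q42", "format": "json",
    }, timeout=30)
    claims = resp.json()["claims"]
    print(sum(len(v) for v in claims.values()), "statements across",
          len(claims), "properties")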

For this reason it is an inefficient anti-pattern to have thousands of 
statements on a single entity, especially if editing them actively, as there 
will be a lot of serialisation going on - it is likely better to have thousands 
of finely-divided entities.

The REST API is, as far as I know, just a different way to access the main MediaWiki component (vs. the Action API) and isn't a standalone system.
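
For completeness, a sketch reading statements through it - the /wikibase/v0/ route prefix is what I believe is current at the time of writing, so check the documentation for your version:

    # Sketch: read an item's statements through the Wikibase REST API,
    # served by the same MediaWiki instance via rest.php. Route prefix
    # assumed to be /wikibase/v0/; verify against your deployment.
    import requests

    url = ("https://www.wikidata.org/w/rest.php"
           "/wikibase/v0/entities/items/Q42/statements")
    statements = requests.get(url, timeout=30).json()
    for prop, stmts in statements.items():
        print(prop, len(stmts))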

This maybe doesn't answer all your questions but hopefully it helps.

Best regards,
--
Laurence 'GreenReaper' Parry
Wikibase Community User Group
https://GreenReaper.co.uk - https://wikifur.com

________________________________
From: Elton Figueiredo de Souza Soares via Wikibase Community User Group 
<[email protected]>
Sent: Thursday, February 1, 2024 10:20:10 pm
To: [email protected] <[email protected]>; 
[email protected] <[email protected]>
Cc: Renato F Maia <[email protected]>; Guilherme Lima 
<[email protected]>; Leonardo Guerreiro Azevedo <[email protected]>; Marcelo 
O C Machado <[email protected]>; Joao Marcello Bessa Rodrigues 
<[email protected]>; Raphael Melo Thiago <[email protected]>; Elton 
Figueiredo de Souza Soares <[email protected]>
Subject: [Wikibase] Wikibase/Wikidata Database Technologies and Strategies

Dear Wikibase/Wikidata Community,

We are trying to understand which database technologies and strategies 
Wikibase/Wikidata uses for storing, updating, and querying the data (knowledge) 
it manipulates.

By looking at the documentation <https://wmde.github.io/wikidata-wikibase-architecture/assets/img/03-dataflow-out.drawio.17c12ee9.svg> we understood that RDF is only used for the Wikidata Query Service, but we could not find out exactly how Wikibase/Wikidata stores the information that is translated to RDF during the data dump.

More specifically, we understood that a MySQL (or is it MariaDB?) relational database is used as the key persistence component for most Wikibase/Wikidata services, and that the information maintained in this database is periodically exported to multiple formats, including RDF.

In addition, looking at the relational database schema published in the documentation <https://www.mediawiki.org/wiki/Manual:Database_layout>, we could not locate tables that map easily to the Wikibase Data Model <https://www.mediawiki.org/wiki/Wikibase/DataModel>.
Thus, we hypothesize that there is some software component (Wikibase Common Data Access?) that dynamically translates the data contained in those tables into Statements, Entities, etc. Is that hypothesis correct?
If yes, does this software component use any intermediate storage mechanism for caching those Statements, Entities, ...? Or are those translations always performed at runtime, on the fly (be it for querying, adding, or updating Statements, Entities, …)?

Finally, we would like to understand more about how the Wikidata REST API <https://www.wikidata.org/wiki/Wikidata:REST_API> is implemented:

• In which database are the statements added/retrieved through it stored? Are they stored in the central MySQL database or in another database?

• Does it have any support for pagination of statements? For example, if an item has many statements associated with a property, does the API assume that both the underlying database and the network will support the retrieval of all those statements?

• Are you currently considering implementing support for more flexible querying of statements, or has that requirement been fully delegated to the Wikidata Query Service?

If there is updated documentation that could help us answer these questions, could you kindly point us to it? Otherwise, would you be able to share this information with us?

Best Regards,



Elton F. de S. Soares
Advisory Software Engineer
Rio de Janeiro, RJ, Brazil
IBM Research
E-mail: [email protected]<mailto:[email protected]>



_______________________________________________
Wikidata-tech mailing list -- [email protected]
To unsubscribe send an email to [email protected]
