Eevans added a comment.

This use case seems similar to caching Parsoid HTML, which is done in RESTbase and backed by Cassandra. It's similar because the data is re-generated upon edit and accessed by clients upon view, via an API. It's also similar in that losing this data is not absolutely critical, since it can be regenerated, but having to re-generate all of it may cause a problematic spike in load on the application servers (and the databases and the query service).

However, in contrast to the Parsoid use case, the information does not need to be stored for old revisions.

As to the model: the Wikidata folks will have the details, but as far as I'm aware, it's a JSON blob for each Wikidata entity (items, properties, etc.). Granularity could be increased to per-statement blobs.
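For illustration only, a per-entity value might look something like the following (the field names here are my guesses, not the actual output format of the constraint checks):

```
# Hypothetical shape of one per-entity value; field names are
# illustrative guesses, not the real constraint-check output format.
example_value = {
    "entityId": "Q42",
    "checkedRevision": 123456789,
    "formatVersion": 2,
    "statements": {
        "P31": [
            {"constraint": "one-of", "status": "violation"},
            {"constraint": "format", "status": "compliance"},
        ],
    },
}
```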

Purging is, as far as I know, currently only done per edit of the subject. Use cases for bulk purges do exist (in particular, when constraint definitions change), but as far as I know they are simply ignored at the moment. I could be wrong about that, though.

If I understand the above correctly, we're saying that this is strictly key/value, where the key is an entity ID and the value is an opaque JSON blob. When the subject is edited, the value is overwritten with the most recent constraint check result. And when the format of constraint definitions changes, we need to be able to bulk purge previous entries in the obsolete format. Is this correct?
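In other words, something along these lines; this is only a minimal sketch of the access pattern as I understand it, with hypothetical class and method names rather than any existing API:

```
from typing import Optional

# Minimal sketch of the store as described above; the interface is
# hypothetical, not an existing API.
class ConstraintCheckStore:
    def __init__(self) -> None:
        self._data: dict[str, dict] = {}  # entity ID -> opaque JSON blob

    def put(self, entity_id: str, blob: dict) -> None:
        # On edit of the subject, the previous value is simply overwritten.
        self._data[entity_id] = blob

    def get(self, entity_id: str) -> Optional[dict]:
        # Discrete lookup by entity ID; no secondary access path.
        return self._data.get(entity_id)

    def purge_all(self) -> None:
        # Bulk purge, e.g. when the constraint-definition format changes.
        self._data.clear()
```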

Some additional questions...

An opaque k/v store won't allow anything but discrete lookup by entity ID, so how are violations queried? In other words, this seems to be only a small part of the larger model; what does that look like, and why are we creating this separation (i.e. what problem does it solve)?

Numbers regarding the total number of entities and the size of the values will be important, of course, but perhaps most important will be some idea of the access patterns. How frequently will entities be (over)written? How often read? I realize the answer is probably a distribution, and that this may involve some educated guesswork.

What happens when constraint definitions change? Are we able to drop the older entries wholesale? Is the constraint check inlined on a miss, and is the latency (and additional load) under such circumstances acceptable? Or will some sort of transition be needed, where we fall back to the older check result when one is available and replace them gradually?
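By the latter I mean something like the following read path, using the store sketch above; again, the helper names are hypothetical and `check_constraints` just stands in for the real constraint check:

```
CURRENT_FORMAT_VERSION = 2  # assumed version marker inside each blob

def get_constraint_report(store, entity_id, check_constraints):
    blob = store.get(entity_id)
    if blob is not None and blob.get("formatVersion") == CURRENT_FORMAT_VERSION:
        return blob  # up-to-date entry, serve as-is
    if blob is not None:
        # Transitional fallback: serve the old-format entry while a
        # background job gradually recomputes all entries.
        return blob
    # Miss: run the constraint check inline and accept the extra latency.
    fresh = check_constraints(entity_id)
    store.put(entity_id, fresh)
    return fresh
```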

I'll probably have more questions.


TASK DETAIL
https://phabricator.wikimedia.org/T204024

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Eevans
Cc: Eevans, daniel, mobrovac, Jonas, Lucas_Werkmeister_WMDE, Aklapper, Addshore, Lahi, Gq86, GoranSMilovanovic, QZanden, merbst, LawExplorer, Agabi10, Hardikj, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, fgiunchedi
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs