Addshore added a comment.

  In T214362#5324623 <https://phabricator.wikimedia.org/T214362#5324623>, 
@daniel wrote:
  
  > Moved to the RFC backlog for improvement after discussion at the TechCom 
meeting. The proposed functionality seems sensible enough, but this ticket is 
lacking information about system design that is needed to make this viable as 
an RFC.
  > Most importantly, the proposal assumes the existence of a "more permanent 
storage solution" which is not readily available. This would have to be created.
  
  I guess the closest thing we currently have to that would be the parser cache 
system backed by MySQL.
  
  > Which raises a number of questions, like:
  >
  > - what volume of data do you expect that store to hold?
  
  I can't talk in terms of bytes right now, but we can add a bit of tracking 
to our current cache to figure out an average entry size and extrapolate a 
rough total size from that, if that's what we want.
  If we are talking about the number of entries, this would roughly line up 
with the number of Wikidata entities, which is currently around 58 million.
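
  As a purely illustrative back-of-envelope figure (the per-entry size here is 
an assumed number, not a measurement): if an average serialized result set 
were around 2 KB, then 58,000,000 entries × 2 KB ≈ 116 GB in total.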
  
  > - should data ever be evicted? Does it *have* to be evicted?
  
  It does not *have* to be "evicted", but there will be situations where it is 
detected to be out of date and thus regenerated.
  
  > - how bad is it if we lose some data unexpectedly?
  
  Not very bad; everything can and will be regenerated, but that takes time.
  
  > How bad is it for all the data to become unavailable?
  
  Unavailable or totally lost?
  Unavailable for a short period of time would not be critical.
  Unavailable for longer periods of time could have knock-on effects on other 
services, such as WDQS not being able to update fully once T201147 
<https://phabricator.wikimedia.org/T201147> is complete, but I'm sure whatever 
update code is created would be able to handle such a situation.
  
  Totally losing all of the data would be pretty bad; regenerating it for all 
entities at a regular pace would probably take an extreme amount of time.
  
  > - what's the read/write load?
  
  Write load, once the job is fully deployed, would roughly match the Wikidata 
edit rate, but limited / controlled by the job queue rate for 
"constraintsRunCheck".
  This can be guesstimated at 250-750 per minute max, but there will also be 
de-duplication of edits to the same page to account for.
  If more exact numbers are required, we can have a go at figuring that out.
  Currently the job is only running on 25% of edits.
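
  For scale, simple arithmetic on that guesstimate: 250-750 writes per minute 
works out to roughly 4-13 writes per second at peak, before de-duplication.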
  
  The read rate can currently be seen at 
https://grafana.wikimedia.org/d/000000344/wikidata-quality?panelId=6&fullscreen&orgId=1
  On top of this, the WDQS updaters would also need this data once it is 
generated.
  This would either be via an HTTP API request, which would likely hit the 
storage, or it could possibly be sent via some event queue.
  
  > - what are the requirements for cross-DC replication?
  
  Having the data accessible from both DCs (for the DC failover case) should be 
a requirement.
  
  > - what transactional consistency requirements exist?
  
  No particularly strict requirements here.
  If we write to the store, we would love for the value to be readable within 
the next second.
  Writes to a single key will not happen very close together; there will 
probably be multiple seconds between them.
  Interaction between keys and the order in which writes are committed to the 
store aren't really important.
  
  > - what's the access pattern? Is a plain K/V store sufficient, or are other 
kinds of indexes/queries needed?
  
  Just a plain K/V store.
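
  To illustrate, here is a minimal sketch of the intended access pattern 
against MediaWiki's generic BagOStuff key/value interface; the key layout and 
the constraint-checker call are hypothetical placeholders, not a settled 
design:

    use MediaWiki\MediaWikiServices;

    // Obtain a generic key/value store; which technology backs it
    // (MySQL, Cassandra, ...) is exactly what is under discussion here.
    $store = MediaWikiServices::getInstance()->getMainObjectStash();

    // One entry per entity, keyed by entity ID (hypothetical key layout).
    $key = $store->makeKey( 'constraint-check-results', $entityId );

    // Write: store the serialized results after the job has run.
    $store->set( $key, $serializedResults );

    // Read: fetch the latest results, regenerating on a miss.
    $results = $store->get( $key );
    if ( $results === false ) {
        // Missing or evicted: re-run the checks (hypothetical call)
        // and write the fresh results back.
        $results = $checker->checkEntity( $entityId );
        $store->set( $key, $results );
    }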
  
  > Also, do you have a specific storage technology in mind? In discussions 
about this, Cassandra seems to regularly pop up, but it's not in the proposal. 
As far as I know, there is currently no good way to access Cassandra directly 
from MW core (no abstraction layer, but apparently also no decent PHP driver 
at all, and IIRC there are also issues with network topology).
  
  For the technology, we don't have any particular preferences; whatever works 
for the WMF, ops, and TechCom.
  Ideally it would be something that we can get access to and start working 
with sooner rather than later.
  
  > I was hoping for @Joe and @mobrovac to ask more specific questions, but 
they are both on vacation right now. Perhaps get together with them to hash out 
a proposal when they are back.
  
  More than happy to try and hash this out a bit more in this ticket before 
passing it back to a TechCom meeting.
  It'd be great to make some progress here in the coming month.

TASK DETAIL
  https://phabricator.wikimedia.org/T214362
