[Wikidata-bugs] [Maniphest] [Updated] T107595: [RFC] Multi-Content Revisions

GWicke Wed, 04 May 2016 10:04:05 -0700

GWicke added a comment.


  > Where do I propose another mechanism for change propagation? The 
PageUpdater would do exactly what Revision does now: schedule DataUpdates.
  
  EventBus & the change propagation service are moving away from scheduling 
"jobs", and towards an event processing approach based on Kafka. In this model, 
subscribers react to change events associated with resources. Event production 
and event processing / consumption are decoupled and decentralized.
  
  PageUpdater (and RevisionUpdater) as proposed seem to be moving in the 
opposite direction, towards more jobs & away from event processing.
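
  A minimal sketch of the event processing model described above, assuming an 
in-memory append-only log as a stand-in for a Kafka topic. EventLog, Consumer 
and the event fields are hypothetical names for illustration, not the EventBus 
or change propagation API. The producer only appends; each consumer tracks its 
own offset and reacts independently.

```python
class EventLog:
    """Append-only list of events; a toy stand-in for one Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # The producer knows nothing about who will consume the event.
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Tracks its own read position, independent of other consumers."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self):
        events = self.log.read_from(self.offset)
        for event in events:
            self.handler(event)
        self.offset += len(events)
        return len(events)


log = EventLog()
purged, rerendered = [], []

# Two independent subscribers reacting to the same change events.
cache_purger = Consumer(log, lambda e: purged.append(e["uri"]))
renderer = Consumer(log, lambda e: rerendered.append((e["uri"], e["rev"])))

# A producer records a resource change without scheduling any job.
log.append({"uri": "/page/Foo", "rev": 42})

cache_purger.poll()
renderer.poll()
```

  Adding a third subscriber in this sketch requires no change to the producer, 
which is the decoupling the comment refers to.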
  
  > The blob store is (potentially) content-addressable, so the same blob may be 
used for different revisions of different pages.
  
  Blob sharing would complicate your storage significantly, as you'd either 
have to forgo ever deleting content (very expensive for something like HTML 
renders), or incur the significant complexity of implementing an atomic 
reference counting scheme. For textual content, I am pretty certain that sharing 
is rare, and the complexity would be a net loss in performance and reliability.
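
  To make the reference counting cost concrete, here is a minimal 
single-process sketch of a content-addressable store with shared blobs. 
RefCountedBlobStore and its methods are hypothetical, and the use of SHA-1 as 
the content key is an assumption. A lock keeps the count and the delete atomic 
within one process; a distributed store would need an equivalent guarantee 
across nodes, which is where the complexity comes from.

```python
import hashlib
import threading


class RefCountedBlobStore:
    """Toy content-addressable store that deletes a blob at zero references."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blobs = {}  # content hash -> bytes
        self._refs = {}   # content hash -> number of referencing revisions

    def put(self, data):
        key = hashlib.sha1(data).hexdigest()
        with self._lock:
            # Store and count must change together, or a concurrent
            # release() could delete a blob that was just re-referenced.
            if key not in self._blobs:
                self._blobs[key] = data
                self._refs[key] = 0
            self._refs[key] += 1
        return key

    def get(self, key):
        with self._lock:
            return self._blobs[key]

    def release(self, key):
        """Drop one reference; return True if the blob was deleted."""
        with self._lock:
            self._refs[key] -= 1
            if self._refs[key] == 0:
                del self._blobs[key]
                del self._refs[key]
                return True
            return False


store = RefCountedBlobStore()
key_a = store.put(b"same content")  # referenced by one revision
key_b = store.put(b"same content")  # shared by a second revision
```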
  
  > Even for blobs that have an incremental ID (e.g. using the current text 
table storage mechanism), the same blob would frequently be used for multiple 
blobs of the same page.
  
  How would a dumb blob store figure out which content belongs to the same page 
(and is thus similar), if all it has is the content & some metadata, but not 
the page id, title, revision & render UUID? This is the same design issue that 
plagues ExternalStore, and something we addressed in RESTBase. With 
large-window compression algorithms like brotli, we are getting down to 2-3% of 
the input HTML size (see https://phabricator.wikimedia.org/T122028). Without 
this locality information, you are likely to use an order of magnitude more 
storage, as you are forgoing efficient delta compression.
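
  The effect of locality can be illustrated with a small sketch. It swaps in 
zlib with a preset dictionary from the Python standard library in place of 
brotli, so the window is only 32 KB and the sizes are not comparable to the 
figures above. The revision data is synthetic, and compress / decompress are 
hypothetical helper names. The point is only that a store which knows the 
previous render of the same page can encode the next one mostly as references 
to it.

```python
import hashlib
import zlib

# Deterministic, poorly compressible stand-in for a page render (synthetic).
rev1 = "".join(
    hashlib.sha256(str(i).encode()).hexdigest() for i in range(300)
).encode()

# The next render of the same page differs by one small edit.
rev2 = rev1[:9000] + b"<p>edited paragraph</p>" + rev1[9000:]


def compress(data, previous=None):
    """Compress data, optionally priming the compressor with a prior render."""
    if previous is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=previous)
    return comp.compress(data) + comp.flush()


def decompress(blob, previous):
    """Reverse compress(data, previous); needs the same prior render."""
    decomp = zlib.decompressobj(zdict=previous)
    return decomp.decompress(blob) + decomp.flush()


standalone = compress(rev2)                     # no locality information
with_locality = compress(rev2, previous=rev1)   # knows the previous render

print(len(rev2), len(standalone), len(with_locality))
```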
  
  I am generally trying to work out how RevisionContentLookup would work for 
use cases like fetching HTML from RESTBase. Some notes / questions:
  
  - In addition to title and revision (which I assume remains an integer), 
we'll need an optional v1 UUID parameter to retrieve specific renders, in both 
the request & response interfaces.
  - Will getTouched() return the UUID timestamp of a specific render 
(last-modified, essentially), or is this about page_touched? Also, should we 
expose UUIDs to make sure that we have a unique ID with a high-resolution 
timestamp?
  - For content from RESTBase, read restrictions are always enforced as part of 
the API request. No information about the applied restrictions is returned. In 
this context, getReadRestrictions() would basically always return the empty set.
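
  The UUID questions above can be sketched as follows, assuming renders are 
identified by version 1 (time-based) UUIDs as defined in RFC 4122. RenderKey 
and uuid1_timestamp are hypothetical names and not part of the proposed 
RevisionContentLookup interface. The sketch shows a lookup key with an 
optional render UUID, and how a last-modified timestamp could be derived from 
that UUID.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid1_timestamp(render_id):
    """Return the creation time embedded in a version 1 UUID, in UTC."""
    if render_id.version != 1:
        raise ValueError("expected a version 1 (time-based) UUID")
    unix_100ns = render_id.time - GREGORIAN_TO_UNIX_100NS
    # datetime has microsecond resolution, so the last digit is dropped.
    return UNIX_EPOCH + timedelta(microseconds=unix_100ns // 10)


@dataclass(frozen=True)
class RenderKey:
    """Hypothetical lookup key: title, integer revision, optional render."""

    title: str
    revision: int
    render_id: Optional[uuid.UUID] = None  # None means "latest render"

    def touched(self):
        """Timestamp of the specific render, if one was requested."""
        if self.render_id is None:
            return None
        return uuid1_timestamp(self.render_id)


key = RenderKey("Foo", 42, uuid.uuid1())
print(key.touched())
```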

TASK DETAIL
  https://phabricator.wikimedia.org/T107595

To: daniel, GWicke
Cc: Glaisher, JJMC89, RobLa-WMF, Yurik, ArielGlenn, APerson, TomT0m, Krenair, 
intracer, Tgr, Tobi_WMDE_SW, Addshore, Lydia_Pintscher, cscott, PleaseStand, 
awight, Ricordisamoa, GWicke, MarkTraceur, waldyrious, Legoktm, Aklapper, 
Jdforrester-WMF, Ltrlg, brion, Spage, MZMcBride, daniel, D3r1ck01, Izno, 
Luke081515, Wikidata-bugs, aude, jayvdb, fbstj, Mbch331, Jay8g, bd808
