[Wikidata-bugs] [Maniphest] [Commented On] T107595: [RFC] Multi-Content Revisions

daniel Wed, 04 May 2016 10:42:05 -0700

daniel added a comment.

  In https://phabricator.wikimedia.org/T107595#2264799, @GWicke wrote:

  > It is not entirely clear to me whether PageUpdater (and RevisionUpdater) 
are meant to only handle synchronous low-level updates, or whether they are 
meant to orchestrate asynchronous change propagation as well. I would suggest 
focusing PageUpdater and RevisionUpdater on synchronous / low-level updates 
only, and leave asynchronous change propagation to EventBus / the change 
propagation service.

  RevisionUpdater/RevisionBuilder operates on the same level as Revision: no 
secondary data, no notifications. Just storage.

  PageUpdater would operate on the same level as WikiPage, but I think we 
should first get RevisionBuilder working, and leave PageUpdater as it is, for 
now.

  In any case, the PageUpdater / WikiPage code needs to trigger notifications 
(produce events). I don't care what mechanism it uses for that. Or rather: I'm 
very happy if we get a generalized mechanism. We'll have to agree on some kind 
of schema for revisions, slots, and blobs, but that should be easy enough.

  >> The blob store is (potentially) content-addressable, so the same blob may be 
used for different revisions of different pages.
  > 
  > Blob sharing would complicate your storage significantly, as you'd either 
have to forgo deleting content forever (very expensive for something like HTML 
renders), 
  >  or incur significant complexity of implementing an atomic reference 
counting scheme.

  I have pushed back the //derived slots// in my mind until we have the 
//primary// slots working. I agree that for "volatile" data, we'd not want to 
use content-addressable blobs, for the reason you mentioned.

  > For textual content, I am pretty certain that sharing is rare, and the 
complexity would overall be a loss in performance and reliability.

  Sharing between different pages is probably rare, but:

  >> Even for blobs that have an incremental ID (e.g. using the current text 
table storage mechanism), the same blob would frequently be used for multiple 
revisions of the same page.

  Blobs would typically be shared by different revisions of the //same// page. 
This happens every time one primary slot is edited, but another is not changed. 
E.g. the free wikitext description of a file is edited, but the structured data 
isn't (or vice versa). Or the quality assessment data of an article is updated, 
but the article text isn't edited. In both cases, one of the blobs would be 
re-used by the new revision. I think this will actually be more common than 
editing all primary streams at once.
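
  A minimal sketch of this kind of sharing, in Python for brevity rather than 
MediaWiki's PHP. The class, the function, and the slot names "main" and 
"structured" are made up for illustration and are not part of the proposal; 
the SHA-1 addressing is just one possible content-addressing scheme.

```python
import hashlib


class ContentAddressedBlobStore:
    """Toy in-memory store: a blob's address is the SHA-1 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = "sha1:" + hashlib.sha1(data).hexdigest()
        # Storing identical content twice is a no-op: same bytes, same address.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


def new_revision(store, parent_slots, changed_slots):
    """Build the slot -> blob address map of a new revision.

    Slots that are not in changed_slots keep the parent's blob address,
    so their blobs are shared between the two revisions.
    """
    slots = dict(parent_slots)
    for name, data in changed_slots.items():
        slots[name] = store.put(data)
    return slots


store = ContentAddressedBlobStore()

# First revision of a file page: free-text description plus structured data.
rev1 = new_revision(store, {}, {
    "main": b"A photo of a cat.",
    "structured": b'{"depicts": "cat"}',
})

# Second revision: only the description is edited.
rev2 = new_revision(store, rev1, {"main": b"A photo of a sleeping cat."})
```

  Here rev2 points at a new blob for "main" but at the same blob as rev1 for 
"structured", so the store holds three blobs for two revisions.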

  > How would a dumb blob store figure out which content belongs to the same 
page (and is thus similar), if all it has is the content & some metadata, but 
not the page id, title, revision & render UUID? This is the same design issue 
that plagues ExternalStore, and something we addressed in RESTBase. With 
large-window compression algorithms like brotli, we are getting down to 2-3% of 
the input HTML size (see https://phabricator.wikimedia.org/T122028). Without 
this locality information, you are likely to use an order of magnitude more 
storage as you are forgoing efficient delta compression.

  This is a good point. Once again, we want our abstraction to be a bit leaky, 
to allow for optimizations.

  I haven't thought this through yet, but my inclination is that we could 
associate a metadata array (k/v set) with the blob, which could include things 
like a hash and the page title. A BlobStore would be free to use this or not, 
to store it or not, and to make it retrievable or not.
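
  A rough sketch of how such optional hints could look, again in Python. 
Everything here is an assumption made for illustration: the put()/get() 
signatures, the "page_title" key, and the use of zlib preset dictionaries as a 
stand-in for real delta compression. One store ignores the hints, the other 
uses the page title to compress a blob against the previous blob of the same 
page.

```python
import zlib
from typing import Dict, Optional


class PlainBlobStore:
    """Accepts hints but ignores them entirely."""

    def __init__(self):
        self._blobs = {}
        self._next_id = 0

    def put(self, data: bytes, hints: Optional[Dict[str, str]] = None) -> str:
        self._next_id += 1
        address = f"plain:{self._next_id}"
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


class LocalityAwareBlobStore:
    """Uses the (hypothetical) 'page_title' hint to find a similar blob."""

    def __init__(self):
        self._records = {}  # address -> (compressed bytes, base address or None)
        self._latest = {}   # page title -> address of the newest blob for it
        self._next_id = 0

    def put(self, data: bytes, hints: Optional[Dict[str, str]] = None) -> str:
        title = (hints or {}).get("page_title")
        base = self._latest.get(title) if title is not None else None
        if base is not None:
            # Use the previous blob of the same page as a preset dictionary.
            compressor = zlib.compressobj(zdict=self.get(base))
        else:
            compressor = zlib.compressobj()
        payload = compressor.compress(data) + compressor.flush()
        self._next_id += 1
        address = f"local:{self._next_id}"
        self._records[address] = (payload, base)
        if title is not None:
            self._latest[title] = address
        return address

    def get(self, address: str) -> bytes:
        payload, base = self._records[address]
        # Toy implementation: the chain of bases is followed recursively
        # and never shortened.
        if base is not None:
            decompressor = zlib.decompressobj(zdict=self.get(base))
        else:
            decompressor = zlib.decompressobj()
        return decompressor.decompress(payload) + decompressor.flush()

    def stored_size(self, address: str) -> int:
        return len(self._records[address][0])


base_text = " ".join(f"word{(i * 7919) % 1000}" for i in range(300)).encode()
edited_text = base_text + b" One more sentence."

store = LocalityAwareBlobStore()
first = store.put(base_text, {"page_title": "Cat"})
second = store.put(edited_text, {"page_title": "Cat"})
```

  Callers pass the same hints to either store; only the second one takes 
advantage of them, which is the kind of leak in the abstraction meant above.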

  > I am generally trying to work out how RevisionContentLookup would work for 
use cases like fetching HTML from RESTBase. Some notes / questions:
  > 
  > - In addition to title and revision (which I assume remains an integer), 
we'd need an optional v1 UUID parameter to retrieve specific renders, in both 
the request & response interfaces.

  I have thought about this, too. My solution is to encode this in the slot 
name. So you could have an html.canonical (sub)slot, and a 
html.29e68f78-8765-49f8-86d5-dfc438d459fe, or html.en, or whatever.
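
  To illustrate, a small Python sketch of parsing such slot names. The 
SlotName type, the function names, and the rule of splitting at the first dot 
are assumptions for the sake of the example, not a settled format.

```python
from typing import NamedTuple, Optional


class SlotName(NamedTuple):
    role: str                # e.g. "html"
    variant: Optional[str]   # e.g. "canonical", a render UUID, or "en"


def parse_slot_name(name: str) -> SlotName:
    """Split 'role.variant' at the first dot; the variant part is optional."""
    role, sep, variant = name.partition(".")
    if not role or (sep and not variant):
        raise ValueError(f"malformed slot name: {name!r}")
    return SlotName(role, variant if sep else None)


def format_slot_name(slot: SlotName) -> str:
    """Inverse of parse_slot_name()."""
    if slot.variant is None:
        return slot.role
    return f"{slot.role}.{slot.variant}"


canonical = parse_slot_name("html.canonical")
render = parse_slot_name("html.29e68f78-8765-49f8-86d5-dfc438d459fe")
plain = parse_slot_name("html")
```

  A lookup could then treat the variant as the render identifier when it is 
present, and fall back to some default sub-slot when it is not.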

  > - Will getTouched() return the UUID timestamp of a specific render 
(last-modified, essentially), or is this about page_touched? Also, should we 
expose UUIDs to make sure that we have a unique ID with a high-resolution 
timestamp?

  getTouched() will return the touch date of the slot. For primary slots, this 
will always be the revision (edit) timestamp. For derived slots, it would be 
the time that slot was last updated [I'd love to use a logical clock for this, 
instead of wall clock time...].
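
  Sketched in Python, with a hypothetical Slot class (the field names and the 
fallback for a derived slot that has never been updated are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Slot:
    name: str
    derived: bool
    revision_timestamp: datetime
    # Only meaningful for derived slots; None until the first update.
    last_updated: Optional[datetime] = None

    def get_touched(self) -> datetime:
        # Derived slots report their own last update time.
        if self.derived and self.last_updated is not None:
            return self.last_updated
        # Primary slots always report the revision (edit) timestamp.
        # Assumption: so does a derived slot that was never updated.
        return self.revision_timestamp


edit_time = datetime(2016, 5, 4, 17, 42, 5, tzinfo=timezone.utc)
render_time = datetime(2016, 5, 5, 9, 0, 0, tzinfo=timezone.utc)

main = Slot("main", derived=False, revision_timestamp=edit_time)
html = Slot("html.canonical", derived=True, revision_timestamp=edit_time,
            last_updated=render_time)
```

  Both values here are wall clock times; a logical clock would replace the 
datetime fields without changing the shape of the method.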

  I'd expose URLs. Their format would be left to the blob store. Could be a 
UUID.

  > - For content from RESTBase, read restrictions are always enforced as part 
of the API request. No information about the applied restrictions is returned. 
In this context, getReadRestrictions() would basically always return the empty 
set.

  That's fine. getReadRestrictions() tells MediaWiki to enforce restrictions. 
If the restrictions are enforced "further down", no problem.

TASK DETAIL
  https://phabricator.wikimedia.org/T107595

To: daniel
Cc: Glaisher, JJMC89, RobLa-WMF, Yurik, ArielGlenn, APerson, TomT0m, Krenair, 
intracer, Tgr, Tobi_WMDE_SW, Addshore, Lydia_Pintscher, cscott, PleaseStand, 
awight, Ricordisamoa, GWicke, MarkTraceur, waldyrious, Legoktm, Aklapper, 
Jdforrester-WMF, Ltrlg, brion, Spage, MZMcBride, daniel, D3r1ck01, Izno, 
Luke081515, Wikidata-bugs, aude, jayvdb, fbstj, Mbch331, Jay8g, bd808
