GWicke added a comment.

  > In any case, the PageUpdater / WikiPage code needs to trigger notifications 
(produce events). I don't care what mechanism it used for that. Or rather: I'm 
very happy if we get a generalized mechanism. We'll have to agree on some kind 
of schema for revisions, slots, and blobs, but that should be easy enough.
  
  Makes sense. Thanks for the clarification!
  
  >> In addition to title and revision (which I assume remains an integer), 
we'd need an optional v1 UUID parameter to retrieve specific renders, in both 
the request & response interfaces.
  
  
  
  > I have thought about this, too. My solution is to encode this in the slot 
name. So you could have an html.canonical (sub)slot, and a 
html.29e68f78-8765-49f8-86d5-dfc438d459fe, or html.en, or whatever.
  
  Hmmm, this sounds like a rather ugly hack. I thought the 'slot' identifies 
the kind of content, rather than being a general-purpose string that carries 
otherwise missing parameters and differs with each render.
  
  >> How would a dumb blob store figure out which content belongs to the same 
page (and is thus similar), if all it has is the content & some metadata, but 
not the page id, title, revision & render UUID? This is the same design issue 
that plagues ExternalStore, and something we addressed in RESTBase. With 
large-window compression algorithms like brotli, we are getting down to 2-3% of 
the input HTML size (see https://phabricator.wikimedia.org/T122028). Without 
this locality information, you are likely to use an order of magnitude more 
storage, as you are forgoing efficient delta compression.
  > 
  > This is a good point. Once again, we want our abstraction to be a bit 
leaky, to allow for optimizations.
  
  I would argue that it is a case of finding an abstraction at the right level. 
A simple blob store is a very low-level abstraction, and severely limits the 
backend's abilities to optimize storage, distribution & consistency. It also 
limits the backend's usefulness as an API in its own right.
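
  To illustrate the locality argument, here is a minimal sketch that compares compressing each blob on its own with compressing all revisions of one page in a single stream. It uses Python's standard `lzma` module as a stand-in for brotli and synthetic text instead of real HTML, so the ratios it produces are not the 2-3% figure from T122028. The names `make_revisions`, `size_isolated` and `size_grouped` are made up for this example.

```python
import lzma
import random


def make_revisions(count=20, seed=0):
    """Build synthetic revisions of one page; each differs slightly from the last."""
    rng = random.Random(seed)
    words = [f"word{rng.randrange(5000)}" for _ in range(2000)]
    revisions = []
    for _rev in range(count):
        # Change a handful of words per revision, as a small edit would.
        for _edit in range(5):
            words[rng.randrange(len(words))] = f"edit{rng.randrange(10**6)}"
        revisions.append(" ".join(words).encode("utf-8"))
    return revisions


def size_isolated(revisions):
    """Total bytes when every blob is compressed alone (no locality information)."""
    return sum(len(lzma.compress(rev)) for rev in revisions)


def size_grouped(revisions):
    """Total bytes when all revisions of the page share one compression stream."""
    return len(lzma.compress(b"".join(revisions)))


if __name__ == "__main__":
    revs = make_revisions()
    print("isolated:", size_isolated(revs), "grouped:", size_grouped(revs))
```

  The grouped variant is only possible if the store knows which blobs belong to the same page, which is the information a plain content-addressed blob store does not have.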
  
  Instead, I think we should clearly define the API for each slot to provide / 
consume
  
  - page id,
  - page title,
  - revision id, and
  - a UUID / hash / etag.
  
  This makes sure that backends can continue to implement higher-level 
functionality & important optimizations. This should be part of the API, and 
not a case of a "leak". That said, backends *can* choose to ignore all of this 
(except the UUID / hash).
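
  As a rough sketch of what such a per-slot interface could look like, the example below passes page id, title, revision id and an optional v1 UUID explicitly instead of encoding them in the slot name. `SlotKey` and `InMemorySlotStore` are hypothetical names, not existing MediaWiki or RESTBase classes, and the in-memory dict only stands in for a real backend.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SlotKey:
    page_id: int
    title: str
    revision: int
    slot: str  # kind of content, e.g. "html"
    render_id: Optional[uuid.UUID] = None  # v1 UUID of one specific render


class InMemorySlotStore:
    """Toy backend; it keys on page id, revision and slot and ignores the title."""

    def __init__(self) -> None:
        self._renders: Dict[Tuple[int, int, str], Dict[uuid.UUID, bytes]] = {}

    def put(self, key: SlotKey, content: bytes) -> SlotKey:
        # Assign a time-based (v1) UUID when the caller did not supply one.
        render_id = key.render_id or uuid.uuid1()
        group = self._renders.setdefault((key.page_id, key.revision, key.slot), {})
        group[render_id] = content
        return replace(key, render_id=render_id)

    def get(self, key: SlotKey) -> Tuple[SlotKey, bytes]:
        group = self._renders[(key.page_id, key.revision, key.slot)]
        # Without a render id, return the most recently stored render.
        render_id = key.render_id or list(group)[-1]
        return replace(key, render_id=render_id), group[render_id]
```

  Both the request and the response carry the full key, so a caller can ask for the latest render of a revision and learn which render it received.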
  
  > I haven't thought this through yet, but my inclination is that we could 
associate a metadata array (k/v set) with the blob, which could include things 
like a hash and the page title. A BlobStore would be free to use this or not, 
to store it or not, and to make it retrievable or not.
  
  A minimum set of metadata (like the versioned content-type) should always be 
provided. It would be nice to model this in a way that's compatible with normal 
HTTP headers, as stored & returned by services like RESTBase.
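
  One possible shape for such metadata, assuming it is kept as a small set of lower-cased HTTP-style headers, is sketched below. The helper names and the profile URL are placeholders invented for this example; the parsing relies on Python's standard `email.message.Message`.

```python
from email.message import Message
from typing import Dict, Optional, Tuple


def blob_metadata(media_type: str, profile: str, etag: str) -> Dict[str, str]:
    """Build the minimum metadata set: a versioned content-type plus an etag."""
    return {
        "content-type": f'{media_type}; profile="{profile}"',
        "etag": etag,
    }


def parse_content_type(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Split a stored content-type header into media type and profile (version)."""
    msg = Message()
    msg["content-type"] = headers["content-type"]
    return msg.get_content_type(), msg.get_param("profile")


if __name__ == "__main__":
    meta = blob_metadata(
        "text/html",
        "https://example.org/specs/html/1.2.1",  # placeholder profile URL
        '"29e68f78-8765-49f8-86d5-dfc438d459fe"',
    )
    print(parse_content_type(meta))
```

  Keeping the version inside the content-type means a store can return the metadata verbatim as response headers without translating it.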

TASK DETAIL
  https://phabricator.wikimedia.org/T107595

To: daniel, GWicke
Cc: Glaisher, JJMC89, RobLa-WMF, Yurik, ArielGlenn, APerson, TomT0m, Krenair, 
intracer, Tgr, Tobi_WMDE_SW, Addshore, Lydia_Pintscher, cscott, PleaseStand, 
awight, Ricordisamoa, GWicke, MarkTraceur, waldyrious, Legoktm, Aklapper, 
Jdforrester-WMF, Ltrlg, brion, Spage, MZMcBride, daniel, D3r1ck01, Izno, 
Luke081515, Wikidata-bugs, aude, jayvdb, fbstj, Mbch331, Jay8g, bd808


