GWicke added a comment.
> In any case, the PageUpdater / WikiPage code needs to trigger notifications (produce events). I don't care what mechanism it used for that. Or rather: I'm very happy if we get a generalized mechanism. We'll have to agree on some kind of schema for revisions, slots, and blobs, but that should be easy enough.

Makes sense. Thanks for the clarification!

>> In addition to title and revision (which I assume remains an integer), we'd need an optional v1 UUID parameter to retrieve specific renders, in both the request & response interfaces.
>
> I have thought about this, too. My solution is to encode this in the slot name. So you could have an html.canonical (sub)slot, and a html.29e68f78-8765-49f8-86d5-dfc438d459fe, or html.en, or whatever.

Hmmm, this sounds like a rather ugly hack. I thought the 'slot' is identifying the kind of content, and is not some general-purpose string that is used to append otherwise missing parameters, and differs with each render.

>> How would a dumb blob store figure out which content belongs to the same page (and is thus similar), if all it has is the content & some metadata, but not the page id, title, revision & render UUID? This is the same design issue that plagues ExternalStore, and something we addressed in RESTBase. With large-window compression algorithms like brotli, we are getting down to 2-3% of the input HTML size (see https://phabricator.wikimedia.org/T122028). Without this locality information, you are likely to use an order of magnitude more storage as you are foregoing efficient delta compression.
>
> This is a good point. Once again, we want our abstraction to be a bit leaky, to allow for optimizations.

I would argue that it is a case of finding an abstraction at the right level. A simple blob store is a very low-level abstraction, and severely limits the backend's abilities to optimize storage, distribution & consistency. It also limits the backend's usefulness as an API in its own right.
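
To make the locality argument concrete, here is a rough sketch of why a backend benefits from knowing which blobs belong to the same page. It is only an illustration under stated assumptions: it uses Python's standard zlib as a stand-in for brotli (zlib's window is far smaller), the "revisions" are synthetic text rather than real HTML, and the function names (make_revisions, size_isolated, size_grouped) are made up for this example.

```python
import random
import zlib


def make_revisions(seed, count=5, words=400):
    """Build `count` synthetic revisions of one page, each a small edit of the last.

    Hypothetical stand-in for successive renders of the same page.
    """
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(5000)]
    text = [rng.choice(vocab) for _ in range(words)]
    revisions = []
    for _ in range(count):
        text = list(text)
        # Simulate a small edit: replace a single word.
        text[rng.randrange(words)] = rng.choice(vocab)
        revisions.append(" ".join(text).encode("utf-8"))
    return revisions


def size_isolated(blobs):
    """Total size when every blob is compressed on its own (no locality info)."""
    return sum(len(zlib.compress(blob, 9)) for blob in blobs)


def size_grouped(blobs):
    """Size when blobs known to belong to one page are compressed together,

    so later revisions can be encoded as references to earlier ones.
    """
    return len(zlib.compress(b"".join(blobs), 9))


if __name__ == "__main__":
    revisions = make_revisions(seed=1)
    print("isolated:", size_isolated(revisions), "bytes")
    print("grouped: ", size_grouped(revisions), "bytes")
```

A store that only ever sees opaque blobs is stuck with the "isolated" case. A store that is told the page id / title / revision can group related content and get the "grouped" case. The exact ratio depends on the compressor and the data; the sketch makes no claim about the 2-3% figure quoted above.
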

Instead, I think we should clearly define the API for each slot to provide / consume

- page id,
- page title,
- revision id, and
- a UUID / hash / etag.

This makes sure that backends can continue to implement higher-level functionality & important optimizations. This should be part of the API, and not a case of a "leak". That said, backends *can* choose to ignore all of this (except the UUID / hash).

> I haven't thought this through yet, but my inclination is that we could associate a metadata array (k/v set) with the blob, which could include things like a hash and the page title. A BlobStore would be free to use this or not, to store it or not, and to make it retrievable or not.

A minimum set of metadata (like the versioned content-type) should always be provided. It would be nice to model this in a way that's compatible with normal HTTP headers, as stored & returned by services like RESTBase.

TASK DETAIL
  https://phabricator.wikimedia.org/T107595