daniel added a comment.
In https://phabricator.wikimedia.org/T107595#2264799, @GWicke wrote:

> It is not entirely clear to me whether PageUpdater (and RevisionUpdater) are meant to only handle synchronous low-level updates, or whether they are meant to orchestrate asynchronous change propagation as well. I would suggest focusing PageUpdater and RevisionUpdater on synchronous / low-level updates only, and leave asynchronous change propagation to EventBus / the change propagation service.

RevisionUpdater/RevisionBuilder operates on the same level as Revision: no secondary data, no notifications. Just storage. PageUpdater would operate on the same level as WikiPage, but I think we should first get RevisionBuilder working, and leave PageUpdater as it is, for now.

In any case, the PageUpdater / WikiPage code needs to trigger notifications (produce events). I don't care what mechanism is used for that. Or rather: I'm very happy if we get a generalized mechanism. We'll have to agree on some kind of schema for revisions, slots, and blobs, but that should be easy enough.

>> The blob-store is (potentially) content-addressable, so the same blob may be used for different revisions of different pages.
>
> Blob sharing would complicate your storage significantly, as you'd either have to forgo deleting content forever (very expensive for something like HTML renders), or incur significant complexity of implementing an atomic reference counting scheme.

I have pushed the //derived slots// to the back of my mind until we have the //primary// slots working. I agree that for "volatile" data, we'd not want to use content-addressable blobs, for the reason you mentioned.

> For textual content, I am pretty certain that sharing is rare, and the complexity would overall be a loss in performance and reliability.

Sharing between different pages is probably rare, but:

>> Even for blobs that have an incremental ID (e.g. using the current text table storage mechanism), the same blob would frequently be used for multiple blobs of the same page.

Blobs would typically be shared by different revisions of the //same// page. This happens every time one primary slot is edited, but another is not changed. E.g. the free wikitext description of a file is edited, but the structured data isn't (or vice versa). Or the quality assessment data of an article is updated, but the article text isn't edited. In both cases, one of the blobs would be re-used by the new revision. I think this will actually be more common than editing all primary streams at once.

> How would a dumb blob store figure out which content belongs to the same page (and is thus similar), if all it has is the content & some metadata, but not the page id, title, revision & render UUID? This is the same design issue that plagues ExternalStore, and something we addressed in RESTBase. With large-window compression algorithms like brotli, we are getting down to 2-3% of the input HTML size (see https://phabricator.wikimedia.org/T122028). Without this locality information, you are likely to use an order of magnitude more storage as you are forgoing efficient delta compression.

This is a good point. Once again, we want our abstraction to be a bit leaky, to allow for optimizations. I haven't thought this through yet, but my inclination is that we could associate a metadata array (k/v set) with the blob, which could include things like a hash and the page title. A BlobStore would be free to use this or not, to store it or not, and to make it retrievable or not.

> I am generally trying to work out how RevisionContentLookup would work for use cases like fetching HTML from RESTBase. Some notes / questions:
>
> - In addition to title and revision (which I assume remains an integer), we'd need an optional v1 UUID parameter to retrieve specific renders, in both the request & response interfaces.

I have thought about this, too.
My solution is to encode this in the slot name. So you could have an html.canonical (sub)slot, and an html.29e68f78-8765-49f8-86d5-dfc438d459fe, or html.en, or whatever.

> - Will getTouched() return the UUID timestamp of a specific render (last-modified, essentially), or is this about page_touched? Also, should we expose UUIDs to make sure that we have a unique ID with a high-resolution timestamp?

getTouched() will return the touch date of the slot. For primary slots, this will always be the revision (edit) timestamp. For derived slots, it would be the time that slot was last updated [I'd love to use a logical clock for this, instead of wall clock time...]. I'd expose URLs. Their format would be left to the blob store. Could be a UUID.

> - For content from RESTBase, read restrictions are always enforced as part of the API request. No information about the applied restrictions is returned. In this context, getReadRestrictions() would basically always return the empty set.

That's fine. getReadRestrictions() tells MediaWiki to enforce restrictions. If the restrictions are enforced "further down", no problem.
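
To make the blob-sharing and metadata ideas from this comment more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not MediaWiki code: the class `InMemoryBlobStore`, its `store()` / `load()` methods, the helper `new_revision()`, the hint key `page_title` and the slot names `main` and `structured` are all hypothetical. It shows a content-addressable store that accepts an optional k/v hint set (which it is free to ignore), and a new revision that re-uses the blob of a primary slot that was not edited.

```python
import hashlib
from typing import Dict, Mapping, Optional


class InMemoryBlobStore:
    """Hypothetical content-addressable blob store (illustration only)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._hints: Dict[str, Dict[str, str]] = {}

    def store(self, data: bytes, hints: Optional[Mapping[str, str]] = None) -> str:
        # The address is derived from the content, so identical content
        # always maps to the same address.
        address = "sha1:" + hashlib.sha1(data).hexdigest()
        self._blobs[address] = data
        # Hints (e.g. the page title) are optional. A real store could use
        # them for locality-aware compression, or drop them entirely; this
        # toy store merely remembers them.
        if hints:
            self._hints.setdefault(address, {}).update(hints)
        return address

    def load(self, address: str) -> bytes:
        return self._blobs[address]


def new_revision(
    store: InMemoryBlobStore,
    parent_slots: Mapping[str, str],
    changed: Mapping[str, bytes],
    hints: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a slot-name -> blob-address map for a new revision.

    Slots that are not in `changed` keep the parent's blob address,
    so that blob ends up shared by both revisions of the page.
    """
    slots = dict(parent_slots)
    for name, data in changed.items():
        slots[name] = store.store(data, hints)
    return slots


store = InMemoryBlobStore()
hints = {"page_title": "File:Example.jpg"}  # hypothetical hint key and value

# First revision: both primary slots are written.
rev1 = new_revision(
    store,
    {},
    {"main": b"A description.", "structured": b'{"depicts": "Q42"}'},
    hints,
)

# Second revision: only the wikitext description is edited.
rev2 = new_revision(store, rev1, {"main": b"A better description."}, hints)
```

After this runs, `rev2["structured"]` is the same address as `rev1["structured"]`, while `rev2["main"]` points at a new blob. Whether hints are stored, used, or retrievable is deliberately left to the store, matching the "leaky abstraction" idea above.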
