Why not just use a queue? We already use the job queue for nearly the same purpose. The job queue isn't amazing, but it works. Maybe someone should replace it with a better system while they're at it?
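For reference, queueing such a push through the existing job queue is not much code. A minimal PHP sketch; the class name, the job parameters, and the "wikidatapush" API module are hypothetical, and authentication (the special user / user right mentioned in the proposal below) is left out:

    class WikidataPushJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'wikidataPush', $title, $params );
        }

        public function run() {
            // POST the updated item to the subscribing wiki's API
            $req = MWHttpRequest::factory( $this->params['targetApi'], array(
                'method'   => 'POST',
                'postData' => array(
                    'action' => 'wikidatapush', // hypothetical API module
                    'data'   => FormatJson::encode( $this->params['itemData'] ),
                ),
            ) );
            return $req->execute()->isOK();
        }
    }

    // On the repo side, queue one job per subscribing client wiki:
    // $job = new WikidataPushJob( $title,
    //     array( 'targetApi' => $apiUrl, 'itemData' => $data ) );
    // $job->insert();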
On Mon, Apr 23, 2012 at 5:45 AM, Daniel Kinzler <[email protected]> wrote:
> Hi all!
>
> The wikidata team has been discussing how best to make data from wikidata
> available on local wikis. Fetching the data via HTTP whenever a page is
> re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
> push-based architecture.
>
> The proposal is at
> <http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage>;
> I have copied it below too.
>
> Please have a look and let us know if you think this is viable, and which
> of the two variants you deem better!
>
> Thanks,
> -- daniel
>
> PS: Please keep the discussion on wikitech-l, so we have it all in one place.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> == Proposal: HTTP push to local db storage ==
>
> * Every time an item on Wikidata is changed, an HTTP push is issued to all
> subscribing clients (wikis).
> ** initially, "subscriptions" are just entries in an array in the
> configuration.
> ** Pushes can be done via the job queue.
> ** pushing is done via the mediawiki API, but other protocols such as
> PubSubHubbub / AtomPub can easily be added to support 3rd parties.
> ** pushes need to be authenticated, so we don't get malicious crap. Pushes
> should be done using a special user with a special user right.
> ** the push may contain either the full set of information for the item, or
> just a delta (diff) + hash for an integrity check (in case an update was
> missed).
>
> * When the client receives a push, it does two things:
> *# write the fresh data into a local database table (the local wikidata cache)
> *# invalidate the (parser) cache for all pages that use the respective item
> (for now we can assume that we know this from the language links)
> *#* if we only update language links, the page doesn't even need to be
> re-parsed: we just update the language links in the cached ParserOutput
> object.
>
> * when a page is rendered, interlanguage links and other info are taken from
> the local wikidata cache. No queries are made to wikidata during
> parsing/rendering.
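To make the client side described above concrete, a rough PHP sketch of a push handler, assuming the payload has already been authenticated and decoded; the cache table, its fields, and the getPagesUsingItem() lookup are hypothetical:

    function handleWikidataPush( $itemId, $itemData ) {
        $dbw = wfGetDB( DB_MASTER );

        // 1) write the fresh data into the local wikidata cache table
        $dbw->replace(
            'wikidata_cache',              // hypothetical table
            array( 'wc_item_id' ),         // unique key
            array(
                'wc_item_id' => (int)$itemId,
                'wc_data'    => serialize( $itemData ),
                'wc_touched' => $dbw->timestamp(),
            ),
            __METHOD__
        );

        // 2) invalidate the parser cache for all pages that use this item;
        // for now, the language links are assumed to tell us which pages
        foreach ( getPagesUsingItem( $itemId ) as $title ) {
            $title->invalidateCache(); // bumps page_touched
        }
    }

The language-links-only shortcut (patching the cached ParserOutput instead of re-parsing) would replace step 2 for that case.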
> * In case an update is missed, we need a mechanism to allow requesting a full
> purge and re-fetch of all data from the client side, rather than just waiting
> for the next push, which might very well take a very long time to happen.
> ** There needs to be a manual option for when someone detects this; maybe
> action=purge can be made to do this. Simple cache invalidation, however,
> shouldn't pull info from wikidata.
> ** A time-to-live could be added to the local copy of the data so that it is
> periodically refreshed by a pull and does not stay stale indefinitely after a
> failed push.
>
> === Variation: shared database tables ===
>
> Instead of having a local wikidata cache on each wiki (which may grow big - a
> first guesstimate by Jeroen and Reedy is up to 1TB total, for all wikis), all
> client wikis could access the same central database table(s) managed by the
> wikidata wiki.
>
> * this is similar to the way the globalusage extension tracks the usage of
> commons images
> * whenever a page is re-rendered, the local wiki would query the table in the
> wikidata db. This means a cross-cluster db query whenever a page is rendered,
> instead of a local query.
> * the HTTP push mechanism described above would still be needed to purge the
> parser cache when needed. But the push requests would not need to contain the
> updated data; they may just be requests to purge the cache.
> * the ability for full HTTP pushes (using the mediawiki API or some other
> interface) would still be desirable for 3rd party integration.
>
> * This approach greatly lowers the amount of space used in the database.
> * it doesn't change the number of http requests made.
> ** it does, however, reduce the amount of data transferred via http (but not
> by much, at least not compared to pushing diffs).
> * it doesn't change the number of database requests, but it introduces
> cross-cluster requests.
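For comparison, the cross-cluster read in the shared-table variant could look roughly like this; the table, field, and wiki ID are invented for illustration, and this would run on every parse instead of a local cache lookup:

    function getWikidataItem( $itemId ) {
        // ask the load balancer for a slave connection to the wikidata
        // wiki instead of the local database (a cross-cluster read)
        $dbr = wfGetDB( DB_SLAVE, array(), 'wikidatawiki' );

        $row = $dbr->selectRow(
            'wb_items_cache',                 // hypothetical table
            array( 'item_data' ),
            array( 'item_id' => (int)$itemId ),
            __METHOD__
        );

        return $row ? unserialize( $row->item_data ) : null;
    }

Since wfGetDB() already supports connections to foreign wikis, the trade-off really is just the extra cross-cluster round trip per render that the proposal points out.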
