Why not just use a queue? We already use the job queue for nearly the same purpose. The job queue isn't amazing, but it works. Maybe someone should replace it with a better system while they're at it?
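For reference, queueing such a push through the existing job queue is not much code. A minimal PHP sketch; the class name, the job parameters, and the "wikidatapush" API module are hypothetical, and authentication (the special user / user right mentioned in the proposal below) is left out:

    class WikidataPushJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'wikidataPush', $title, $params );
        }

        public function run() {
            // POST the updated item to the subscribing wiki's API
            $req = MWHttpRequest::factory( $this->params['targetApi'], array(
                'method'   => 'POST',
                'postData' => array(
                    'action' => 'wikidatapush', // hypothetical API module
                    'data'   => FormatJson::encode( $this->params['itemData'] ),
                ),
            ) );
            return $req->execute()->isOK();
        }
    }

    // On the repo side, queue one job per subscribing client wiki:
    // $job = new WikidataPushJob( $title,
    //     array( 'targetApi' => $apiUrl, 'itemData' => $data ) );
    // $job->insert();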
On Mon, Apr 23, 2012 at 5:45 AM, Daniel Kinzler <[email protected]> wrote:
> Hi all!
>
> The wikidata team has been discussing how best to make data from wikidata
> available on local wikis. Fetching the data via HTTP whenever a page is
> re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
> push-based architecture.
>
> The proposal is at
> <http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage>;
> I have copied it below too.
>
> Please have a look and let us know if you think this is viable, and which
> of the two variants you deem better!
>
> Thanks,
> -- daniel
>
> PS: Please keep the discussion on wikitech-l, so we have it all in one place.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> == Proposal: HTTP push to local db storage ==
>
> * Every time an item on Wikidata is changed, an HTTP push is issued to all
> subscribing clients (wikis).
> ** initially, "subscriptions" are just entries in an array in the
> configuration.
> ** Pushes can be done via the job queue.
> ** pushing is done via the mediawiki API, but other protocols such as
> PubSubHubbub / AtomPub can easily be added to support 3rd parties.
> ** pushes need to be authenticated, so we don't get malicious crap. Pushes
> should be done using a special user with a special user right.
> ** the push may contain either the full set of information for the item, or
> just a delta (diff) + hash for an integrity check (in case an update was
> missed).
>
> * When the client receives a push, it does two things:
> *# write the fresh data into a local database table (the local wikidata cache)
> *# invalidate the (parser) cache for all pages that use the respective item
> (for now we can assume that we know this from the language links)
> *#* if we only update language links, the page doesn't even need to be
> re-parsed: we just update the language links in the cached ParserOutput
> object.
>
> * when a page is rendered, interlanguage links and other info are taken from
> the local wikidata cache. No queries are made to wikidata during
> parsing/rendering.
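To make the client side described above concrete, a rough PHP sketch of a push handler, assuming the payload has already been authenticated and decoded; the cache table, its fields, and the getPagesUsingItem() lookup are hypothetical:

    function handleWikidataPush( $itemId, $itemData ) {
        $dbw = wfGetDB( DB_MASTER );

        // 1) write the fresh data into the local wikidata cache table
        $dbw->replace(
            'wikidata_cache',              // hypothetical table
            array( 'wc_item_id' ),         // unique key
            array(
                'wc_item_id' => (int)$itemId,
                'wc_data'    => serialize( $itemData ),
                'wc_touched' => $dbw->timestamp(),
            ),
            __METHOD__
        );

        // 2) invalidate the parser cache for all pages that use this item;
        // for now, the language links are assumed to tell us which pages
        foreach ( getPagesUsingItem( $itemId ) as $title ) {
            $title->invalidateCache(); // bumps page_touched
        }
    }

The language-links-only shortcut (patching the cached ParserOutput instead of re-parsing) would replace step 2 for that case.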
> * In case an update is missed, we need a mechanism to allow requesting a full
> purge and re-fetch of all data from the client side, rather than just waiting
> for the next push, which might very well take a very long time to happen.
> ** There needs to be a manual option for when someone detects this; maybe
> action=purge can be made to do this. Simple cache invalidation, however,
> shouldn't pull info from wikidata.
> ** A time-to-live could be added to the local copy of the data so that it is
> periodically refreshed by a pull and does not stay stale indefinitely after a
> failed push.
>
> === Variation: shared database tables ===
>
> Instead of having a local wikidata cache on each wiki (which may grow big - a
> first guesstimate by Jeroen and Reedy is up to 1TB total, for all wikis), all
> client wikis could access the same central database table(s) managed by the
> wikidata wiki.
>
> * this is similar to the way the globalusage extension tracks the usage of
> commons images
> * whenever a page is re-rendered, the local wiki would query the table in the
> wikidata db. This means a cross-cluster db query whenever a page is rendered,
> instead of a local query.
> * the HTTP push mechanism described above would still be needed to purge the
> parser cache when needed. But the push requests would not need to contain the
> updated data; they may just be requests to purge the cache.
> * the ability for full HTTP pushes (using the mediawiki API or some other
> interface) would still be desirable for 3rd party integration.
>
> * This approach greatly lowers the amount of space used in the database.
> * it doesn't change the number of http requests made.
> ** it does, however, reduce the amount of data transferred via http (but not
> by much, at least not compared to pushing diffs).
> * it doesn't change the number of database requests, but it introduces
> cross-cluster requests.
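For comparison, the cross-cluster read in the shared-table variant could look roughly like this; the table, field, and wiki ID are invented for illustration, and this would run on every parse instead of a local cache lookup:

    function getWikidataItem( $itemId ) {
        // ask the load balancer for a slave connection to the wikidata
        // wiki instead of the local database (a cross-cluster read)
        $dbr = wfGetDB( DB_SLAVE, array(), 'wikidatawiki' );

        $row = $dbr->selectRow(
            'wb_items_cache',                 // hypothetical table
            array( 'item_data' ),
            array( 'item_id' => (int)$itemId ),
            __METHOD__
        );

        return $row ? unserialize( $row->item_data ) : null;
    }

Since wfGetDB() already supports connections to foreign wikis, the trade-off really is just the extra cross-cluster round trip per render that the proposal points out.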
