On 23/04/12 18:42, Daniel Kinzler wrote:
> On 23.04.2012 17:28, Platonides wrote:
>> On 23/04/12 14:45, Daniel Kinzler wrote:
>>> *#* if we only update language links, the page doesn't even need to be
>>> re-parsed: we just update the languagelinks in the cached ParserOutput 
>>> object.
>>
>> It's not that simple; for instance, there may be several ParserOutputs
>> for the same page. On the bright side, you probably don't need it. I'd
>> expect that if interwikis are handled through wikidata, they are
>> completely replaced through a hook, so no need to touch the ParserOutput
>> objects.
> 
> I would go that way if we were just talking about languagelinks. But we have to
> provide for phase II (infoboxes) and III (automated lists) too. Since we'll have
> to re-parse in most cases anyway (and parsing pages without infoboxes tends to
> be cheaper anyway), I see no benefit in spending time on inventing a way to
> bypass parsing. It's tempting, granted, but it seems a distraction atm.

Sure, but in those cases you need to reparse the full page, so there's no
need for tricks that modify the ParserOutput. :)
So if you want to skip reparsing for interwikis, fine, but just use a hook.
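
A minimal sketch of that hook idea, in Python-style pseudocode (the hook
name, the cache and the helpers are assumptions for illustration; in
MediaWiki this would of course be a PHP hook handler):

# Sketch only: replace the language links at render time from the Wikidata
# client's own cache, so the cached ParserOutput never needs to be rewritten.
languagelinks_cache = {}   # page title -> list of interwiki links, kept current

def on_language_links(title, links):
    """Hypothetical hook handler, called when the skin builds the sidebar."""
    cached = languagelinks_cache.get(title)
    if cached is not None:
        links[:] = cached      # overwrite whatever was parsed from the wikitext
    return True                # let other handlers run

def language_links_for(title, parsed_links):
    links = list(parsed_links)
    on_language_links(title, links)
    return links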



>> I think a save/purge shall always fetch the data. We can't store the
>> copy in the parsed object.
> 
> well, for languagelinks, we already do, and will probably keep doing it. Other
> data, which will be used in the page content, shouldn't be stored in the parser
> output. The parser should take them from some cache.

The ParserOutput is a parsed representation of the wikitext. The cached
wikidata interwikis shouldn't be stored there (or at least not only there,
since it would only hold the interwikis as they were at the last full render).



>> What we can do is to fetch it from a local cache or directly from the
>> origin one.
> 
> Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like
> plugins for that, sure. But:
> 
> The real question is how purging and updating will work. Pushing? Polling?
> Purge-and-pull?
> 
>> You mention the cache for the push model, but I think it deserves a
>> clearer separation.
> Can you explain what you have in mind?

I mean, they are based on the same concept. What really matters is how
things reach the db.
I'd have the WikiData db replicated to {{places}}.
For WMF, all wikis could connect directly to the main instance, have a
slave "assigned" to each cluster...
Then on each page render, the variables used could be checked against the
latest version (unless they were checked within the last x minutes), and a
re-render triggered if they differ.

So, suppose a page uses the fact
Germany{capital:"Berlin";language:"German"};
it would store that along with the version of WikiData used (e.g. WikiData 2.0,
Germany 488584364).

When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No -> go to 5)
2) Is the Germany module newer than 488584364? (No -> store that it's up to
date with WikiData 3, go to 5)
3) Fetch the Germany data. If the data actually used hasn't changed, update
the metadata. Go to 5.
4) Re-render the page.
5) Show contents.
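
A minimal sketch of those steps, with all names (PageRenderMeta,
latest_version(), used_facts_changed(), ...) assumed for illustration
rather than taken from any existing MediaWiki or WikiData API:

# Sketch only: per-page metadata records which repo snapshot and which item
# revisions the cached render was based on, and is checked before showing it.
from dataclasses import dataclass, field

@dataclass
class PageRenderMeta:
    wikidata_version: int                               # snapshot used at last render
    item_revisions: dict = field(default_factory=dict)  # e.g. {"Germany": 488584364}

def show_page(page, repo, renderer):
    meta = page.render_meta
    latest = repo.latest_version()
    if latest > meta.wikidata_version:                  # step 1
        stale = False
        for item, used_rev in meta.item_revisions.items():
            if repo.latest_revision(item) > used_rev:   # step 2
                data = repo.fetch(item)                 # step 3
                if page.used_facts_changed(item, data):
                    stale = True
                    break
        if stale:
            renderer.rerender(page)                     # step 4 (would also refresh item_revisions)
        meta.wikidata_version = latest                  # remember we're now up to date
    return page.cached_html                             # step 5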

As for actively purging the pages' content, that's only interesting for
anonymous (cached) views.
You'd need a script able to replay a purge for a range of WikiData changes.
That'd basically perform the above steps, but doing the render through the
job queue.
A normal wiki would call those functions while replicating, but wikis with a
shared db (or ones importing full files of newer data) would run it
standalone (plus as a utility after screw-ups).
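
A rough sketch of such a script, again with assumed names (changes_between,
pages_using, a generic job queue) rather than any real API:

# Sketch only: replay a range of WikiData changes and queue re-renders instead
# of rendering inline, so that cached (anonymous) views get refreshed.
def purge_for_changes(repo, job_queue, from_version, to_version):
    for change in repo.changes_between(from_version, to_version):
        for page in pages_using(change.item):        # local pages embedding this item
            job_queue.push({
                "type": "rerenderPage",
                "page": page,
                "item": change.item,
                "revision": change.revision,
            })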


>> You'd probably also want multiple dbs (let's call them WikiData
>> repositories), partitioned by content (and its update frequency). You
>> could then use different frontends (as Chad says, "similar to FileRepo").
>> So, a WikiData repository with the atomic properties of each element would
>> happily live in a dba file. Interwikis would have to be on a MySQL db, etc.
> 
> This is what I was aiming at with the DataTransclusion extension a while back.
> 
> But currently, we are not building a tool for including arbitrary data sources
> in wikipedia. We are building a central database for maintaining factual
> information. Our main objective is to get that done.

Not arbitrary, but having different sources (repositories), even if they
are under the control of the same entity. Mostly interesting for separating
slow- and fast-changing data, although I'm sure reusers would find more use
cases, such as only downloading the db covering the section they care about.
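
For illustration, a FileRepo-like split could look roughly like this; the
interface and both backends are hypothetical, not an existing extension:

# Sketch only: client code sees one interface; whether the data lives in a
# shared MySQL db, behind an HTTP API or in a local dba-style file is
# per-repository configuration.
from abc import ABC, abstractmethod
import json, urllib.request

class DataRepo(ABC):
    @abstractmethod
    def fetch(self, item: str) -> dict:
        """Return the stored facts for one item, e.g. 'Germany'."""

class DatabaseRepo(DataRepo):
    """Direct access to a (replicated) repository db, e.g. inside the WMF cluster."""
    def __init__(self, conn):
        self.conn = conn                 # hypothetical db connection object
    def fetch(self, item):
        row = self.conn.query_row("SELECT data FROM items WHERE name = %s", (item,))
        return json.loads(row["data"])

class HttpRepo(DataRepo):
    """Remote access for third-party wikis without direct db access."""
    def __init__(self, base_url):
        self.base_url = base_url
    def fetch(self, item):
        with urllib.request.urlopen("%s/%s.json" % (self.base_url, item)) as resp:
            return json.load(resp)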


> A design that is flexible enough to easily allow for future inclusion of other
> data sources would be nice. As long as the abstraction doesn't get in the way.
> 
> Anyway, it seems that it boils down to this:
> 
> 1) The client needs some (abstracted?) way to access the repository/repositories
> 2) The repo needs to be able to notify the client sites about changes, be it via
> push, purge, or polling.
> 3) We'll need a local cache or cross-site database access.
> 
> So, which combination of these techniques would you prefer?
> 
> -- daniel

I'd use a pull-based model. That seems to be what fits best with the
current MediaWiki model. But it isn't too relevant at this time (or maybe
you have advanced a lot by now!).

