On 07.11.2012 00:41, Tim Starling wrote:
> On 06/11/12 23:16, Daniel Kinzler wrote:
>> On 05.11.2012 05:43, Tim Starling wrote:
>>> On 02/11/12 22:35, Denny Vrandečić wrote:
>>>> * For re-rendering the page, the wiki needs access to the data.
>>>> We are not sure about how do to this best: have it per cluster,
>>>> or in one place only?
>>>
>>> Why do you need to re-render a page if only the language links are
>>> changed? Language links are only in the navigation area, the wikitext
>>> content is not affected.
>>
>> Because AFAIK language links are cached in the parser output object, and
>> rendered into the skin from there. Asking the database for them every time
>> seems like overhead if the cached ParserOutput already has them... I believe we
>> currently use the one from the PO if it's there. Am I wrong about that?
> 
> You can use memcached.

Ok, let me see if I understand what you are suggesting.

So, in memcached, we'd have the language links for every page (or as many as fit
in there); actually, three lists per page: one of the links defined on the page
itself, one of the links defined by wikidata, and one of the wikidata links
suppressed locally.

When generating the langlinks for the sidebar, these lists would be combined
appropriately: the wikidata links minus the locally suppressed ones, plus the
links defined on the page itself. If we don't find anything in memcached for a
page, we of course need to parse it to get the locally defined language links.

When wikidata updates, we just update the record in memcached and invalidate the
page.

As far as I can see, we can then get the updated language links before the page
has been re-parsed, but we still need to re-parse eventually. And, when someone
actually looks at the page, the page does get parsed/rendered right away, and
the user sees the updated langlinks. So... what do we need the
pre-parse-update-of-langlinks for? Where and when would they even be used? I
don't see the point.

>> We could get around this, but even then it would be an optimization for
>> language links. But wikidata is soon going to provide data for infoboxes. Any
>> aspect of a data item could be used in an {{#if:...}}. So we need to re-render
>> the page whenever an item changes.
> 
> Wikidata is somewhere around 61000 physical lines of code now. Surely
> somewhere in that mountain of code, there is a class for the type of
> an item, where an update method can be added.

I don't understand what you are suggesting. At the moment, when
EntityContent::save() is called, it will trigger a change notification, which is
written to the wb_changes table. On the client side, a maintenance script polls
that table. What could/should be changed about that?
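For reference, the polling boils down to something like this (simplified; the
actual column names of wb_changes may differ):

  $dbr = wfGetDB( DB_SLAVE, array(), $repoDbName );
  $rows = $dbr->select(
      'wb_changes',
      '*',
      array( 'change_id > ' . (int)$lastSeenId ),
      __METHOD__,
      array( 'ORDER BY' => 'change_id ASC', 'LIMIT' => 100 )
  );

  foreach ( $rows as $row ) {
      $this->handleChange( $row ); // hypothetical: coalesce, notify the clients
      $lastSeenId = $row->change_id;
  }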

> I don't think it is feasible to parse pages very much more frequently
> than they are already parsed as a result of template updates (i.e.
> refreshLinks jobs). 

I don't see why we would parse more frequently. An edit is an edit, locally or
remotely. If you want a language link to be updated, the page needs
to be reparsed, whether that is triggered by wikidata or a bot edit. At least,
wikidata doesn't create a new revision.

> The CPU cost of template updates is already very
> high. Maybe it would be possible if the updates were delayed, run say
> once per day, to allow more effective duplicate job removal. Template
> updates should probably be handled in the same way.

My proposal is indeed unclear on one point: it does not clearly distinguish
between invalidating a page and re-rendering it. I think Denny mentioned
re-rendering in his original mail. The fact is: at the moment, we do not
re-render at all. We just invalidate. And I think that's good enough for now.

I don't see how that duplicate removal would work beyond the coalescing I
already suggested - except that for a large batch that covers a whole day, a lot
more can be coalesced.

> Of course, with template updates, you don't have to wait for the
> refreshLinks job to run before the new content becomes visible,
> because page_touched is updated and Squid is purged before the job is
> run. That may also be feasible with Wikidata.

We call Title::invalidateCache(). That ought to do it, right?

> If a page is only viewed once a week, you don't want to be rendering
> it 5 times per day. The idea is to delay rendering until the page is
> actually requested, and to update links periodically.

As I said, we currently don't re-render at all, and whether and when we should
is up for discussion. Maybe there could just be a background job re-rendering
all "dirty" pages every 24 hours or so, to keep the link tables up to date.

Note that we do need to re-parse eventually: infoboxes will contain things like
{{#property:population}}, which needs to be re-evaluated when the data item
changes. Any aspect of a data item can be used in conditionals:

{{#if:{{#property:commons-gallery}}|{{commons|{{#property:commons-gallery}}}}}}

Sitelinks (Language links) too can be accessed via parser functions and used in
conditionals.
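(For what it's worth, {{#property:...}} would be a regular parser function;
registration would look roughly like this, with the WikidataClient helpers
standing in for whatever lookup mechanism we end up with:)

  $wgHooks['ParserFirstCallInit'][] = function ( Parser $parser ) {
      $parser->setFunctionHook( 'property', function ( Parser $parser, $name = '' ) {
          // Find the item linked to the page being parsed, then return the
          // requested property value as wikitext.
          $itemId = WikidataClient::getItemIdForTitle( $parser->getTitle() ); // hypothetical
          $value = WikidataClient::getPropertyValue( $itemId, $name );        // hypothetical
          return $value === null ? '' : $value;
      } );
      return true;
  };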

> The reason I think duplicate removal is essential is because entities
> will be updated in batches. For example, a census in a large country
> might result in hundreds of thousands of item updates.

Yes, but for different items. How can we remove any duplicate updates if there
is just one edit per item? Why would there be multiple?

(Note: the current UI only supports atomic edits, one value at a time. The API
however allows bots to change any number of values at once, reducing the number
of change events.)

> What I'm suggesting is not quite the same as what you call
> "coalescing" in your design document. Coalescing allows you to reduce
> the number of events in recentchanges, and presumably also the number
> of Squid purges and page_touched updates. I'm saying that even after
> coalescing, changes should be merged further to avoid unnecessary
> parsing.

Ok, so there would be a re-parse queue with duplicate removal. When a change
notification is processed (after coalescing notifications), the target page is
invalidated using Title::invalidateCache() and it's also placed in the re-parse
queue to be processed later. How is this different from the job queue used for
parsing after template edits?
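Concretely, handling a (coalesced) change would be something like this (sketch;
whether the job queue's duplicate removal gets us what you describe is exactly
the question):

  $title->invalidateCache();  // bump page_touched so cached output goes stale
  $title->purgeSquid();       // as with template edits, purge before the re-parse

  // Schedule the re-parse the same way template edits do; duplicate jobs for
  // the same title can then be dropped before they run.
  Job::batchInsert( array( new RefreshLinksJob( $title, array() ) ) );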

>> Also, when the page is edited manually, and then rendered, the wiki needs to
>> somehow know a) which item ID is associated with this page and b) it needs to
>> load the item data to be able to render the page (just the language links, or
>> also infobox data, or eventually also the result of a wikidata query as a
>> list).
> 
> You could load the data from memcached while the page is being parsed,
> instead of doing it in advance, similar to what we do for images.

How does it get into memcached? What if it's not there?

> Dedicating hundreds of processor cores to parsing articles immediately
> after every wikidata change doesn't sound like a great way to avoid a
> few memcached queries.

Yea, as I said above, this is a misunderstanding. We don't insist on immediate
reparsing, we just think the pages need to be invalidated (i.e. *scheduled* for
parsing). I'll adjust the proposal to reflect that distinction.

>>> As I've previously explained, I don't think the langlinks table on the
>>> client wiki should be updated. So you only need to purge Squid and add
>>> an entry to Special:RecentChanges.
>>
>> If the language links from wikidata are not pulled in during rendering and
>> stored in the parseroutput object, and they're also not stored in the langlinks
>> table, where are they stored, then?
> 
> In the wikidatawiki DB, cached in memcached.
> 
>> How should we display it?
> 
> Use an OutputPage or Skin hook, such as OutputPageParserOutput.

Do I understand correctly that the point of this is to be able to update the
sitelinks quickly, without parsing the page? We *do* need to parse the page
anyway, though doing so later or only when the page is requested would probably
be fine.

Note that I'd still suggest writing the *effective* language links to the
langlinks table, for consistency. I don't see a problem with that.
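If we go the hook route, it would presumably look something like this (sketch;
where the wikidata links actually come from - memcached, a local table - is the
open question above):

  $wgHooks['OutputPageParserOutput'][] = function ( OutputPage $out, ParserOutput $pout ) {
      // Fetch the sitelinks for this page from wikidata (hypothetical helper)
      // and add them on top of the locally defined ones.
      $wikidataLinks = WikidataClient::getLangLinks( $out->getTitle() );
      $out->addLanguageLinks( $wikidataLinks );
      return true;
  };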

> You can get the namespace names from $wgConf and localisation cache,
> and then duplicate the code from Language::getNamespaces() to put it
> all together, along the lines of:
> 
> $wgConf->loadFullData();
> $extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
> $metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
> $metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
> list( $site, $lang ) = $wgConf->siteFromDB( $wiki );
> $defaults = Language::getLocalisationCache()
>     ->getItem( $lang, 'namespaceNames' );
> 
> But using the web API and caching the result in a file in
> $wgCacheDirectory would be faster and easier. $wgConf->loadFullData()
> takes about 16ms, it's much slower than reading a small local file.

Writing to another wiki's database without a firm handle on that wiki's config
sounds quite scary and brittle to me. It can be done, and we can pull together
all the necessary info, but... do you really think this is a good idea? What are
we gaining by doing it this way?

> Like every other sort of link, entity links should probably be tracked
> using the page_id of the origin (local) page, so that the link is not
> invalidated when the page moves. 

This is the wrong way around: sitelinks go from wikidata to wikipedia. As with
all links, link targets are tracked by title, and break when stuff is renamed.
When you move a page on Wikipedia, it loses its connection to the Wikidata
item, unless you update the Wikidata item (we plan to offer a button on the page
move form on wikipedia to do this conveniently).

> So when you update recentchanges, you
> can select the page_namespace from the page table. So the problem of
> namespace display would occur on the repo UI side.

There are two use cases to consider:

* when a change notification comes in, we need to inject the corresponding
record into the rc table of every wiki using the respective item. To do that, we
need access to some aspects of that wiki's config. Your proposal for caching the
namespace info would cover that.

* when a page is re-rendered, we need access to the data item, so we can pull in
the data fields via parser functions (in phase II). How does page Foo know that
it needs to load item Q5432? And how does it load the item data?

I currently envision that the page <-> item mapping would be maintained locally,
so a simple lookup would provide the item ID. And the item data could ideally be
pulled from ES (External Storage) - that needs some refactoring though. Our
current solution has a
cache table with the full uncompressed item data (latest revision only), which
could be maintained on every cluster or only on the repo. I'm now inclined
though to implement direct ES access. I have poked around a bit, and it seems
that this is possible without factoring out standalone BlobStore classes
(although that would still be nice). I'll put a note into the proposal to that
effect.
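To illustrate the lookup part (table and column names here are invented, the
schema is not settled yet):

  $dbr = wfGetDB( DB_SLAVE );
  $itemId = $dbr->selectField(
      'wbc_items_per_page',   // invented name for the local page <-> item map
      'ipp_item_id',
      array( 'ipp_page_id' => $title->getArticleID() ),
      __METHOD__
  );

  if ( $itemId !== false ) {
      $item = WikidataClient::loadItem( $itemId ); // from ES or the cache table
  }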

-- daniel