Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-08 Thread Daniel Kinzler
First off, TL;DR:
(@Tim: did my best to summarize, please correct any misrepresentation.)

* Tim: don't re-parse when sitelinks change.

* Daniel: can be done, but do we really need to optimize for this case? Denny,
can we get better figures on this?

* Daniel: how far do we want to limit the things we make available via parser
functions and/or Lua binding? Could allow more with Lua (it's faster, and
implementing complex functionality via parser functions is nasty anyway).

* Consensus: we want to coalesce changes before acting on them.

* Tim: we also want to avoid redundant rendering by removing duplicate render
jobs (like multiple re-renderings of the same page) resulting from the changes.

* Tim: large batches (lower frequency) for re-rendering pages that are already
invalidated would allow more dupes to be removed. (Pages would still be rendered
on demand when viewed, but link tables would update later)

* Daniel: sounds good, but perhaps this should be a general feature of the
re-render/linksUpdate job queue, so it's also used when templates get edited.

* Consensus: load items directly from ES (External Storage), via remote access
to the repo's text table; cache in memcached.

* Tim: also get rid of the local item-page mapping; just look each page up on
the repo.

* Daniel: Ok, but then we can't optimize bulk ops involving multiple items.

* Tim: run the polling script from the repo, push to client wiki DBs directly.

* Daniel: that's scary, client wikis should keep control of how changes are 
handled.


Now, the nitty gritty:

On 08.11.2012 01:51, Tim Starling wrote:
 On 07/11/12 22:56, Daniel Kinzler wrote:
 As far as I can see, we then can get the updated language links before the 
 page
 has been re-parsed, but we still need to re-parse eventually. 
 
 Why does it need to be re-parsed eventually?

For the same reason pages need to be re-parsed when templates change: because
links may depend on the data items.

We are currently working on the assumption that *any* aspect of a data item is
accessible via parser functions in the wikitext, and may thus influence any
aspect of that page's parser output.

So, if *anything* about a data item changes, *anything* about the wikipedia page
using it may change too. So that page needs to be re-parsed.

Maybe we'll be able to skip the rendering in some cases, but for normal
property changes, like a new value for the population of a country, all pages
that use the respective data item need to be re-rendered soonish, otherwise the
link tables (especially categories) will get out of whack.

So, let's think about what we *could* optimize:

* I think we could probably disallow access to wikidata sitelinks via parser
functions in wikipedia articles. That would allow us to use an optimized data
flow for changes to sitelinks (aka language links) which does not cause the page
to be re-rendered.

* Maybe we can also avoid re-parsing pages on changes that apply only to
languages that are not used on the respective wiki (let's call them unrelated
translation changes). The tricky bit here is to figure out changes to which
language affect which wiki in the presence of complex language fallback rules
(e.g. nds-de-mul or even nastier stuff involving circular relations between
language variants).

* Changes to labels, descriptions and aliases of items on wikidata will
*probably* not influence the content of wikipedia pages. We could disallow
access to these aspects of data items to make sure - this would be a shame, but
not terrible. At least not for infoboxes. For automatically generated lists
we'll need the labels at the very least.

* We could keep track of which properties of the data item are actually used on
each page, and then only re-parse if those properties change (see the sketch
below). That would be quite a bit of data, and annoying to maintain, but
possible. Whether this has a large impact on the need to re-parse remains to be
seen, since it greatly depends on the infobox templates.
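As a rough illustration of that last point, usage tracking could happen during
parsing, along these lines (untested sketch; the page property name and the
array layout are invented for illustration):

// Untested sketch: while parsing, remember which aspects of which items were
// actually accessed, so a later change can be checked against this record.
// The 'wikibase-item-usage' property and the array structure are made up.
$parser->getOutput()->setProperty( 'wikibase-item-usage', serialize( array(
    'Q64' => array( 'sitelinks', 'claims.P1082' ), // e.g. Berlin: langlinks + population
) ) );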

We can come up with increasingly complex rules for skipping rendering, but
except perhaps for the sitelink changes, this seems brittle and confusing. I'd
like to avoid it as much as possible.

 And, when someone
 actually looks at the page, the page does get parsed/rendered right away, and
 the user sees the updated langlinks. So... what do we need the
 pre-parse-update-of-langlinks for? Where and when would they even be used? I
 don't see the point.
 
 For language link updates in particular, you wouldn't have to update
 page_touched, so the page wouldn't have to be re-parsed.

If the languagelinks in the sidebar come from memcached and not the cached
parser output, then yes.

 We could get around this, but even then it would be an optimization for 
 language
 links. But wikidata is soon going to provide data for infoboxes. Any 
 aspect of a
 data item could be used in an {{#if:...}}. So we need to re-render the page
 whenever an item changes.

 Wikidata is somewhere around 61000 physical lines of code now. Surely
 somewhere 

Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-07 Thread Daniel Kinzler
On 07.11.2012 00:41, Tim Starling wrote:
 On 06/11/12 23:16, Daniel Kinzler wrote:
 On 05.11.2012 05:43, Tim Starling wrote:
 On 02/11/12 22:35, Denny Vrandečić wrote:
 * For re-rendering the page, the wiki needs access to the data.
 We are not sure about how to do this best: have it per cluster,
 or in one place only?

 Why do you need to re-render a page if only the language links are
 changed? Language links are only in the navigation area, the wikitext
 content is not affected.

 Because AFAIK language links are cached in the parser output object, and
 rendered into the skin from there. Asking the database for them every time 
 seems
 like overhead if the cached ParserOutput already has them... I believe we
 currently use the one from the PO if it's there. Am I wrong about that?
 
 You can use memcached.

Ok, let me see if I understand what you are suggesting.

So, in memcached, we'd have the language links for every page (or as many as fit
in there); actually, three lists per page: one of the links defined on the page
itself, one of the links defined by wikidata, and one of the wikidata links
suppressed locally.

When generating the langlinks for the sidebar, these lists would be combined
appropriately. If we don't find anything in memcached, we of course need to
parse the page to get the locally defined language links.

When wikidata updates, we just update the record in memcached and invalidate the
page.
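The merge step itself would be trivial, roughly like this (untested sketch; the
cache keys and value layout are made up):

// Untested sketch of the sidebar merge step; the memcached keys are invented.
$id = $title->getArticleID();
$local      = $wgMemc->get( wfMemcKey( 'wbc', 'langlinks-local', $id ) );      // defined in the wikitext
$external   = $wgMemc->get( wfMemcKey( 'wbc', 'langlinks-repo', $id ) );       // from wikidata
$suppressed = $wgMemc->get( wfMemcKey( 'wbc', 'langlinks-suppressed', $id ) ); // suppressed locally

$links = array_merge(
    array_diff( (array)$external, (array)$suppressed ), // repo links minus local suppressions
    (array)$local                                        // plus locally defined links
);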

As far as I can see, we then can get the updated language links before the page
has been re-parsed, but we still need to re-parse eventually. And, when someone
actually looks at the page, the page does get parsed/rendered right away, and
the user sees the updated langlinks. So... what do we need the
pre-parse-update-of-langlinks for? Where and when would they even be used? I
don't see the point.

 We could get around this, but even then it would be an optimization for 
 language
 links. But wikidata is soon going to provide data for infoboxes. Any aspect 
 of a
 data item could be used in an {{#if:...}}. So we need to re-render the page
 whenever an item changes.
 
 Wikidata is somewhere around 61000 physical lines of code now. Surely
 somewhere in that mountain of code, there is a class for the type of
 an item, where an update method can be added.

I don't understand what you are suggesting. At the moment, when
EntityContent::save() is called, it will trigger a change notification, which is
written to the wb_changes table. On the client side, a maintenance script polls
that table. What could/should be changed about that?
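For reference, the client-side polling boils down to roughly this (untested
sketch; only the wb_changes table and change_id are from the current schema,
the state keeping and coalescing are omitted):

// Untested sketch of the polling loop; error handling and coalescing omitted.
$dbr = wfGetDB( DB_SLAVE, array(), $repoWikiId ); // remote access to the repo DB
$res = $dbr->select(
    'wb_changes',
    '*',
    array( 'change_id > ' . (int)$lastSeenChangeId ),
    __METHOD__,
    array( 'ORDER BY' => 'change_id ASC', 'LIMIT' => $batchSize )
);
foreach ( $res as $row ) {
    // coalesce per target page, then invalidate the page / schedule re-rendering
}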

 I don't think it is feasible to parse pages very much more frequently
 than they are already parsed as a result of template updates (i.e.
 refreshLinks jobs). 

I don't see why we would parse more frequently. An edit is an edit, locally or
remotely. If you want a language link to be updated, the page needs
to be reparsed, whether that is triggered by wikidata or a bot edit. At least,
wikidata doesn't create a new revision.

 The CPU cost of template updates is already very
 high. Maybe it would be possible if the updates were delayed, run say
 once per day, to allow more effective duplicate job removal. Template
 updates should probably be handled in the same way.

My proposal is indeed unclear on one point: it does not clearly distinguish
between invalidating a page and re-rendering it. I think Denny mentioned
re-rendering in his original mail. The fact is: At the moment, we do not
re-render at all. We just invalidate. And I think that's good enough for now.

I don't see how that duplicate removal would work beyond the coalescing I
already suggested - except that for a large batch that covers a whole day, a lot
more can be coalesced.

 Of course, with template updates, you don't have to wait for the
 refreshLinks job to run before the new content becomes visible,
 because page_touched is updated and Squid is purged before the job is
 run. That may also be feasible with Wikidata.

We call Title::invalidateCache(). That ought to do it, right?

 If a page is only viewed once a week, you don't want to be rendering
 it 5 times per day. The idea is to delay rendering until the page is
 actually requested, and to update links periodically.

As I said, we currently don't re-render at all, and whether and when we should
is up for discussion. Maybe there could just be a background job re-rendering
all dirty pages every 24 hours or so, to keep the link tables up to date.

Note that we do need to re-parse eventually: Infoboxes will contain things like
{{#property:population}}, which need to be invalidated when the data item
changes. Any aspect of a data item can be used in conditionals:

{{#if:{{#property:commons-gallery}}|{{commons|{{#property:commons-gallery}}}}}}

Sitelinks (Language links) too can be accessed via parser functions and used in
conditionals.

 The reason I think duplicate removal is essential is because entities
 will be updated in batches. For example, a census 

Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-07 Thread Tim Starling
On 07/11/12 22:56, Daniel Kinzler wrote:
 As far as I can see, we then can get the updated language links before the 
 page
 has been re-parsed, but we still need to re-parse eventually. 

Why does it need to be re-parsed eventually?

 And, when someone
 actually looks at the page, the page does get parsed/rendered right away, and
 the user sees the updated langlinks. So... what do we need the
 pre-parse-update-of-langlinks for? Where and when would they even be used? I
 don't see the point.

For language link updates in particular, you wouldn't have to update
page_touched, so the page wouldn't have to be re-parsed.

 We could get around this, but even then it would be an optimization for 
 language
 links. But wikidata is soon going to provide data for infoboxes. Any aspect 
 of a
 data item could be used in an {{#if:...}}. So we need to re-render the page
 whenever an item changes.

 Wikidata is somewhere around 61000 physical lines of code now. Surely
 somewhere in that mountain of code, there is a class for the type of
 an item, where an update method can be added.
 
 I don't understand what you are suggesting. At the moment, when
 EntityContent::save() is called, it will trigger a change notification, which 
 is
 written to the wb_changes table. On the client side, a maintenance script 
 polls
 that table. What could/should be changed about that?

I'm saying that you don't really need the client-side maintenance
script, it can be done just with repo-side jobs. That would reduce the
job insert rate by a factor of the number of languages, and make the
task of providing low-latency updates to client pages somewhat easier.

For language link updates, you just need to push to memcached, purge
Squid and insert a row into recentchanges. For #property, you
additionally need to update page_touched and construct a de-duplicated
batch of refreshLinks jobs to be run on the client side on a daily basis.
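Schematically, such a repo-side job could look like this (untested sketch; the
class name, parameters and the pre-built rc row are all invented, only the
MediaWiki classes and functions used are real):

// Untested sketch of a repo-side update job for one client page.
class ClientSitelinkUpdateJob extends Job {
    public function __construct( $title, $params ) {
        parent::__construct( 'clientSitelinkUpdate', $title, $params );
    }

    public function run() {
        global $wgMemc;
        // push the new sitelink set for the client page to memcached
        $wgMemc->set( $this->params['memcKey'], $this->params['sitelinks'] );
        // purge Squid for the client page
        SquidUpdate::purge( array( $this->params['clientPageUrl'] ) );
        // insert a language-independent row into the client wiki's recentchanges
        $dbw = wfGetDB( DB_MASTER, array(), $this->params['clientWikiId'] );
        $dbw->insert( 'recentchanges', $this->params['rcRow'], __METHOD__ );
        return true;
    }
}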

 I don't think it is feasible to parse pages very much more frequently
 than they are already parsed as a result of template updates (i.e.
 refreshLinks jobs). 
 
 I don't see why we would parse more frequently. An edit is an edit, locally or
 remotely. If you want a language link to be updated, the page needs
 to be reparsed, whether that is triggered by wikidata or a bot edit. At least,
 wikidata doesn't create a new revision.

Surely Wikidata will dramatically increase the amount of data
available on the infoboxes of articles in small wikis, and improve the
freshness of that data. If it doesn't, something must have gone
terribly wrong.

Note that the current system is inefficient, sometimes to the point of
not working at all. When bot edits on zhwiki cause a job queue backlog
6 months long, or data templates cause articles to take a gigabyte of
RAM and 15 seconds to render, I tell people "don't worry, I'm sure
Wikidata will fix it". I still think we can deliver on that promise,
with proper attention to system design.

 Of course, with template updates, you don't have to wait for the
 refreshLinks job to run before the new content becomes visible,
 because page_touched is updated and Squid is purged before the job is
 run. That may also be feasible with Wikidata.
 
 We call Title::invalidateCache(). That ought to do it, right?

You would have to also call Title::purgeSquid(). But it's not
efficient to use these Title methods when you have thousands of pages
to purge; that's why we use HTMLCacheUpdate for template updates.
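For comparison, roughly (untested sketch; the batched form assumes a backlink
table saying which pages use which item, which Wikidata doesn't have yet):

// Per-page purge, fine for a handful of titles:
$title->invalidateCache(); // bumps page_touched
$title->purgeSquid();      // purges the Squid URLs for this title

// Batched purge via a backlink table, as done for template edits:
$update = new HTMLCacheUpdate( $templateTitle, 'templatelinks' );
$update->doUpdate();
// Wikidata would need an analogous "which pages use this item" table for this.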

 Sitelinks (Language links) too can be accessed via parser functions and used 
 in
 conditionals.

Presumably that would be used fairly rarely. You could track it
separately, or remove the feature, in order to provide efficient
language link updates as I described.

 The reason I think duplicate removal is essential is because entities
 will be updated in batches. For example, a census in a large country
 might result in hundreds of thousands of item updates.
 
 Yes, but for different items. How can we remove any duplicate updates if there
 is just one edit per item? Why would there be multiple?

I'm not talking about removing duplicate item edits, I'm talking about
avoiding running multiple refreshLinks jobs for each client page. I
thought refreshLinks was what Denny was talking about when he said
"re-render"; thanks for clearing that up.

 Ok, so there would be a re-parse queue with duplicate removal. When a change
 notification is processed (after coalescing notifications), the target page is
 invalidated using Title::invalidateCache() and it's also placed in the 
 re-parse
 queue to be processed later. How is this different from the job queue used for
 parsing after template edits?

There's no duplicate removal with template edits, and no 24-hour delay
in updates to improve the effectiveness of duplicate removal.

It's the same problem, it's just that the current system for template
edits is cripplingly inefficient and unscalable. So I'm bringing up
these performance ideas before Wikidata 

Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-06 Thread Daniel Kinzler
On 05.11.2012 05:43, Tim Starling wrote:
 On 02/11/12 22:35, Denny Vrandečić wrote:
 * For re-rendering the page, the wiki needs access to the data.
 We are not sure about how to do this best: have it per cluster,
 or in one place only?
 
 Why do you need to re-render a page if only the language links are
 changed? Language links are only in the navigation area, the wikitext
 content is not affected.

Because AFAIK language links are cached in the parser output object, and
rendered into the skin from there. Asking the database for them every time seems
like overhead if the cached ParserOutput already has them... I believe we
currently use the one from the PO if it's there. Am I wrong about that?

We could get around this, but even then it would be an optimization for language
links. But wikidata is soon going to provide data for infoboxes. Any aspect of a
data item could be used in an {{#if:...}}. So we need to re-render the page
whenever an item changes.

Also, when the page is edited manually, and then rendered, the wiki needs to
somehow know a) which item ID is associated with this page and b) it needs to
load the item data to be able to render the page (just the language links, or
also infobox data, or eventually also the result of a wikidata query as a list).

 As I've previously explained, I don't think the langlinks table on the
 client wiki should be updated. So you only need to purge Squid and add
 an entry to Special:RecentChanges.

If the language links from wikidata are not pulled in during rendering and
stored in the ParserOutput object, and they're also not stored in the langlinks
table, where are they stored, then? How should we display them?

 Purging Squid can certainly be done from the context of a wikidatawiki
 job. For RecentChanges the main obstacle is accessing localisation
 text. You could use rc_params to store language-independent message
 parameters, like what we do for log entries.

We also need to resolve localized namespace names so we can put the correct
namespace id into the RC table. I don't see a good way to do this from the
context of another wiki (without using the web api).

-- daniel



Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-06 Thread Denny Vrandečić
Hi Tim,

thank you for the input.

Wikidata unfortunately will not contain all language links: a wiki can
locally overwrite the list (by extending the list, suppressing a link
from Wikidata, or replacing a link). This is a requirement as not all
language links are necessarily symmetric (although I wish they were).
This means there is some interplay between the wikitext and the links
coming from Wikidata. An update to the links coming from Wikidata can
have different effects on the actually displayed language links
depending on what is in the local wikitext.

Now, we could possibly also save the effects defined in the local
wikitext (which links are suppressed, which are additionally locally
defined) in the DB as well, and then, when they get changed externally,
smartly combine that and create the new correct list --- but this
sounds like a lot of effort. It would potentially save cycles compared
to today's situation. But the proposed solution does not *add* cycles
compared to today's situation. Today, the bots that keep the language
links in synch basically incur a re-rendering of the page anyway; we
would not be adding any cost on top of that. We do not make matters
worse with regards to server costs.

Also it would, as Daniel mentioned, be an optimization which would
only work for the language links. Once we add further data that will
be available to the wikitext, this will not work at all anymore.

I hope this explains why we think that the re-rendering is helpful.


Having said that, here's an alternative scenario: Assuming we do not
send any re-rendering jobs to the Wikipedias, what is the worst that
would happen?

To answer that, I need the answer to this question first: are the
squids and caches holding their content indefinitely, or would the
data, in the worst case, just be out of synch for, say, up to 24 hours
on a Wikipedia article that didn't have an edit at all?

If we do not re-render, I assume editors will come up with their own
workflows (e.g. changing some values in Wikidata, going to their home
wiki and purging the affected page, or writing a script that gives them
a "purge my home wiki page" link on Wikidata), which is fine, and still
cheaper than if we initiate re-rendering of all pages every time. It
just means that in some cases some pages will not be up to date.

So, we could go without re-rendering at all, if there is consensus
that this is the preferred solution and that this is better than the
solution we suggested.

Anyone having any comments, questions, or insights?

Cheers,
Denny



2012/11/5 Tim Starling tstarl...@wikimedia.org:
 On 02/11/12 22:35, Denny Vrandečić wrote:
 * For re-rendering the page, the wiki needs access to the data.
 We are not sure about how to do this best: have it per cluster,
 or in one place only?

 Why do you need to re-render a page if only the language links are
 changed? Language links are only in the navigation area, the wikitext
 content is not affected.

 As I've previously explained, I don't think the langlinks table on the
 client wiki should be updated. So you only need to purge Squid and add
 an entry to Special:RecentChanges.

 Purging Squid can certainly be done from the context of a wikidatawiki
 job. For RecentChanges the main obstacle is accessing localisation
 text. You could use rc_params to store language-independent message
 parameters, like what we do for log entries.

 -- Tim Starling





-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.


Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-06 Thread Daniel Kinzler
On 06.11.2012 17:01, Denny Vrandečić wrote:
 So, we could go without re-rendering at all, if there is consensus
 that this is the preferred solution and that this is better than the
 solution we suggested.
 
 Anyone having any comments, questions, or insights?

I already suggested that changes to the same page in the same batch could be
coalesced together. The larger we make the batches, the more updates we optimize
away, but the longer it takes until the Wikipedia pages update.
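To make that concrete, coalescing boils down to something like this (untested
sketch; the structure of the change records is made up for illustration):

// Untested sketch: collapse all changes in a batch to one pending update per
// client page, merging the touched aspects.
$byPage = array();
foreach ( $changes as $change ) {
    $pageId = $change['client_page_id'];
    if ( !isset( $byPage[$pageId] ) ) {
        $byPage[$pageId] = $change;
    } else {
        $byPage[$pageId]['aspects'] = array_unique( array_merge(
            $byPage[$pageId]['aspects'],
            $change['aspects']
        ) );
    }
}
// one invalidation (and at most one re-render job) per page, however many item
// edits were in the batch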

Once we implement this, I think polling frequency and batch size are good
tuning knobs we can easily use to control the burden of re-rendering and other
kinds of updates. But we would still have control over how and when changes get
pushed.

-- daniel


Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-06 Thread Tim Starling
On 06/11/12 23:16, Daniel Kinzler wrote:
 On 05.11.2012 05:43, Tim Starling wrote:
 On 02/11/12 22:35, Denny Vrandečić wrote:
 * For re-rendering the page, the wiki needs access to the data.
 We are not sure about how to do this best: have it per cluster,
 or in one place only?

 Why do you need to re-render a page if only the language links are
 changed? Language links are only in the navigation area, the wikitext
 content is not affected.
 
 Because AFAIK language links are cached in the parser output object, and
 rendered into the skin from there. Asking the database for them every time 
 seems
 like overhead if the cached ParserOutput already has them... I believe we
 currently use the one from the PO if it's there. Am I wrong about that?

You can use memcached.

 We could get around this, but even then it would be an optimization for 
 language
 links. But wikidata is soon going to provide data for infoboxes. Any aspect 
 of a
 data item could be used in an {{#if:...}}. So we need to re-render the page
 whenever an item changes.

Wikidata is somewhere around 61000 physical lines of code now. Surely
somewhere in that mountain of code, there is a class for the type of
an item, where an update method can be added.

I don't think it is feasible to parse pages very much more frequently
than they are already parsed as a result of template updates (i.e.
refreshLinks jobs). The CPU cost of template updates is already very
high. Maybe it would be possible if the updates were delayed, run say
once per day, to allow more effective duplicate job removal. Template
updates should probably be handled in the same way.

Of course, with template updates, you don't have to wait for the
refreshLinks job to run before the new content becomes visible,
because page_touched is updated and Squid is purged before the job is
run. That may also be feasible with Wikidata.

If a page is only viewed once a week, you don't want to be rendering
it 5 times per day. The idea is to delay rendering until the page is
actually requested, and to update links periodically.

A page which is viewed once per week is not an unrealistic scenario.
We will probably have bot-generated geographical articles for just
about every town in the world, in 200 or so languages, and all of them
will pull many entities from Wikidata. The majority of those articles
will be visited by search engine crawlers much more often than they
are visited by humans.

The reason I think duplicate removal is essential is because entities
will be updated in batches. For example, a census in a large country
might result in hundreds of thousands of item updates.

What I'm suggesting is not quite the same as what you call
"coalescing" in your design document. Coalescing allows you to reduce
the number of events in recentchanges, and presumably also the number
of Squid purges and page_touched updates. I'm saying that even after
coalescing, changes should be merged further to avoid unnecessary
parsing.

 Also, when the page is edited manually, and then rendered, the wiki needs to
 somehow know a) which item ID is associated with this page and b) it needs to
 load the item data to be able to render the page (just the language links, or
 also infobox data, or eventually also the result of a wikidata query as a 
 list).

You could load the data from memcached while the page is being parsed,
instead of doing it in advance, similar to what we do for images.
Dedicating hundreds of processor cores to parsing articles immediately
after every wikidata change doesn't sound like a great way to avoid a
few memcached queries.
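Something along these lines (untested sketch; the helper names are invented,
loadItemFromRepo() stands in for a remote lookup against the repo's text
table / ES):

// Untested sketch: fetch the item lazily when the parser first needs it,
// instead of pushing it to the client in advance.
function wbcGetItemForPage( Title $title ) {
    global $wgMemc;
    $key = wfMemcKey( 'wbc', 'item', $title->getArticleID() );
    $item = $wgMemc->get( $key );
    if ( $item === false ) {
        $item = loadItemFromRepo( $title ); // hypothetical remote/ES lookup
        $wgMemc->set( $key, $item, 3600 );  // cache for an hour, say
    }
    return $item;
}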

 As I've previously explained, I don't think the langlinks table on the
 client wiki should be updated. So you only need to purge Squid and add
 an entry to Special:RecentChanges.
 
 If the language links from wikidata are not pulled in during rendering and
 stored in the ParserOutput object, and they're also not stored in the langlinks
 table, where are they stored, then?

In the wikidatawiki DB, cached in memcached.

 How should we display them?

Use an OutputPage or Skin hook, such as OutputPageParserOutput.
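E.g. roughly (untested sketch; the cache lookup helper is invented, only the
hook and OutputPage::addLanguageLinks() are real):

// Untested sketch: inject the repo's language links into the skin via a hook,
// without touching the langlinks table. WikidataClientCache is hypothetical.
$wgHooks['OutputPageParserOutput'][] = function ( $out, $parserOutput ) {
    $repoLinks = WikidataClientCache::getLanguageLinks( $out->getTitle() ); // hypothetical
    if ( $repoLinks ) {
        $out->addLanguageLinks( $repoLinks ); // e.g. array( 'de:Berlin', 'fr:Berlin' )
    }
    return true;
};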

 Purging Squid can certainly be done from the context of a wikidatawiki
 job. For RecentChanges the main obstacle is accessing localisation
 text. You could use rc_params to store language-independent message
 parameters, like what we do for log entries.
 
 We also need to resolve localized namespace names so we can put the correct
 namespace id into the RC table. I don't see a good way to do this from the
 context of another wiki (without using the web api).

You can get the namespace names from $wgConf and localisation cache,
and then duplicate the code from Language::getNamespaces() to put it
all together, along the lines of:

$wgConf->loadFullData();  // load settings for all wikis
$extraNamespaces = $wgConf->get( 'wgExtraNamespaces', $wiki );
$metaNamespace = $wgConf->get( 'wgMetaNamespace', $wiki );
$metaNamespaceTalk = $wgConf->get( 'wgMetaNamespaceTalk', $wiki );
list( $site, $lang ) = 

Re: [Wikitech-l] Data flow from Wikidata to the Wikipedias

2012-11-04 Thread Tim Starling
On 02/11/12 22:35, Denny Vrandečić wrote:
 * For re-rendering the page, the wiki needs access to the data.
 We are not sure about how to do this best: have it per cluster,
 or in one place only?

Why do you need to re-render a page if only the language links are
changed? Language links are only in the navigation area, the wikitext
content is not affected.

As I've previously explained, I don't think the langlinks table on the
client wiki should be updated. So you only need to purge Squid and add
an entry to Special:RecentChanges.

Purging Squid can certainly be done from the context of a wikidatawiki
job. For RecentChanges the main obstacle is accessing localisation
text. You could use rc_params to store language-independent message
parameters, like what we do for log entries.
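Schematically, something like this (untested sketch; the rc_params payload is
invented, only the recentchanges columns and DB functions are real):

// Untested sketch of a language-independent recentchanges row, inserted from a
// repo-side job; the payload would be rendered into a localised message on display.
$dbw = wfGetDB( DB_MASTER, array(), $clientWikiId );
$dbw->insert( 'recentchanges', array(
    'rc_namespace' => $namespaceId,
    'rc_title'     => $titleDbKey,
    'rc_timestamp' => $dbw->timestamp(),
    'rc_params'    => serialize( array(
        'wikidata-change' => array( 'item' => $itemId, 'message-key' => $messageKey ),
    ) ),
    // ... remaining rc_* fields filled in as for an ordinary edit
), __METHOD__ );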

-- Tim Starling

