Hi Daniel, thanks for your comments. Some of the suggestions you make
would extend the functionality beyond what we need right now. They
certainly look useful, and I don't think the patch makes implementing
them any harder than it is right now -- rather the opposite.
As usual, the perfect and the next step are great enemies. I understand
that the patch does not directly lead to a perfect world covering all of
your use cases -- but it nicely covers ours. My questions would be:

* Do you think that we are going in the wrong direction, or that we are
  not going far enough yet?
* Do you think that we are making some use cases harder to implement in
  the future than they would be now, and if so, which ones?
* Do you see other issues with the patch that should block it from being
  deployed, and if so, which ones?

Cheers,
Denny

2012/8/10 Daniel Friesen <[email protected]>:
> On 12-08-09 3:55 PM, Jeroen De Dauw wrote:
>> Hey,
>>
>> You bring up some good points.
>>
>> I think we're going to need to have some of this and the
>> synchronization stuff in core. Right now the code has nothing but
>> the one sites table. No repo code, so presumably the only
>> implementation of that for a while will be Wikidata. And if parts of
>> this table are supposed to be editable in some cases but
>> non-editable where there is no repo, then I don't see any way for an
>> edit UI to tell the difference.
>>
>> We indeed need some configuration setting(s) for wikis to
>> distinguish between the two cases. That seems to be all the
>> "synchronisation code" we'll need in core. It might or might not be
>> useful to have more logic in core, or in some dedicated extension.
>> Personally I think having the actual synchronization code in a
>> separate extension would be nice, as a lot of it won't be Wikidata
>> specific. This is however not a requirement for Wikidata, so the
>> current plan is to just have it in the extension, always keeping in
>> mind that it should be easy to split it off later on. I'd love to
>> discuss this point further, but it should be clear this is not much
>> of a blocker for the current code, as it seems unlikely to affect it
>> much, if at all.
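The configuration setting mentioned above could be as small as one
per-wiki flag. A minimal sketch of the idea, in Python for neutrality --
`SitesConfig` and `can_edit_sites` are hypothetical names, not actual
MediaWiki code:

```python
# Hypothetical sketch: one flag distinguishing a repo wiki (site data
# editable locally) from a client wiki (site data synced from a repo).
from dataclasses import dataclass

@dataclass
class SitesConfig:
    # True on the repository wiki, where site data is edited;
    # False on client wikis, which receive site data via sync.
    is_repo: bool = False

def can_edit_sites(config: SitesConfig) -> bool:
    """An edit UI would check this before offering edit controls."""
    return config.is_repo

print(can_edit_sites(SitesConfig(is_repo=True)),
      can_edit_sites(SitesConfig(is_repo=False)))  # True False
```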
>>
>> On that note, consider that we're initially creating the new system
>> in parallel with the old one, which enables us to just try out
>> changes and alter them later on if it turns out there is a better
>> way to do them. Then, once we're confident the new system is what we
>> want to stick to, and know it works because of its usage by
>> Wikidata, we can replace the current code with the new system. This
>> ought to allow us to work a lot faster by not blocking on
>> discussions and details for too long.
>>
>> > I'm also not sure how this synchronization, which sounds like it
>> > is one-way, will play with individual wikis wanting to add new
>> > interwiki links.
>>
>> For our case we only need it to work one way, from the Wikidata repo
>> to its clients. More discussion would need to happen to decide on an
>> alternate approach. I already indicated I think this is not a
>> blocker for the current set of changes, so I'd prefer this to happen
>> after the current code gets merged.
>>
>> I'm talking about things like the interwiki extensions and scripts
>> that turn wiki tables into interwiki lists. All these things are
>> written against the interwiki table. So by rewriting and using a new
>> table we implicitly break all the working tricks and throw the user
>> back into SQL.
>>
>> I am aware of this. As noted already, the current new code does not
>> yet replace the old code, so this is not a blocker yet, but it will
>> be for replacing the old code with the new system. Having looked at
>> the existing code using the old system, I think migration should not
>> be too hard, since the new system can do everything the old one can
>> do and there is not that much code using the current system. The new
>> system also has clear interfaces, so scripts need not know about the
>> database table at all. That ought to help a lot with not depending
>> on a single db table, obviously :)
>>
>> I like the idea of table entries without actual interwikis. The idea
>> of some interface listing user-selectable sites came to mind, and
>> perhaps sites being added trivially, even automatically. Though if
>> you plan to support this, I think you'll need to drop the NOT NULL
>> from site_local_key.
>>
>> I don't think the field needs to allow for null - right now the
>> local keys on the repo will by default be the same as the global
>> keys, so none of them will be null. On your client wiki you will
>> then have these values by default as well. If you don't want a
>> particular site to be usable as "languagelink" or "interwikilink",
>> then simply set this in your local configuration. No need to set the
>> local id to null. Depending on how we actually end up handling the
>> defaulting process, having null might or might not turn out to be
>> useful. This is a detail though, so I'd suggest sticking with NOT
>> NULL for now; if it turns out it'd be more convenient to allow null
>> when writing the sync code, just change it then.
>
> You mean site_config?
> You're suggesting the interwiki system should look for a site by
> site_local_key, and when it finds one, parse out the site_config,
> check if it's disabled, and if so ignore the fact that it found a
> site with that local key? Instead of just not having a
> site_local_key for that row in the first place?
>
>> Actually, another thought makes me think the schema should be a
>> little different. site_local_key probably shouldn't be a column; it
>> should probably be another table. Something like site_local_key
>> (slc_key, slc_site), which would map things like en:, Wikipedia:,
>> etc. to a specific site.
>>
>> Denny and I discussed this at some length, now already more than a
>> month ago (man, this is taking long...). Our conclusions were that
>> we do not need it, nor would we benefit from it much in Wikidata. In
>> fact, it'd introduce additional complexity, which is a good argument
>> for not including it in our already huge project.
I do agree that conceptually
>> it's nicer not to duplicate such info, but if you consider the
>> extra complexity you'd need to get rid of it, and the little gain
>> you'd have (removal of some minor duplication which we've had since
>> forever and which is not bothering anyone), I'm sceptical we ought
>> to go with this approach, even outside of Wikidata.
>
> You've added global ids into this mix. So data duplication, simply
> because one wiki needs a second local name, will mean that one url
> now has two different global ids; this sounds precisely like
> something that is going to get in the way of the whole reason you
> wanted this rewrite. It will also start to create issues with the
> sync code.
> Additionally, the number of duplicates needed is going to vary wiki
> by wiki: en.wikisource is going to need one prefix, Wikipedia:, to
> link to en.wp, while fr.wp is going to need two, Wikipedia: and en:,
> to point to en.wp. I can only see data duplication creating more
> problems than we need.
>
> As for the supposed complexity of this extra table: site_data and
> site_config are blobs of presumably serialized data. You've already
> eliminated the simplicity needed for this to be human-editable from
> SQL, so there is no reason to hold back on making the database
> schema the best it can be. As for deletions, if you're worried about
> keeping them simple, just add a foreign key with cascading deletion.
> Then the rows in site_local_key will automatically be deleted when
> you delete the row in sites, without any extra complexity.
>
>> I think I need to understand the plans you have for synchronization
>> a bit more.
>> - Where does Wikidata get the sites?
>>
>> The repository wiki holds the canonical copy of the sites, which
>> gets sent to all clients. Modification of the site data can only
>> happen on the repository. All wikis (repo and clients) have their
>> own local config, so they can choose to enable all sites for all
>> functionality, completely hide them, or anything in between.
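For illustration, the two points Daniel makes here -- one site row
serving several local prefixes, and a cascading foreign key keeping
site_local_key tidy on deletion -- can be sketched in a few lines of
SQL, shown via Python's sqlite3. The column set is an assumption for
the sketch, not the actual patch schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY,
        site_global_key TEXT NOT NULL UNIQUE,
        site_data       TEXT   -- serialized type-specific blob
    )""")
conn.execute("""
    CREATE TABLE site_local_key (
        slc_key  TEXT NOT NULL UNIQUE,  -- local prefix: "en:", "Wikipedia:", ...
        slc_site INTEGER NOT NULL
                 REFERENCES sites(site_id) ON DELETE CASCADE
    )""")

# One site row, two local prefixes -- e.g. fr.wp pointing both
# "Wikipedia:" and "en:" at en.wp, with a single global id.
conn.execute("INSERT INTO sites VALUES (1, 'enwiki', NULL)")
conn.executemany("INSERT INTO site_local_key VALUES (?, ?)",
                 [("en:", 1), ("Wikipedia:", 1)])

# Deleting the site cascades to its local keys: no orphans, no extra code.
conn.execute("DELETE FROM sites WHERE site_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM site_local_key").fetchone()[0]
print(remaining)  # 0
```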
> Ok, I'm leaning more and more towards the idea that we should make
> the full sites table a second-class index of sites pulled from any
> number of data sources, one that you can carelessly truncate and
> have rebuilt (ie: it has no more value than pagelinks).
> Wikidata's data syncing would be served by creating a secondary
> table with the local link_{key,inline,navigation}, forward, and
> config columns. When you sync, the data from the Wikidata repo and
> the site-local table would be combined to create what goes into the
> index table with the full list of sites.
> Doing it this way frees us from placing restrictions on whatever
> source we get sites from that we shouldn't be placing on them.
> Wikidata gets site-local stuff and global data, and doesn't have to
> worry about whether parts of the row are supposed to be editable or
> not. There is nothing stopping us from making our first non-wikidata
> site source a plaintext file, so we have time to write a really good
> UI. And the UI is free from restrictions placed by using this one
> table, so it's free to do it in whatever way fits a UI best --
> whether that means an editable wikitext page or, better yet, a nice
> UI using that abstract revision system I wanted to build.
>
>> - What synchronizes the data?
>>
>> The repo. As already mentioned, it might be nicer to split this off
>> into its own extension at some point. But before we get to that, we
>> first need to have the current changes merged.
>>
>> Btw, if you really want to make this an abstract list of sites,
>> dropping site_url and the other two related columns might be an
>> idea. At first glance the url looks like something standard that
>> every site would have. But once you throw something like MediaWiki
>> into the mix, with short urls, long urls, and an API, the url
>> really becomes type-specific data that should probably go in the
>> blob. Especially when you start thinking about other custom types.
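Daniel's "second-class index" above could look roughly like this: a
throwaway table rebuilt by merging site rows from any number of
sources, much as pagelinks can be regenerated. All source names and
the row shape here are hypothetical placeholders:

```python
def rebuild_sites_index(sources):
    """Truncate-and-rebuild: merge rows from every source into a fresh
    index keyed by global key, later sources overriding earlier ones."""
    index = {}
    for source in sources:
        for site in source():
            index[site["global_key"]] = site
    return index

def wikidata_repo_source():
    # stand-in for rows synced from a Wikidata repository
    return [{"global_key": "enwiki",
             "url": "https://en.wikipedia.org/wiki/$1"}]

def plaintext_file_source():
    # stand-in for the suggested plaintext-file site source
    return [{"global_key": "gerrit",
             "url": "https://gerrit.wikimedia.org/r/$1"}]

index = rebuild_sites_index([wikidata_repo_source, plaintext_file_source])
print(sorted(index))  # ['enwiki', 'gerrit']
```

Because the index carries no authoritative data, truncating and
rebuilding it is always safe.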
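The point that a single $1 pattern cannot express every site type can
be made concrete with the gerrit example Daniel gives in this thread:
a site type that derives its URL from the shape of the input.
`GerritSite` and `page_url` are hypothetical names for this sketch;
the URL forms follow Daniel's examples:

```python
import re

class GerritSite:
    """Hypothetical site type whose URL depends on the page input,
    rather than on one fixed $1 substitution pattern."""

    def __init__(self, base):
        self.base = base.rstrip("/")

    def page_url(self, target):
        # Change-Id hashes (I + 40 hex digits) use the query-style URL ...
        if re.fullmatch(r"I[0-9a-f]{40}", target):
            return f"{self.base}/#q,{target},n,z"
        # ... while plain change numbers use the /#/c/ form.
        return f"{self.base}/#/c/{target}"

site = GerritSite("https://gerrit.wikimedia.org/r")
print(site.page_url("14295"))
# https://gerrit.wikimedia.org/r/#/c/14295
print(site.page_url("I0a96e58556026d8c923551b07af838ca426a2ab3"))
# https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
```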
>> The patch sitting on gerrit already includes this. (Did you really
>> look at it already? The fields are documented quite well, I'd
>> think.) Every site has a url (that's not specific to the type of
>> site), but we have a type system, currently with the default
>> (general) site type and a MediaWikiSite type. The type system works
>> with two blob fields, one for type-specific data and one for
>> type-specific configuration.
>
> Yeah, I looked at the schema; I know there is a data blob, that's
> what I'm talking about. I mean, while you'd think that a url is
> something every site would have, it's actually more of a
> type-specific piece of data, because some site types can actually
> have multiple urls, etc., which depend on what the page input is. So
> you might as well drop the 3 url-related columns and just use the
> data blob that you already have.
> The $1 pattern may not even work for some sites. For example,
> something like a gerrit type may want to know a specific root path
> for gerrit without any $1 funny business, and then handle what
> actual url gets output in special ways. ie: so that
> [[gerrit:14295]] links to
> https://gerrit.wikimedia.org/r/#/c/14295 while
> [[gerrit:I0a96e58556026d8c923551b07af838ca426a2ab3]] links to
> https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
>
>> Cheers
>>
>> --
>> Jeroen De Dauw
>> http://www.bn2vs.com
>> Don't panic. Don't be evil.
>
> --
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Society for the Promotion of Free Knowledge
(Gesellschaft zur Förderung Freien Wissens e.V.). Registered in the
register of associations of the Amtsgericht Berlin-Charlottenburg
under number 23855 B.
Recognized as a non-profit (gemeinnützig) by the Finanzamt für
Körperschaften I Berlin, tax number 27/681/51985.
