Hi Daniel, thanks for your comments. Some of the suggestions you make
would extend the functionality beyond what we need right now. They
certainly look useful, and I don't think the patch makes implementing
them any harder than it is right now -- rather the opposite.
As usual, the perfect and the next step are great enemies. I understand
that the patch does not directly lead to a perfect world covering all of
your use cases -- but it nicely covers ours. My questions would be:

* Do you think that we are going in the wrong direction, or that we are
  not going far enough yet?
* Do you think that we are making some use cases harder to implement in
  the future than they would be now, and if so, which ones?
* Do you see other issues with the patch that should block it from being
  deployed, and if so, which ones?

Cheers,
Denny

2012/8/10 Daniel Friesen <[email protected]>:
> On 12-08-09 3:55 PM, Jeroen De Dauw wrote:
>> Hey,
>>
>> You bring up some good points.
>>
>> I think we're going to need to have some of this and the
>> synchronization stuff in core. Right now the code has nothing but
>> the one sites table. No repo code, so presumably the only
>> implementation of that for a while will be Wikidata. And if parts of
>> this table are supposed to be editable in some cases but
>> non-editable where there is no repo, then I don't see any way for an
>> edit UI to tell the difference.
>>
>> We indeed need some configuration setting(s) for wikis to
>> distinguish between the two cases. That seems to be all the
>> "synchronisation code" we'll need in core. It might or might not be
>> useful to have more logic in core, or in some dedicated extension.
>> Personally I think having the actual synchronization code in a
>> separate extension would be nice, as a lot of it won't be Wikidata
>> specific. This is however not a requirement for Wikidata, so the
>> current plan is to just have it in the extension, always keeping in
>> mind that it should be easy to split it off later on. I'd love to
>> discuss this point further, but it should be clear this is not much
>> of a blocker for the current code, as it seems unlikely to affect it
>> much, if at all.
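The configuration setting mentioned above could be as small as one
per-wiki flag. A minimal sketch of the idea, in Python for neutrality --
`SitesConfig` and `can_edit_sites` are hypothetical names, not actual
MediaWiki code:

```python
# Hypothetical sketch: one flag distinguishing a repo wiki (site data
# editable locally) from a client wiki (site data synced from a repo).
from dataclasses import dataclass

@dataclass
class SitesConfig:
    # True on the repository wiki, where site data is edited;
    # False on client wikis, which receive site data via sync.
    is_repo: bool = False

def can_edit_sites(config: SitesConfig) -> bool:
    """An edit UI would check this before offering edit controls."""
    return config.is_repo

print(can_edit_sites(SitesConfig(is_repo=True)),
      can_edit_sites(SitesConfig(is_repo=False)))  # True False
```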
>>
>> On that note, consider that we're initially creating the new system
>> in parallel with the old one, which enables us to just try out
>> changes and alter them later on if it turns out there is a better
>> way to do them. Then, once we're confident the new system is what we
>> want to stick to, and know it works because of its usage by
>> Wikidata, we can replace the current code with the new system. This
>> ought to allow us to work a lot faster by not blocking on
>> discussions and details for too long.
>>
>> > I'm also not sure how this synchronization, which sounds like it
>> > is one-way, will play with individual wikis wanting to add new
>> > interwiki links.
>>
>> For our case we only need it to work one way, from the Wikidata repo
>> to its clients. More discussion would need to happen to decide on an
>> alternate approach. I already indicated I think this is not a
>> blocker for the current set of changes, so I'd prefer this to happen
>> after the current code gets merged.
>>
>> I'm talking about things like the interwiki extensions and scripts
>> that turn wiki tables into interwiki lists. All these things are
>> written against the interwiki table. So by rewriting and using a new
>> table we implicitly break all the working tricks and throw the user
>> back into SQL.
>>
>> I am aware of this. As noted already, the current new code does not
>> yet replace the old code, so this is not a blocker yet, but it will
>> be for replacing the old code with the new system. Having looked at
>> the existing code using the old system, I think migration should not
>> be too hard, since the new system can do everything the old one can
>> do and there is not that much code using the current system. The new
>> system also has clear interfaces, so scripts need not know about the
>> database table at all. That ought to help a lot with not depending
>> on a single db table, obviously :)
>>
>> I like the idea of table entries without actual interwikis. The idea
>> of some interface listing user-selectable sites came to mind, and
>> perhaps sites being added trivially, even automatically. Though if
>> you plan to support this, I think you'll need to drop the NOT NULL
>> from site_local_key.
>>
>> I don't think the field needs to allow for null - right now the
>> local keys on the repo will by default be the same as the global
>> keys, so none of them will be null. On your client wiki you will
>> then have these values by default as well. If you don't want a
>> particular site to be usable as "languagelink" or "interwikilink",
>> then simply set this in your local configuration. No need to set the
>> local id to null. Depending on how we actually end up handling the
>> defaulting process, having null might or might not turn out to be
>> useful. This is a detail though, so I'd suggest sticking with NOT
>> NULL for now; if it turns out it'd be more convenient to allow null
>> when writing the sync code, just change it then.
>
> You mean site_config?
> You're suggesting the interwiki system should look for a site by
> site_local_key, and when it finds one, parse out the site_config,
> check if it's disabled, and if so ignore the fact that it found a
> site with that local key? Instead of just not having a
> site_local_key for that row in the first place?
>
>> Actually, another thought makes me think the schema should be a
>> little different. site_local_key probably shouldn't be a column; it
>> should probably be another table. Something like site_local_key
>> (slc_key, slc_site), which would map things like en:, Wikipedia:,
>> etc. to a specific site.
>>
>> Denny and I discussed this at some length, now already more than a
>> month ago (man, this is taking long...). Our conclusions were that
>> we do not need it, nor would we benefit from it much in Wikidata. In
>> fact, it'd introduce additional complexity, which is a good argument
>> for not including it in our already huge project.
I do agree that conceptually
>> it's nicer not to duplicate such info, but if you consider the
>> extra complexity you'd need to get rid of it, and the little gain
>> you'd have (removal of some minor duplication which we've had since
>> forever and which is not bothering anyone), I'm sceptical we ought
>> to go with this approach, even outside of Wikidata.
>
> You've added global ids into this mix. So data duplication, simply
> because one wiki needs a second local name, will mean that one url
> now has two different global ids; this sounds precisely like
> something that is going to get in the way of the whole reason you
> wanted this rewrite. It will also start to create issues with the
> sync code.
> Additionally, the number of duplicates needed is going to vary wiki
> by wiki: en.wikisource is going to need one prefix, Wikipedia:, to
> link to en.wp, while fr.wp is going to need two, Wikipedia: and en:,
> to point to en.wp. I can only see data duplication creating more
> problems than we need.
>
> As for the supposed complexity of this extra table: site_data and
> site_config are blobs of presumably serialized data. You've already
> eliminated the simplicity needed for this to be human-editable from
> SQL, so there is no reason to hold back on making the database
> schema the best it can be. As for deletions, if you're worried about
> keeping them simple, just add a foreign key with cascading deletion.
> Then the rows in site_local_key will automatically be deleted when
> you delete the row in sites, without any extra complexity.
>
>> I think I need to understand the plans you have for synchronization
>> a bit more.
>> - Where does Wikidata get the sites?
>>
>> The repository wiki holds the canonical copy of the sites, which
>> gets sent to all clients. Modification of the site data can only
>> happen on the repository. All wikis (repo and clients) have their
>> own local config, so they can choose to enable all sites for all
>> functionality, completely hide them, or anything in between.
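For illustration, the two points Daniel makes here -- one site row
serving several local prefixes, and a cascading foreign key keeping
site_local_key tidy on deletion -- can be sketched in a few lines of
SQL, shown via Python's sqlite3. The column set is an assumption for
the sketch, not the actual patch schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY,
        site_global_key TEXT NOT NULL UNIQUE,
        site_data       TEXT   -- serialized type-specific blob
    )""")
conn.execute("""
    CREATE TABLE site_local_key (
        slc_key  TEXT NOT NULL UNIQUE,  -- local prefix: "en:", "Wikipedia:", ...
        slc_site INTEGER NOT NULL
                 REFERENCES sites(site_id) ON DELETE CASCADE
    )""")

# One site row, two local prefixes -- e.g. fr.wp pointing both
# "Wikipedia:" and "en:" at en.wp, with a single global id.
conn.execute("INSERT INTO sites VALUES (1, 'enwiki', NULL)")
conn.executemany("INSERT INTO site_local_key VALUES (?, ?)",
                 [("en:", 1), ("Wikipedia:", 1)])

# Deleting the site cascades to its local keys: no orphans, no extra code.
conn.execute("DELETE FROM sites WHERE site_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM site_local_key").fetchone()[0]
print(remaining)  # 0
```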
> Ok, I'm leaning more and more towards the idea that we should make
> the full sites table a second-class index of sites pulled from any
> number of data sources, one that you can carelessly truncate and
> have rebuilt (ie: it has no more value than pagelinks).
> Wikidata's data syncing would be served by creating a secondary
> table with the local link_{key,inline,navigation}, forward, and
> config columns. When you sync, the data from the Wikidata repo and
> the site-local table would be combined to create what goes into the
> index table with the full list of sites.
> Doing it this way frees us from placing restrictions on whatever
> source we get sites from that we shouldn't be placing on them.
> Wikidata gets site-local stuff and global data, and doesn't have to
> worry about whether parts of the row are supposed to be editable or
> not. There is nothing stopping us from making our first non-wikidata
> site source a plaintext file, so we have time to write a really good
> UI. And the UI is free from restrictions placed by using this one
> table, so it's free to do it in whatever way fits a UI best --
> whether that means an editable wikitext page or, better yet, a nice
> UI using that abstract revision system I wanted to build.
>
>> - What synchronizes the data?
>>
>> The repo. As already mentioned, it might be nicer to split this off
>> into its own extension at some point. But before we get to that, we
>> first need to have the current changes merged.
>>
>> Btw, if you really want to make this an abstract list of sites,
>> dropping site_url and the other two related columns might be an
>> idea. At first glance the url looks like something standard that
>> every site would have. But once you throw something like MediaWiki
>> into the mix, with short urls, long urls, and an API, the url
>> really becomes type-specific data that should probably go in the
>> blob. Especially when you start thinking about other custom types.
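Daniel's "second-class index" above could look roughly like this: a
throwaway table rebuilt by merging site rows from any number of
sources, much as pagelinks can be regenerated. All source names and
the row shape here are hypothetical placeholders:

```python
def rebuild_sites_index(sources):
    """Truncate-and-rebuild: merge rows from every source into a fresh
    index keyed by global key, later sources overriding earlier ones."""
    index = {}
    for source in sources:
        for site in source():
            index[site["global_key"]] = site
    return index

def wikidata_repo_source():
    # stand-in for rows synced from a Wikidata repository
    return [{"global_key": "enwiki",
             "url": "https://en.wikipedia.org/wiki/$1"}]

def plaintext_file_source():
    # stand-in for the suggested plaintext-file site source
    return [{"global_key": "gerrit",
             "url": "https://gerrit.wikimedia.org/r/$1"}]

index = rebuild_sites_index([wikidata_repo_source, plaintext_file_source])
print(sorted(index))  # ['enwiki', 'gerrit']
```

Because the index carries no authoritative data, truncating and
rebuilding it is always safe.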
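The point that a single $1 pattern cannot express every site type can
be made concrete with the gerrit example Daniel gives in this thread:
a site type that derives its URL from the shape of the input.
`GerritSite` and `page_url` are hypothetical names for this sketch;
the URL forms follow Daniel's examples:

```python
import re

class GerritSite:
    """Hypothetical site type whose URL depends on the page input,
    rather than on one fixed $1 substitution pattern."""

    def __init__(self, base):
        self.base = base.rstrip("/")

    def page_url(self, target):
        # Change-Id hashes (I + 40 hex digits) use the query-style URL ...
        if re.fullmatch(r"I[0-9a-f]{40}", target):
            return f"{self.base}/#q,{target},n,z"
        # ... while plain change numbers use the /#/c/ form.
        return f"{self.base}/#/c/{target}"

site = GerritSite("https://gerrit.wikimedia.org/r")
print(site.page_url("14295"))
# https://gerrit.wikimedia.org/r/#/c/14295
print(site.page_url("I0a96e58556026d8c923551b07af838ca426a2ab3"))
# https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
```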
>> The patch sitting on gerrit already includes this. (Did you really
>> look at it already? The fields are documented quite well, I'd
>> think.) Every site has a url (that's not specific to the type of
>> site), but we have a type system, currently with the default
>> (general) site type and a MediaWikiSite type. The type system works
>> with two blob fields, one for type-specific data and one for
>> type-specific configuration.
>
> Yeah, I looked at the schema; I know there is a data blob, that's
> what I'm talking about. I mean, while you'd think that a url is
> something every site would have, it's actually more of a
> type-specific piece of data, because some site types can actually
> have multiple urls, etc., which depend on what the page input is. So
> you might as well drop the 3 url-related columns and just use the
> data blob that you already have.
> The $1 pattern may not even work for some sites. For example,
> something like a gerrit type may want to know a specific root path
> for gerrit without any $1 funny business, and then handle what
> actual url gets output in special ways. ie: so that
> [[gerrit:14295]] links to
> https://gerrit.wikimedia.org/r/#/c/14295 while
> [[gerrit:I0a96e58556026d8c923551b07af838ca426a2ab3]] links to
> https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
>
>> Cheers
>>
>> --
>> Jeroen De Dauw
>> http://www.bn2vs.com
>> Don't panic. Don't be evil.
>
> --
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Society for the Promotion of Free Knowledge
(Gesellschaft zur Förderung Freien Wissens e.V.). Registered in the
register of associations of the Amtsgericht Berlin-Charlottenburg
under number 23855 B.
Recognized as a non-profit (gemeinnützig) by the Finanzamt für
Körperschaften I Berlin, tax number 27/681/51985.
