I think option #3 is best, since the table isn't *that* big (only one row per page) and seems least bad option in terms of code design.
Cheers, Katie On Mon, Jul 14, 2014 at 12:30 PM, Daniel Kinzler < [email protected]> wrote: > This is about the question of how to best store entity redirects in the > database. Below I try to describe the problem and the possible solutions. > Any > input welcome. > > > A quick primer: > > Wikibase redirects are really Entity-ID aliases. They correspond to, but > are not > the same as, MediaWiki's page based redirects. > > If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a > redirect to > Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead* > of an > entity. Entities never *are* redirects. > > Wikibase currently stores a mapping of entity ids to page ids in the > wb_entity_per_page table (epp table for short). > > MediaWiki core stores redirects as a kind of "link table", with rd_from > being a > page_id, and rd_to+rd_namespace being name+namespace of the redirect > target. > > > Requirements: > > * When looking up an EntityId, we need to be able to load the > corresponding JSON > blob, and for that we need to find the corresponding wiki page (either by > id, or > by name+namespace). We need to be able to do this cross-wiki, so we may > not have > the repo's configuration (wrt namespaces, etc) available when constructing > the > query. > > * We need an efficient way to list all entity IDs on a wiki (without > redirects). > In particular, the mechanism for listing entities must support efficient > paging. > > * We need an efficient way to resolve redirects in bulk, or at least, to > discern > redirects from unknown/deleted entity ids. > > > Options: > > 1) No redirects in the epp table (current). This means we need to use the > name+namespace when loading the entity-or-redirect from a page, since we > don't > know the page ID if it's a redirects. We also can't use core's redirect > table, > because for that, we also need to know the page id first. In order to use > name+namespace for looking up page IDs for entities, client wikis would > need to > know the namespace IDs used on the repo, in order to generate queries > against > the repo's database. > > 2) Put redirects into the epp table as well, without any special marking. > This > makes lookups easy, but gives us no efficient way to list all entities > without > redirects. We'd need to check and skip redirects while iterating. This > would add > complexity to several maintenance and upgrade scripts. > > 3) Put redirects into the epp table, with a marker (or target id) in a new > column. This would allow for both, simple lookup and efficient listing, > but it > means adding a column (and an index) to an already large table in > production. It > also means having the overhead of a column that's mostly null. > > 4) Put redirects into epp *and* a separate table. Provides simple lookup, > but > means a potentially slow join when listing entities. This join would happen > multiple times each time we need to list all entities, because of paged > access - > compare how JsonDumpGenerator works. > > 5) Put redirects into a special table but not into epp. This means > fast/simple > listing of entities, but requires a not-so-nice "try" logic when looking up > entities: if no entry is found in the epp table, we then need to go on and > try > the entity-redirect table, to see whether the id is redirected or > unknown/deleted. > > > Assessment: > > 1) is nasty in terms of cross-wiki configuration. It's the simplest > solution on > the code and database levels, but seems brittle. > > 2) adds complexity to everything that lists entities. Big performance > impact in > cases where entity blobs would otherwise not have been loaded, but are > loaded > now to check whether they contain redirects. > > 3) is somewhat wasteful on the database level, and needs a schema change > deployment on a large table. Don't know how bad that would be, though. > > 4) may cause performance issues because it adds complexity to big queries > on > large tables. Needs trivial schema change deployment (new table). > > 5) adds complexity to the code that reads entity blobs from the database, > impacts performance for the "redirect" and "missing entity" cases by > adding a > database query. Could be acceptable if these cases are rare. Needs trivial > schema change deployment (new table). > > -- daniel > > -- > Daniel Kinzler > Senior Software Developer > > Wikimedia Deutschland > Gesellschaft zur Förderung Freien Wissens e.V. > > _______________________________________________ > Wikidata-tech mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/wikidata-tech > -- Katie Filbert Wikidata Developer Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin Phone (030) 219 158 26-0 http://wikimedia.de Wikimedia Germany - Society for the Promotion of free knowledge eV Entered in the register of Amtsgericht Berlin-Charlottenburg under the number 23 855 as recognized as charitable by the Inland Revenue for corporations I Berlin, tax number 27/681/51985.
_______________________________________________ Wikidata-tech mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
