Re: [Wikidata-tech] How to record redirects in the database

Katie Filbert Tue, 15 Jul 2014 05:52:26 -0700

I think option #3 is best, since the table isn't *that* big (only one row
per page) and seems least bad option in terms of code design.


Cheers,
Katie


On Mon, Jul 14, 2014 at 12:30 PM, Daniel Kinzler <
[email protected]> wrote:

> This is about the question of how to best store entity redirects in the
> database. Below I try to describe the problem and the possible solutions.
> Any
> input welcome.
>
>
> A quick primer:
>
> Wikibase redirects are really Entity-ID aliases. They correspond to, but
> are not
> the same as, MediaWiki's page based redirects.
>
> If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a
> redirect to
> Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead*
> of an
> entity. Entities never *are* redirects.
>
> Wikibase currently stores a mapping of entity ids to page ids in the
> wb_entity_per_page table (epp table for short).
>
> MediaWiki core stores redirects as a kind of "link table", with rd_from
> being a
> page_id, and rd_to+rd_namespace being name+namespace of the redirect
> target.
>
>
> Requirements:
>
> * When looking up an EntityId, we need to be able to load the
> corresponding JSON
> blob, and for that we need to find the corresponding wiki page (either by
> id, or
> by name+namespace). We need to be able to do this cross-wiki, so we may
> not have
> the repo's configuration (wrt namespaces, etc) available when constructing
> the
> query.
>
> * We need an efficient way to list all entity IDs on a wiki (without
> redirects).
> In particular, the mechanism for listing entities must support efficient
> paging.
>
> * We need an efficient way to resolve redirects in bulk, or at least, to
> discern
> redirects from unknown/deleted entity ids.
>
>
> Options:
>
> 1) No redirects in the epp table (current). This means we need to use the
> name+namespace when loading the entity-or-redirect from a page, since we
> don't
> know the page ID if it's a redirects. We also can't use core's redirect
> table,
> because for that, we also need to know the page id first. In order to use
> name+namespace for looking up page IDs for entities, client wikis would
> need to
> know the namespace IDs used on the repo, in order to generate queries
> against
> the repo's database.
>
> 2) Put redirects into the epp table as well, without any special marking.
> This
> makes lookups easy, but gives us no efficient way to list all entities
> without
> redirects. We'd need to check and skip redirects while iterating. This
> would add
> complexity to several maintenance and upgrade scripts.
>
> 3) Put redirects into the epp table, with a marker (or target id) in a new
> column. This would allow for both, simple lookup and efficient listing,
> but it
> means adding a column (and an index) to an already large table in
> production. It
> also means having the overhead of a column that's mostly null.
>
> 4) Put redirects into epp *and* a separate table. Provides simple lookup,
> but
> means a potentially slow join when listing entities. This join would happen
> multiple times each time we need to list all entities, because of paged
> access -
> compare how JsonDumpGenerator works.
>
> 5) Put redirects into a special table but not into epp. This means
> fast/simple
> listing of entities, but requires a not-so-nice "try" logic when looking up
> entities: if no entry is found in the epp table, we then need to go on and
> try
> the entity-redirect table, to see whether the id is redirected or
> unknown/deleted.
>
>
> Assessment:
>
> 1) is nasty in terms of cross-wiki configuration. It's the simplest
> solution on
> the code and database levels, but seems brittle.
>
> 2) adds complexity to everything that lists entities. Big performance
> impact in
> cases where entity blobs would otherwise not have been loaded, but are
> loaded
> now to check whether they contain redirects.
>
> 3) is somewhat wasteful on the database level, and needs a schema change
> deployment on a large table. Don't know how bad that would be, though.
>
> 4) may cause performance issues because it adds complexity to big queries
> on
> large tables. Needs trivial schema change deployment (new table).
>
> 5) adds complexity to the code that reads entity blobs from the database,
> impacts performance for the "redirect" and "missing entity" cases by
> adding a
> database query. Could be acceptable if these cases are rare. Needs trivial
> schema change deployment (new table).
>
> -- daniel
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikidata-tech mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>



-- 
Katie Filbert
Wikidata Developer

Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin
Phone (030) 219 158 26-0

http://wikimedia.de

Wikimedia Germany - Society for the Promotion of free knowledge eV Entered
in the register of Amtsgericht Berlin-Charlottenburg under the number 23
855 as recognized as charitable by the Inland Revenue for corporations I
Berlin, tax number 27/681/51985.

_______________________________________________
Wikidata-tech mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

Re: [Wikidata-tech] How to record redirects in the database

Reply via email to