Re: [Wikitech-l] Indexing structures for Wikidata

bawolff Fri, 08 Mar 2013 10:18:09 -0800

On Thu, Mar 7, 2013 at 12:50 PM, Denny Vrandečić
<[email protected]> wrote:
> As you probably know, the search in Wikidata sucks big time.
>
> Until we have created a proper Solr-based search and deployed on that
> infrastructure, we would like to implement and set up a reasonable stopgap
> solution.
>
> The simplest and most obvious signal for sorting the items would be to
> 1) make a prefix search
> 2) weight all results by the number of Wikipedias it links to
>
> This should usually provide the item you are looking for. Currently, the
> search order is random. Good luck with finding items like California,
> Wellington, or Berlin.
>
> Now, what I want to ask is, what would be the appropriate index structure
> for that table. The data is saved in the wb_terms table, which would need
> to be extended by a "weight" field. There is already a suggestion (based on
> discussions between Tim and Daniel K if I understood correctly) to change
> the wb_terms table index structure (see here <
> https://bugzilla.wikimedia.org/show_bug.cgi?id=45529> ), but since we are
> changing the index structure anyway it would be great to get it right this
> time.
>
> Anyone who can jump in? (Looking especially at Asher and Tim)
>
> Any help would be appreciated.
>
> Cheers,
> Denny
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
> Körperschaften I Berlin, Steuernummer 27/681/51985.
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


AFAIK SQL isn't particularly good at indexing for that type of query.

You could maybe have a bunch of indexes on the first couple of letters
of a term, and then hope that past some point things are narrowed
down enough that just doing a plain prefix search is acceptable. For
example, you might have indexes on (wb_term(1), wb_weight),
(wb_term(2), wb_weight), ..., (wb_term(7), wb_weight) and one on just
wb_term. That way (I believe) you would be able to do efficient
searches for a prefix ordered by weight, provided the prefix is at
most 7 characters long. (7 was picked arbitrarily out of a hat.
Performance, of writes at least, goes down as you add more indexes from
what I understand. I'm not sure how far you would be able to take this
scheme before that becomes an issue. You could maybe enhance this by
only updating the search suggestions for every 2 characters the user
enters or something).
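
To make the scheme above a bit more concrete, here is a minimal, untested-at-scale
sketch in Python using SQLite rather than MySQL. SQLite has no prefix-length
indexes like wb_term(2), so expression indexes on substr(wb_term, 1, n) stand in
for them. The table name terms, the functions build_db and search, and the
range-scan fallback for longer prefixes are all assumptions made for
illustration; none of this reflects the real wb_terms schema.

```python
import sqlite3

MAX_PREFIX = 7  # the arbitrary cutoff mentioned above


def build_db(rows):
    """Create an in-memory table of (wb_term, wb_weight) rows plus the indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE terms (wb_term TEXT NOT NULL, wb_weight INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO terms VALUES (?, ?)", rows)
    # One composite index per prefix length: (first n characters, weight).
    for n in range(1, MAX_PREFIX + 1):
        conn.execute(
            f"CREATE INDEX terms_p{n} "
            f"ON terms (substr(wb_term, 1, {n}), wb_weight)"
        )
    # Plain index on the full term for prefixes longer than MAX_PREFIX.
    conn.execute("CREATE INDEX terms_full ON terms (wb_term)")
    return conn


def search(conn, prefix, limit=10):
    """Return up to `limit` (term, weight) rows matching `prefix`, heaviest first."""
    n = len(prefix)
    if 1 <= n <= MAX_PREFIX:
        # Equality on the first n characters, so the matching composite
        # index can also supply the ordering by weight.
        sql = (
            "SELECT wb_term, wb_weight FROM terms "
            f"WHERE substr(wb_term, 1, {n}) = ? "
            "ORDER BY wb_weight DESC LIMIT ?"
        )
        return conn.execute(sql, (prefix, limit)).fetchall()
    # Longer (or empty) prefix: range condition on the full term, then
    # sort whatever matched by weight.
    upper = prefix + "\U0010ffff"  # highest code point as an upper bound
    sql = (
        "SELECT wb_term, wb_weight FROM terms "
        "WHERE wb_term >= ? AND wb_term < ? "
        "ORDER BY wb_weight DESC LIMIT ?"
    )
    return conn.execute(sql, (prefix, upper, limit)).fetchall()
```

Calling search(conn, "Ber") goes through the 3-character index, while
search(conn, "Wellington") falls back to the full-term index and sorts the
(hopefully few) matches. Matching here is case-sensitive, and whether a real
MySQL setup would actually avoid a filesort with prefix indexes would need to
be checked with EXPLAIN.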

--bawolff

p.s. Have not tested this, and talking a bit outside my knowledge area, so ymmv

