On Thu, Mar 7, 2013 at 12:50 PM, Denny Vrandečić <denny.vrande...@wikimedia.de> wrote:
> As you probably know, the search in Wikidata sucks big time.
>
> Until we have created a proper Solr-based search and deployed on that
> infrastructure, we would like to implement and set up a reasonable stopgap
> solution.
>
> The simplest and most obvious signal for sorting the items would be to
> 1) make a prefix search
> 2) weight all results by the number of Wikipedias it links to
>
> This should usually provide the item you are looking for. Currently, the
> search order is random. Good luck with finding items like California,
> Wellington, or Berlin.
>
> Now, what I want to ask is, what would be the appropriate index structure
> for that table. The data is saved in the wb_terms table, which would need
> to be extended by a "weight" field. There is already a suggestion (based on
> discussions between Tim and Daniel K if I understood correctly) to change
> the wb_terms table index structure (see here
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=45529>), but since we are
> changing the index structure anyway it would be great to get it right this
> time.
>
> Anyone who can jump in? (Looking especially at Asher and Tim)
>
> Any help would be appreciated.
>
> Cheers,
> Denny
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 B. Recognized as a charitable
> organization by the Finanzamt für Körperschaften I Berlin, tax number
> 27/681/51985.
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
AFAIK SQL isn't particularly good at indexing that type of query. You could
maybe have a bunch of indexes on the first few letters of a term, and then
past some point hope that things are narrowed down enough that just doing a
plain prefix search is acceptable.

For example, you might have indexes on (wb_term(1), wb_weight),
(wb_term(2), wb_weight), ..., (wb_term(7), wb_weight), and one on just
wb_term. That way (I believe) you would be able to do efficient searches for
a prefix ordered by weight, provided the prefix is no longer than 7
characters.

(7 was picked arbitrarily out of a hat. Performance goes down as you add
more indexes, from what I understand, and I'm not sure how far you could
take this scheme before that becomes an issue. You could maybe enhance this
by only updating the search suggestions for every 2 characters the user
enters, or something.)

--bawolff

p.s. I have not tested this, and I'm talking a bit outside my knowledge
area, so YMMV.
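
To make the indexing scheme above a bit more concrete, here is a rough, untested-in-production sketch. It uses Python's sqlite3 with SQLite expression indexes on substr(wb_term, 1, n) as a stand-in for the (wb_term(n), wb_weight) prefix indexes; that substitution is an assumption, and MySQL prefix indexes may well behave differently. The table layout, the column names wb_term and wb_weight, the helper names build_db and prefix_search, and the cutoff of 7 are all hypothetical and follow the wording of the message rather than the real wb_terms schema.

```python
import sqlite3

MAX_PREFIX = 7  # arbitrary cutoff, same number as in the message above


def build_db(rows):
    """Create an in-memory table plus one (prefix, weight) index per prefix length."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms (wb_term TEXT NOT NULL, wb_weight INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO wb_terms VALUES (?, ?)", rows)
    for n in range(1, MAX_PREFIX + 1):
        # Expression index standing in for an index on (wb_term(n), wb_weight).
        conn.execute(
            f"CREATE INDEX wb_terms_p{n} "
            f"ON wb_terms (substr(wb_term, 1, {n}), wb_weight DESC)"
        )
    # Plain index on the full term, for prefixes longer than MAX_PREFIX.
    conn.execute("CREATE INDEX wb_terms_full ON wb_terms (wb_term)")
    return conn


def prefix_search(conn, prefix, limit=5):
    """Return up to `limit` (term, weight) rows matching `prefix`, heaviest first."""
    n = len(prefix)
    if 1 <= n <= MAX_PREFIX:
        # Equality on the n-character prefix, so the matching index can
        # deliver rows already ordered by weight.
        sql = (
            "SELECT wb_term, wb_weight FROM wb_terms "
            f"WHERE substr(wb_term, 1, {n}) = ? "
            "ORDER BY wb_weight DESC LIMIT ?"
        )
        return conn.execute(sql, (prefix, limit)).fetchall()
    # Otherwise: range scan on the full term, then sort the (hopefully
    # small) result set by weight.
    sql = (
        "SELECT wb_term, wb_weight FROM wb_terms "
        "WHERE wb_term >= ? AND wb_term < ? "
        "ORDER BY wb_weight DESC LIMIT ?"
    )
    upper = prefix + "\U0010ffff"  # sorts after any term starting with prefix
    return conn.execute(sql, (prefix, upper, limit)).fetchall()
```

Calling prefix_search(conn, "ber", 3) on a table built with build_db would take the indexed branch, while a prefix such as "berlin, n" (9 characters) would fall back to the range scan on the full term. Matching here is case-sensitive, and whether the planner actually picks these indexes should be checked with EXPLAIN QUERY PLAN (or EXPLAIN on MySQL) before relying on any of this.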