Interesting questions... my comments are inline.

On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev <[email protected]> wrote:
> Hi!
>
> As I am working on improving Wikidata fulltext search[1], I'd like to
> talk about search results page. Right now search results page for
> Wikidata is less than ideal, here are the issues I see with it:
>
> - No match highlighting

I think match highlighting would be nice, but I know it can be tricky in the edge cases.

> - Meaningless data, like word count (anybody cares to guess what it is
> counting? Anybody ever used it?) and byte count (more useful than word
> count but not by much)

I don't know who is interested in that, so I don't have a strong opinion.

> - Obviously, search quality is not super high, but that should be
> improved with proper description indexing
>
> While working on improving the situation, I would like to solicit
> opinions on the set of questions about how the search results page
> should look like. Namely:
>
> 1. If the match is made on label/description that does not match current
> display language, we could opt for:
>    a) Displaying the description that matched, highlighted. Optionally
>    maybe display the language of the match (in display language?)
>    b) Displaying the description in display language, un-highlighted.
> Which option is preferable?

I would definitely like to see the label that matched. Even if you don't know the language, seeing a partial match vs a full match is informative. If I search for *Москва*, and I get back "Moscow" and "Armenian Cemetery", I don't know what's what. Seeing that Moscow is "Russian: *Москва*" and Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me immediately that Moscow is probably a better match, even if I don't know any Russian or Cyrillic.

There's a problem, though, which may be why this hasn't been done: *which* label do you match? For Armenian Cemetery, both Russian and Ukrainian have "Москва" in the label.
For Moscow, there are 18 labels that are "Москва", another one that is a partial match (Москва балһсн), another that's a folded match (Мӧсква), and three more that have exact matches in their additional labels (including English).

Unless you can define a hierarchy of languages (possibly including user languages and the "native" language of an entry), it's going to be hard to pick one. If I'd searched for *Moskva* and didn't have English as a user language, it'd be impossible to choose one of the 32 possible languages that are exact matches on the main label. *Moskwa* also doesn't match any of my user languages, or Russian, but does match a bunch of other languages. How to choose?

Any names will have similar problems. "Jacek Moskwa" is the same in all 12 languages with a label. His descriptions say he's Polish, so I guess Polish is the right answer, but I don't think there's any way to know that.

So, ideally, *I'd* like the name of the language that had a label match in my display language, with a highlight of the matching bit in the description from the matched language, but I'm not sure there's a way to get there. Picking the first one alphabetically that matches will give weird results.

> 2. What we do if the match is on alias? Do we display matching alias,
> original label or both? The question above also applies if the match is
> on other language alias.

I'd want to see both, maybe as "West Germany (*FRG*)" if I search for FRG. Hey, the autocompletion suggester does that already!

> 3. It looks clear to me that words count is useless. Is byte count
> useful and does it need to be kept?
>
> 4. Do we want to display any other parameters of the entity? E.g. we
> have in the index: statement_count, sitelink_count, label_count,
> incoming_links, etc. Do we want to display any?

Statement count is the one that is most interesting to me, but I wonder if anyone really uses any of these stats. Someone must, but I don't know their use cases.

> 5. Display format for Wikidata and for other wikipedia sites is different:
> Wikipedia:
>
> Title
> Snippet
>
> Wikidata:
>
> Title: Description
>
> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
> the same line, separated by colon. Is there any reason for this
> difference? Do we want to go back to the common format?

I can see that "Title: Description" saves some vertical space, but I would prefer the description to be on the next line.

> Also if you have any other things/ideas/comments about how fulltext
> search output for wikidata should be, please tell me.

Since Moscow has Москва as an additional label in English, I'm not sure if I'd also want to see a line with "Russian: Москва", too, so I left it out and used just the English alias for the city. I also got tired of counting statements on the city, so I just made something up.

Moscow (*Москва*) (Q649) <https://www.wikidata.org/wiki/Q649>
capital city and the largest city of Russia; separate federal subject of Russia
386 KB (537 statements) - 08:33, 15 October 2017

Moskva River (Q175117) <https://www.wikidata.org/wiki/Q175117>
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

FC Moscow (Q392115) <https://www.wikidata.org/wiki/Q392115>
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Moscow 24 (Q1572348) <https://www.wikidata.org/wiki/Q1572348>
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Armenian Cemetery (Q685338) <https://www.wikidata.org/wiki/Q685338>
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017

... although pulling out the Russian specifically is probably not possible. You've set yourself a complicated task!!

> I am sending this to wikidata-tech and discovery team list only for now,
> since it's still work in progress and half-baked, we could open this for
> wider discussion later if necessary.
>
> [1] https://phabricator.wikimedia.org/T178851
>
> Thanks,
> --
> Stas Malyshev
> [email protected]

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
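
For illustration only, the "hierarchy of languages" idea from the reply to question 1 could be sketched roughly as below. This is not how the search backend works, and every name here (`LabelMatch`, `pick_label`, `native_lang`) is hypothetical. It assumes the caller already has the list of labels that matched the query, and that a "native" language for the entity is known, which the reply above notes may not be possible. The language codes in the usage example are illustrative too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelMatch:
    lang: str    # language code of the matching label (illustrative)
    text: str    # full text of the label that matched
    exact: bool  # True if the whole label equals the query


def pick_label(matches, user_langs, native_lang=None):
    """Pick one matching label to display, or None if nothing matched.

    Assumed priority (a guess, not an established rule): the user's
    languages in order, then the entity's "native" language if known.
    """
    priority = list(user_langs)
    if native_lang and native_lang not in priority:
        priority.append(native_lang)

    def sort_key(match):
        # Languages outside the hierarchy all share the lowest rank.
        rank = priority.index(match.lang) if match.lang in priority else len(priority)
        # Exact matches first, then language rank, then language code
        # as an arbitrary (alphabetical) tiebreak.
        return (not match.exact, rank, match.lang)

    return min(matches, key=sort_key, default=None)


matches = [
    LabelMatch("uk", "Москва", True),
    LabelMatch("ru", "Москва", True),
    LabelMatch("xal", "Москва балһсн", False),
]
best = pick_label(matches, user_langs=["en", "de"], native_lang="ru")
print(f"{best.lang}: {best.text}")
```

When no language in the hierarchy matches, this sketch falls back to the alphabetical tiebreak, which is exactly the case described above as giving weird results. So the hard part remains defining the hierarchy, not applying it.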
_______________________________________________
Wikidata-tech mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
