Interesting questions... my comments are inline.

On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev <[email protected]>
wrote:

> Hi!
>
> As I am working on improving Wikidata fulltext search[1], I'd like to
> talk about search results page. Right now search results page for
> Wikidata is less than ideal, here are the issues I see with it:
>
> - No match highlighting
>

I think match highlighting would be nice, but I know it can be tricky in
the edge cases.


> - Meaningless data, like word count (anybody cares to guess what it is
> counting? Anybody ever used it?) and byte count (more useful than word
> count but not by much)
>

I don't know who is interested in that, so I don't have a strong opinion.


> - Obviously, search quality is not super high, but that should be
> improved with proper description indexing
>
> While working on improving the situation, I would like to solicit
> opinions on the set of questions about how the search results page
> should look like. Namely:
>
> 1. If the match is made on label/description that does not match current
> display language, we could opt for:
> a) Displaying the description that matched, highlighted. Optionally
> maybe display the language of the match (in display language?)
> b) Displaying the description in display language, un-highlighted.
> Which option is preferable?
>

I would definitely like to see the label that matched. Even if you don't
know the language, seeing a partial match vs a full match is informative.
If I search for *Москва*, and I get back "Moscow" and "Armenian Cemetery", I
don't know what's what. Seeing that Moscow is "Russian: *Москва*" and
Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me
immediately that Moscow is probably a better match, even if I don't know
any Russian or Cyrillic.

There's a problem, though, which may be why this hasn't been done: *which*
label do you match? For Armenian Cemetery, both Russian and Ukrainian have
"Москва" in the label. For Moscow, there are 18 labels that are "Москва",
another one that is a partial match (Москва балһсн), another that's a folded
match (Мӧсква), and three more that have exact matches in their additional
labels (including English). Unless you can define a hierarchy of
languages (possibly including user languages and the "native" language of an
entry), it's going to be hard to pick one. If I'd searched for *Moskva* and
didn't have English as a user language, it'd be impossible to choose one of
the 32 possible languages that are exact matches on the main label.
*Moskwa* also doesn't match any of my user languages, or Russian, but does
match a bunch of other languages. How to choose?

Any names will have similar problems. "Jacek Moskwa" is the same in all 12
languages with a label. His descriptions say he's Polish, so I guess Polish
is the right answer, but I don't think there's any way to know that.

So, ideally, *I'd* like the name of the language that had a label match,
shown in my display language, with a highlight of the matching bit in the
description from the matched language, but I'm not sure there's a way to get
there. Picking the first one alphabetically that matches will give weird
results.
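
The language-hierarchy idea above can be sketched roughly as follows. This is
only an illustration, not how CirrusSearch or Wikibase actually works: the
function names, the `native_lang` parameter, and the choice to rank exact
matches ahead of partial ones are all assumptions, and folded matches such as
Мӧсква are ignored. Ties fall back to the alphabetical order of the language
code, which is exactly the case that gives weird results.

```python
from typing import Dict, List, Optional, Tuple


def match_rank(label: str, query: str) -> int:
    """Return 0 for an exact match, 1 for a partial match, 2 for no match."""
    label_cf, query_cf = label.casefold(), query.casefold()
    if label_cf == query_cf:
        return 0
    if query_cf in label_cf:
        return 1
    return 2


def pick_matching_label(
    labels: Dict[str, str],
    query: str,
    user_langs: List[str],
    native_lang: Optional[str] = None,  # hypothetical: the entity's "native" language
) -> Optional[Tuple[str, str]]:
    """Pick one (language code, label) pair to show for a query, or None."""
    # Assumed hierarchy: user languages first, then the native language.
    preferred = list(user_langs)
    if native_lang and native_lang not in preferred:
        preferred.append(native_lang)

    def lang_rank(lang: str) -> int:
        return preferred.index(lang) if lang in preferred else len(preferred)

    candidates = []
    for lang, text in labels.items():
        rank = match_rank(text, query)
        if rank < 2:
            # Sort key: match quality, then language preference, then code.
            candidates.append((rank, lang_rank(lang), lang, text))
    if not candidates:
        return None
    _, _, lang, text = min(candidates)
    return lang, text


labels = {"en": "Moscow", "ru": "Москва", "uk": "Москва", "xal": "Москва балһсн"}
print(pick_matching_label(labels, "Москва", ["en", "de"], native_lang="ru"))
# prints ('ru', 'Москва')
```

Without a usable `native_lang`, the Russian and Ukrainian labels tie, and the
sketch returns Russian only because "ru" sorts before "uk".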


>
> 2. What we do if the match is on alias? Do we display matching alias,
> original label or both? The question above also applies if the match is
> on other language alias.
>

I'd want to see both, maybe as "West Germany (*FRG*)" if I search for
FRG. Hey, the autocompletion suggester does that already!
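
A minimal sketch of that "label (alias)" display, using asterisks for the
highlight as in this email. The helper names `highlight` and `format_title`
are made up for illustration, and plain case-insensitive substring matching
stands in for whatever analysis the real search index applies.

```python
import re
from typing import List


def highlight(text: str, query: str) -> str:
    """Wrap each case-insensitive occurrence of query in asterisks."""
    if not query:
        return text
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return pattern.sub(lambda m: f"*{m.group(0)}*", text)


def format_title(label: str, aliases: List[str], query: str) -> str:
    """Show the label, plus the first matching alias if the label has no match."""
    if not query:
        return label
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    if pattern.search(label):
        return highlight(label, query)
    for alias in aliases:
        if pattern.search(alias):
            return f"{label} ({highlight(alias, query)})"
    return label


print(format_title("West Germany", ["FRG", "Federal Republic of Germany"], "FRG"))
# prints West Germany (*FRG*)
```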


> 3. It looks clear to me that words count is useless. Is byte count
> useful and does it need to be kept?
>
> 4. Do we want to display any other parameters of the entity? E.g. we
> have in the index: statement_count, sitelink_count, label_count,
> incoming_links, etc. Do we want to display any?
>

Statement count is the one that is most interesting to me, but I wonder if
anyone really uses any of these stats. Someone must, but I don't know their
use cases.


>
> 5. Display format for Wikidata and for other wikipedia sites is different:
> Wikipedia:
>
> Title
> Snippet
>
> Wikidata:
>
> Title: Description
>
> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
> the same line, separated by colon. Is there any reason for this
> difference? Do we want to go back to the common format?
>

I can see that "Title: Description" saves some vertical space, but I would
prefer the description to be on the next line.


>
> Also if you have any other things/ideas/comments about how fulltext
> search output for wikidata should be, please tell me.
>

Here is a rough mock-up of what results for a search on *Москва* could look
like. Since Moscow has Москва as an additional label in English, I'm not sure
I'd also want to see a line with "Russian: Москва", so I left it out and used
just the English alias for the city. I also got tired of counting statements
on the city, so I just made something up.

Moscow (*Москва*) (Q649) <https://www.wikidata.org/wiki/Q649>
capital city and the largest city of Russia; separate federal subject of
Russia
386 KB (537 statements) - 08:33, 15 October 2017

Moskva River (Q175117) <https://www.wikidata.org/wiki/Q175117>
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

FC Moscow (Q392115) <https://www.wikidata.org/wiki/Q392115>
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Moscow 24 (Q1572348) <https://www.wikidata.org/wiki/Q1572348>
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Armenian Cemetery (Q685338) <https://www.wikidata.org/wiki/Q685338>
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017

... although pulling out the Russian specifically is probably not possible.

You've set yourself a complicated task!!



> I am sending this to wikidata-tech and discovery team list only for now,
> since it's still work in progress and half-baked, we could open this for
> wider discussion later if necessary.
>
> [1] https://phabricator.wikimedia.org/T178851
>
> Thanks,
> --
> Stas Malyshev
> [email protected]
>
>
>
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
_______________________________________________
Wikidata-tech mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
