Re: [Wikidata-tech] Wikidata fulltext search results output

2017-10-25 Thread Stas Malyshev
Hi!

> while you are at it, some things would be very useful to be search-able
> (maybe some are already by now):
> * "primary" (not references/qualifiers) years, for birth/death/floruit etc.
> * "primary" string/monolingual values (title, taxon name, etc.)
> * "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe
> only add numerical IDs if 5+ digits?)

We have the code to index statements already, and we're already indexing
P31 and P279. We could index more properties. However, we don't yet have
syntax or any other way to actually use those in search, except for
boosting (see https://gerrit.wikimedia.org/r/#/c/384632/).
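
To make the statement indexing above a bit more concrete, here is a rough Python sketch. It is not the actual CirrusSearch or Wikibase code: the function name `statement_keywords` and the flat `PID=QID` string form are assumptions made for illustration. The input follows the standard Wikidata entity JSON layout (`claims`, `mainsnak`, `datavalue`), and only main values are read, never qualifiers or references.

```python
def statement_keywords(entity, properties=("P31", "P279")):
    """Return sorted 'PID=QID' strings for the main values of the chosen properties.

    Hypothetical helper: the keyword format is an assumption, not the real index schema.
    """
    keywords = set()
    for pid in properties:
        for statement in entity.get("claims", {}).get(pid, []):
            snak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue, so skip them.
            if snak.get("snaktype") != "value":
                continue
            datavalue = snak.get("datavalue", {})
            # Only item-valued statements are turned into keywords here.
            if datavalue.get("type") == "wikibase-entityid":
                keywords.add(f"{pid}={datavalue['value']['id']}")
    return sorted(keywords)


# Toy entity, trimmed to the fields the sketch reads.
sample_entity = {
    "id": "Q42",
    "claims": {
        "P31": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"type": "wikibase-entityid",
                                        "value": {"id": "Q5"}}}},
            {"mainsnak": {"snaktype": "somevalue"}},
        ],
        # Not in the default property list, so it is ignored.
        "P569": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"type": "time",
                                        "value": {"time": "+1952-03-11T00:00:00Z"}}}},
        ],
    },
}

print(statement_keywords(sample_entity))
```

Adding a property would then just mean extending `properties`, which is why batching nominations before a reindex makes sense.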

We're looking at which properties to add (nominations welcome, probably
in the form of a phab ticket?). Since adding them requires a full reindex
of Wikidata (a couple of days), we probably don't want to add them one by
one but rather collect a set and then do it in one hit.

We also do not have syntax for searching (as in match, instead of boost)
by statement values, but it should not be hard - we just need to design
proper syntax and implement it (syntaxes are now pluggable, so should
not be too big of a problem).
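
Purely as a thought experiment on what such a syntax could look like: the Python sketch below splits a query into free text and statement filters, then checks the filters against a document's statement keywords. The `P31=Q5` token form and the names `parse_query` and `matches` are invented for illustration; nothing here is an existing or agreed search syntax.

```python
import re

# Hypothetical token form: a property ID, "=", and an item ID, e.g. "P31=Q5".
STATEMENT_TOKEN = re.compile(r"^(P\d+)=(Q\d+)$")


def parse_query(query):
    """Split a query string into (free_text, [(property_id, item_id), ...])."""
    filters, words = [], []
    for token in query.split():
        found = STATEMENT_TOKEN.match(token)
        if found:
            filters.append((found.group(1), found.group(2)))
        else:
            words.append(token)
    return " ".join(words), filters


def matches(doc_keywords, filters):
    """True if every filter is present in the document's 'PID=QID' keywords."""
    return all(f"{pid}={qid}" in doc_keywords for pid, qid in filters)


text, filters = parse_query("adams P31=Q5")
print(text, filters, matches({"P31=Q5", "P279=Q1"}, filters))
```

The point of the split is that the filters decide whether a document matches at all, while the remaining free text is still scored as usual.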

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [discovery-private] Wikidata fulltext search results output

2017-10-25 Thread Trey Jones
Interesting questions... my comments are inline.

On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev wrote:

> Hi!
>
> As I am working on improving Wikidata fulltext search[1], I'd like to
> talk about the search results page. Right now the search results page for
> Wikidata is less than ideal; here are the issues I see with it:
>
> - No match highlighting
>

I think match highlighting would be nice, but I know it can be tricky in
the edge cases.


> - Meaningless data, like word count (anybody cares to guess what it is
> counting? Anybody ever used it?) and byte count (more useful than word
> count but not by much)
>

I don't know who is interested in that, so I don't have a strong opinion.


> - Obviously, search quality is not super high, but that should be
> improved with proper description indexing
>
> While working on improving the situation, I would like to solicit
> opinions on a set of questions about what the search results page
> should look like. Namely:
>
> 1. If the match is made on label/description that does not match current
> display language, we could opt for:
> a) Displaying the description that matched, highlighted. Optionally
> maybe display the language of the match (in display language?)
> b) Displaying the description in display language, un-highlighted.
> Which option is preferable?
>

I would definitely like to see the label that matched. Even if you don't
know the language, seeing a partial match vs a full match is informative.
If I search for *Москва,* and I get back "Moscow" and "Armenian Cemetery" I
don't know what's what. Seeing that Moscow is "Russian: *Москва*" and
Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me
immediately that Moscow is probably a better match, even if I don't know
any Russian or Cyrillic.

There's a problem, though, which may be why this hasn't been done: *which*
label do you match? For Armenian Cemetery, both Russian and Ukrainian have
"Москва" in the label. For Moscow, there are 18 labels that are "Москва",
another one that is a partial match (Москва балһсн), another that's a folded
match (Мӧсква), and three more that have exact matches in their additional
labels (including English). Unless you can define a hierarchy of languages
(possibly including user languages and the "native" language of an entry),
it's going to be hard to pick one. If I'd searched for *Moskva* and didn't
have English as a user language, it'd be impossible to choose one of the 32
possible languages that are exact matches on the main label. *Moskwa* also
doesn't match any of my user languages, or Russian, but does match a bunch
of other languages. How to choose?

Any names will have similar problems. "Jacek Moskwa" is the same in all 12
languages with a label. His descriptions say he's Polish, so I guess Polish
is the right answer, but I don't think there's any way to know that.

So, ideally, *I'd* like the name of the language that had a label match,
shown in my display language, with a highlight of the matching bit in the
description from the matched language, but I'm not sure there's a way to get
there. Picking the first one alphabetically that matches will give weird
results.
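
One possible shape for such a language hierarchy, as a rough Python sketch: prefer exact matches over partial ones, then the user's languages in order, then the entry's "native" language, and only then fall back to alphabetical order of the language code. The function `pick_matching_label`, its parameters, and the sample language codes are assumptions for illustration; how a "native" language would actually be determined is left open.

```python
def pick_matching_label(labels, query, user_languages, native_language=None):
    """Pick which matching label to show.

    labels: dict mapping language code -> label text.
    Returns (language_code, label) or None if no label contains the query.
    """
    needle = query.casefold()
    matching = {lang: text for lang, text in labels.items()
                if needle in text.casefold()}
    if not matching:
        return None

    # Preference order: user languages first, then the entry's native language.
    preferred = list(user_languages)
    if native_language is not None:
        preferred.append(native_language)

    def rank(lang):
        exact = 0 if matching[lang].casefold() == needle else 1
        position = preferred.index(lang) if lang in preferred else len(preferred)
        # The language code itself is the last, admittedly arbitrary, tie-breaker.
        return (exact, position, lang)

    best = min(matching, key=rank)
    return best, matching[best]


# Toy subset of labels; the real entry has many more.
moscow_labels = {
    "en": "Moscow",
    "ru": "Москва",
    "uk": "Москва",
    "xal": "Москва балһсн",
}

print(pick_matching_label(moscow_labels, "Москва", ["en", "de"], native_language="ru"))
```

Without a `native_language`, the Russian and Ukrainian labels tie and the choice falls to the alphabetical tie-breaker, which is exactly the weak spot described above.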


>
> 2. What do we do if the match is on an alias? Do we display the matching
> alias, the original label, or both? The question above also applies if the
> match is on an alias in another language.
>

I'd want to see both, maybe as "West Germany (*FRG*)" if I search for
FRG. Hey, the autocompletion suggester does that already!


> 3. It looks clear to me that word count is useless. Is byte count
> useful and does it need to be kept?
>
> 4. Do we want to display any other parameters of the entity? E.g. we
> have in the index: statement_count, sitelink_count, label_count,
> incoming_links, etc. Do we want to display any?
>

Statement count is the one that is most interesting to me, but I wonder if
anyone really uses any of these stats. Someone must, but I don't know their
use cases.


>
> 5. Display format for Wikidata and for other Wikipedia sites is different:
> Wikipedia:
>
> Title
> Snippet
>
> Wikidata:
>
> Title: Description
>
> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
> the same line, separated by colon. Is there any reason for this
> difference? Do we want to go back to the common format?
>

I can see that "Title: Description" saves some vertical space, but I would
prefer the description to be on the next line.


>
> Also if you have any other things/ideas/comments about how fulltext
> search output for wikidata should be, please tell me.
>

Since Moscow has Москва as an additional label in English, I'm not sure if
I'd also want to see a line with "Russian: Москва", too, so I left it out
and used just the English alias for the city. I also got tired of counting
statements on the city, so I just made something up.

Moscow (*Москва*) (Q649) 
capital city and the largest city of Russia; separate federal subject of
Russia
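
A small Python sketch of how a result in that shape could be assembled, with the matched alias in parentheses after the label and the description on its own line. The function `render_result` and its parameters are made up for illustration; the `*...*` markers stand in for whatever highlighting the real page would use.

```python
def render_result(entity_id, label, description, matched_alias=None):
    """Build a two-line plain-text result: title line, then description."""
    title = label
    # Show the alias only when it adds information beyond the label itself.
    if matched_alias and matched_alias != label:
        title += f" (*{matched_alias}*)"
    return f"{title} ({entity_id})\n{description}"


print(render_result(
    "Q649",
    "Moscow",
    "capital city and the largest city of Russia; separate federal subject of Russia",
    matched_alias="Москва",
))
```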

Re: [Wikidata-tech] Wikidata fulltext search results output

2017-10-25 Thread Magnus Manske
Hi Stas,

while you are at it, some things would be very useful to be search-able
(maybe some are already by now):
* "primary" (not references/qualifiers) years, for birth/death/floruit etc.
* "primary" string/monolingual values (title, taxon name, etc.)
* "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe only
add numerical IDs if 5+ digits?)
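
The rule in the last bullet could be sketched in Python roughly like this. The `(kind, value)` input shape and the name `searchable_tokens` are assumptions for illustration, and the sample values are made up; only the "5+ digits" threshold comes from the suggestion above.

```python
def searchable_tokens(primary_values, min_id_digits=5):
    """Turn primary statement values into full-text tokens.

    primary_values: iterable of (kind, value), kind being "year", "string" or "id".
    Purely numeric IDs shorter than min_id_digits are dropped so that they
    cannot be confused with years.
    """
    tokens = []
    for kind, value in primary_values:
        if kind == "year":
            tokens.append(str(value))
        elif kind == "string":
            tokens.append(value)
        elif kind == "id":
            if not value.isdigit() or len(value) >= min_id_digits:
                tokens.append(value)
    return tokens


# Made-up sample values, not real identifiers.
sample_values = [
    ("year", 1952),
    ("string", "Homo sapiens"),
    ("id", "123456789"),  # long numeric ID: kept
    ("id", "1984"),       # short numeric ID: dropped, looks like a year
    ("id", "n79023811"),  # non-numeric ID: kept
]

print(searchable_tokens(sample_values))
```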

Cheers,
Magnus

On Wed, Oct 25, 2017 at 1:50 AM Stas Malyshev wrote:

> Hi!
>
> As I am working on improving Wikidata fulltext search[1], I'd like to
> talk about the search results page. Right now the search results page for
> Wikidata is less than ideal; here are the issues I see with it:
>
> - No match highlighting
> - Meaningless data, like word count (anybody cares to guess what it is
> counting? Anybody ever used it?) and byte count (more useful than word
> count but not by much)
> - Obviously, search quality is not super high, but that should be
> improved with proper description indexing
>
> While working on improving the situation, I would like to solicit
> opinions on a set of questions about what the search results page
> should look like. Namely:
>
> 1. If the match is made on label/description that does not match current
> display language, we could opt for:
> a) Displaying the description that matched, highlighted. Optionally
> maybe display the language of the match (in display language?)
> b) Displaying the description in display language, un-highlighted.
> Which option is preferable?
>
> 2. What do we do if the match is on an alias? Do we display the matching
> alias, the original label, or both? The question above also applies if the
> match is on an alias in another language.
>
> 3. It looks clear to me that word count is useless. Is byte count
> useful and does it need to be kept?
>
> 4. Do we want to display any other parameters of the entity? E.g. we
> have in the index: statement_count, sitelink_count, label_count,
> incoming_links, etc. Do we want to display any?
>
> 5. Display format for Wikidata and for other Wikipedia sites is different:
> Wikipedia:
>
> Title
> Snippet
>
> Wikidata:
>
> Title: Description
>
> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
> the same line, separated by colon. Is there any reason for this
> difference? Do we want to go back to the common format?
>
> Also if you have any other things/ideas/comments about how fulltext
> search output for wikidata should be, please tell me.
>
> I am sending this to the wikidata-tech and discovery team lists only for
> now, since it's still work in progress and half-baked; we could open this
> for wider discussion later if necessary.
>
> [1] https://phabricator.wikimedia.org/T178851
>
> Thanks,
> --
> Stas Malyshev
> smalys...@wikimedia.org