Hi,

I think we already index way more than P31 and P279.
For instance, we have approximately 102,301,706 distinct values in the
term lexicon for statement_keywords.
Sadly, I can't extract the list of unique PIDs used (we'd have to enable
fielddata on statement_keywords.property).
The top 1000 is:
https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing
I think this is because we not only index statements by PID but also by
data type.
So I think that the increase is smaller than what you anticipate.
What I'd try to avoid in general is indexing terms that appear in only one
doc, since they are pretty useless.
I think we should investigate what kind of data we may have here, and at
least for statement_keywords I would not index data that contain random
text (esp. natural language) since they are prone to be unique and
impossible to search.
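
To illustrate what I mean by terms that appear in only one doc, here is a minimal sketch on toy data. The function names and the sample "PID=value" strings are hypothetical and only for illustration; it does not touch the real index. It counts the document frequency of each term and lists those seen in a single document.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for terms in docs:
        # set() so a term repeated inside one document is counted once
        df.update(set(terms))
    return df


def singleton_terms(docs):
    """Return the terms that occur in exactly one document, sorted."""
    df = document_frequencies(docs)
    return sorted(term for term, count in df.items() if count == 1)


# Toy documents: each is a list of hypothetical statement keywords.
docs = [
    ["P31=Q5", "P279=Q215627"],
    ["P31=Q5", "P1476=Some unique title"],
    ["P31=Q5", "P279=Q215627", "P1476=Another unique title"],
]

print(singleton_terms(docs))
# ['P1476=Another unique title', 'P1476=Some unique title']
```

In this toy case only the free-text values end up as singletons, which is the kind of data I would leave out of statement_keywords.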


On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev <smalys...@wikimedia.org>
wrote:

> Hi!
>
> Today we are indexing in ElasticSearch almost all string properties
> (except a few) and select item properties (P31 and P279). We've been
> asked to extend this set and index more item properties
> (https://phabricator.wikimedia.org/T199884). We did not do it from the
> start because we did not want to add too much data to the index at once,
> and wanted to see how the index behaves. To evaluate what this change
> would mean, some statistics:
>
> All usage of item properties in statements is about 231 million uses
> (according to sqid tool database). Of those, about 50M uses are
> "instance of" which we are already indexing. Another 98M uses belong to
> two properties - published in (P1433) and cites (P2860). Leaving about
> 86M for the rest of the properties.
>
> So, if we index all the item properties except P2860 and P1433, we'll be
> a little more than doubling the amount of data we're storing for this
> field, which seems OK. But if we index those too, we'll be essentially
> quadrupling it - which may be OK too, but is a bigger jump and one that
> may potentially cause some issues.
>
> So, we have two questions:
> 1. Do we want to enable indexing for all item properties? Note that if
> you just want to find items with certain statement values, Wikidata
> Query Service matches this use case best. It's only in combination with
> actual fulltext search where on-wiki search is better.
>
> 2. Do we need to index P2860 and P1433 at all, and if so, would it be ok
> if we omit indexing for now?
>
> Would be glad to hear thoughts on the matter.
>
> Thanks,
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> _______________________________________________
> discovery-private mailing list
> discovery-priv...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/discovery-private
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
