Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch

2018-07-28 Thread David Causse
On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev wrote:

> Hi!
>
> > The top 1000
> > is:
> https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing
>
> This one is pretty interesting, how do I extract this data? It may be
> useful independently of what we're discussing here.
>

This can be extracted from Elasticsearch using aggregations. To obtain the
top 1000 terms that do not match P31= or P279= you can run this:
 curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d \
'{"aggs": {"item_usage": { "terms": { "field": "statement_keywords",
"exclude": "P(31|279)=.*", "size": 1000 }}}}' > top1k.json

To obtain an approximation of the cardinality (unique terms) of a field:

curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '{"aggs":
{"item_usage": { "cardinality": { "field": "statement_keywords" }}}}'

Note that I used the spare cluster to run these.
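
The two request bodies above are easy to get wrong by hand (unbalanced
braces in particular), so here is a minimal sketch that builds them as
Python dicts and serializes them with json.dumps. The helper names are
hypothetical and nothing is sent to a cluster; the field name, the exclude
pattern and the size are taken from the commands above.

```python
import json


def terms_agg_body(field, exclude_regex, size):
    """Body for a terms aggregation returning the `size` most frequent terms."""
    return {
        "aggs": {
            "item_usage": {
                "terms": {"field": field, "exclude": exclude_regex, "size": size}
            }
        }
    }


def cardinality_agg_body(field):
    """Body for a cardinality aggregation (approximate count of unique terms)."""
    return {"aggs": {"item_usage": {"cardinality": {"field": field}}}}


# Same parameters as the curl commands above.
top_terms = terms_agg_body("statement_keywords", "P(31|279)=.*", 1000)
unique_terms = cardinality_agg_body("statement_keywords")

# These strings can be passed to curl -d, or posted with any HTTP client.
print(json.dumps(top_terms))
print(json.dumps(unique_terms))
```

Either printed string can be used as the -d payload of a _search?size=0
request like the ones above.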
As for Property usage I just realized that we have the outgoing_link field,
which contains an array like:
"outgoing_link": ["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18",
"Property:P1889","Property:P248","Property:P2612","Property:P279",
"Property:P3221","Property:P3417","Property:P373","Property:P3827",
"Property:P577","Property:P646","Property:P910"],
We don't have doc values enabled for this one so we can't run aggregations
on it, but if the list of terms is known the counts could easily be
extracted by running X count queries, where X is the number of possible
properties.
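
A rough sketch of the "X count queries" idea, one request per property. It
only builds the requests and does not contact a cluster. It assumes that
outgoing_link can be matched with an exact term query and that the _count
endpoint is used; the helper name and the short property list are made up
for illustration.

```python
import json

INDEX = "wikidatawiki_content"


def property_count_request(pid):
    """Return (path, body) for a count of documents linking to Property:<pid>.

    Assumption: outgoing_link holds exact, unanalyzed values such as
    "Property:P18", so a term query matches them directly.
    """
    path = f"/{INDEX}/_count"
    body = {"query": {"term": {"outgoing_link": f"Property:{pid}"}}}
    return path, body


# Hypothetical sample; the real list would contain every known property ID.
pids = ["P18", "P279", "P373"]
requests_to_send = [property_count_request(pid) for pid in pids]

for path, body in requests_to_send:
    print("POST", path, json.dumps(body))
```

Each response would carry a single count, so the per-property usage table
is just the list of (pid, count) pairs collected from these requests.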
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch

2018-07-27 Thread Stas Malyshev
Hi!

> The top 1000
> is: 
> https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing

This one is pretty interesting, how do I extract this data? It may be
useful independently of what we're discussing here.
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch

2018-07-27 Thread Stas Malyshev
Hi!

> I think we already index way more than P31 and P279.

Oh yes, all the string properties.

> So I think that the increase is smaller than what you anticipate.
> What I'd try to avoid in general is indexing terms that have only doc
> since they are pretty useless.

For unique string properties, that would be a frequent occurrence. But I
am not sure why it's useless - won't it be a legit use case to look up
something by external ID?

> I think we should investigate what kind of data we may have here, and at
> least for statement_keywords I would not index data that contain random
> text (esp. natural language) since they are prone to be unique and
> impossible to search. 

Yes, we definitely should not do that. I tried to exclude such
properties but if you notice more of them, let's add them to exclusion
config.

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch

2018-07-27 Thread David Causse
On Fri, Jul 27, 2018 at 3:31 PM David Causse  wrote:

> What I'd try to avoid in general is indexing terms that have only doc
> since they are pretty useless.
>

I meant:  that have only *one* doc


Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch

2018-07-27 Thread David Causse
Hi,

I think we already index way more than P31 and P279.
For instance we have approximately 102,301,706 distinct values in the
term lexicon for statement_keywords.
Sadly I can't extract the list of unique PIDs used (we'd have to enable
fielddata on statement_keywords.property).
The top 1000 is:
https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing
I think this is because we not only index statements by PID but also by
data type.
So I think that the increase is smaller than what you anticipate.
What I'd try to avoid in general is indexing terms that have only doc since
they are pretty useless.
I think we should investigate what kind of data we may have here, and at
least for statement_keywords I would not index data that contain random
text (esp. natural language) since they are prone to be unique and
impossible to search.


On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev wrote:

> Hi!
>
> Today we are indexing in ElasticSearch almost all string properties
> (except a few) and select item properties (P31 and P279). We've been
> asked to extend this set and index more item properties
> (https://phabricator.wikimedia.org/T199884). We did not do it from the
> start because we did not want to add too much data to the index at once,
> and wanted to see how the index behaves. To evaluate what this change
> would mean, some statistics:
>
> All usage of item properties in statements is about 231 million uses
> (according to sqid tool database). Of those, about 50M uses are
> "instance of" which we are already indexing. Another 98M uses belong to
> two properties - published in (P1433) and cites (P2860). Leaving about
> 86M for the rest of the properties.
>
> So, if we index all the item properties except P2860 and P1433, we'll be
> a little more than doubling the amount of data we're storing for this
> field, which seems OK. But if we index those too, we'll be essentially
> quadrupling it - which may be OK too, but is a bigger jump and one that
> may potentially cause some issues.
>
> So, we have two questions:
> 1. Do we want to enable indexing for all item properties? Note that if
> you just want to find items with certain statement values, Wikidata
> Query Service matches this use case best. It's only in combination with
> actual fulltext search where on-wiki search is better.
>
> 2. Do we need to index P2860 and P1433 at all, and if so, would it be ok
> if we omit indexing for now?
>
> Would be glad to hear thoughts on the matter.
>
> Thanks,
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> discovery-private mailing list
> discovery-priv...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/discovery-private
>