Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev wrote:

> Hi!
>
> > The top 1000 is:
> > https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing
>
> This one is pretty interesting, how do I extract this data? It may be
> useful independently of what we're discussing here.

This can be extracted from Elasticsearch using aggregations. To obtain the top 1000 terms that do not match P31= or P279=, you can run this:

    curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '{"aggs": {"item_usage": {"terms": {"field": "statement_keywords", "exclude": "P(31|279)=.*", "size": 1000}}}}' > top1k.json

To obtain an approximation of the cardinality (unique terms) of a field:

    curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '{"aggs": {"item_usage": {"cardinality": {"field": "statement_keywords"}}}}'

Note that I used the spare cluster to run these.

As for property usage, I just realized that we have the outgoing_link field, which contains an array like:

    "outgoing_link": ["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18","Property:P1889","Property:P248","Property:P2612","Property:P279","Property:P3221","Property:P3417","Property:P373","Property:P3827","Property:P577","Property:P646","Property:P910"],

We don't have doc values enabled for this one, so we can't run aggregations on it, but if the list of terms is known, the counts could easily be extracted by running X count queries, where X is the number of possible properties.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
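The "X count queries" idea from the message above could be sketched roughly as follows, in Python. This only builds one request per property for the Elasticsearch `_count` endpoint and does not send anything. The helper name `build_count_requests` is hypothetical, and the sketch assumes that outgoing_link terms are indexed verbatim as "Property:Pnnn", which the thread does not confirm.

```python
import json


def build_count_requests(property_ids, host="localhost:9200",
                         index="wikidatawiki_content", field="outgoing_link"):
    """Return one (pid, url, json_body) tuple per property ID.

    Assumption: the field stores terms verbatim as "Property:<PID>",
    so an exact term query is enough to count matching documents.
    """
    url = f"http://{host}/{index}/_count"
    requests = []
    for pid in property_ids:
        # One term query per property; the _count API returns only a count.
        body = {"query": {"term": {field: f"Property:{pid}"}}}
        requests.append((pid, url, json.dumps(body)))
    return requests


# Example with two property IDs taken from the array quoted in the message.
reqs = build_count_requests(["P18", "P279"])
for pid, url, body in reqs:
    print(pid, url, body)
```

Each tuple can then be POSTed with whatever HTTP client is at hand; the number of requests equals the number of properties in the list.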
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi!

> The top 1000 is:
> https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing

This one is pretty interesting, how do I extract this data? It may be useful independently of what we're discussing here.

--
Stas Malyshev
smalys...@wikimedia.org
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi!

> I think we already index way more than P31 and P279.

Oh yes, all the string properties.

> So I think that the increase is smaller than what you anticipate.
> What I'd try to avoid in general is indexing terms that have only doc
> since they are pretty useless.

For unique string properties, that would be a frequent occurrence. But I am not sure why it's useless - won't it be a legit use case to look up something by external ID?

> I think we should investigate what kind of data we may have here, and at
> least for statement_keywords I would not index data that contain random
> text (esp. natural language) since they are prone to be unique and
> impossible to search.

Yes, we definitely should not do that. I tried to exclude such properties, but if you notice more of them, let's add them to the exclusion config.

--
Stas Malyshev
smalys...@wikimedia.org
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
On Fri, Jul 27, 2018 at 3:31 PM David Causse wrote:

> What I'd try to avoid in general is indexing terms that have only doc
> since they are pretty useless.

I meant: that have only *one* doc
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi,

I think we already index way more than P31 and P279. For instance, we have 102,301,706 (approximation) distinct values in the term lexicon for statement_keywords. Sadly, I can't extract the list of unique PIDs used (we'd have to enable fielddata on statement_keywords.property).

The top 1000 is:
https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing

I think this is because we index statements not only by PID but also by data type. So I think that the increase is smaller than what you anticipate.

What I'd try to avoid in general is indexing terms that have only doc since they are pretty useless.
I think we should investigate what kind of data we may have here, and at least for statement_keywords I would not index data that contain random text (esp. natural language) since they are prone to be unique and impossible to search.

On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev wrote:

> Hi!
>
> Today we are indexing in ElasticSearch almost all string properties
> (except a few) and select item properties (P31 and P279). We've been
> asked to extend this set and index more item properties
> (https://phabricator.wikimedia.org/T199884). We did not do it from the
> start because we did not want to add too much data to the index at once,
> and wanted to see how the index behaves. To evaluate what this change
> would mean, some statistics:
>
> All usage of item properties in statements is about 231 million uses
> (according to sqid tool database). Of those, about 50M uses are
> "instance of" which we are already indexing. Another 98M uses belong to
> two properties - published in (P1433) and cites (P2860). Leaving about
> 86M for the rest of the properties.
>
> So, if we index all the item properties except P2860 and P1433, we'll be
> a little more than doubling the amount of data we're storing for this
> field, which seems OK. But if we index those too, we'll be essentially
> quadrupling it - which may be OK too, but is a bigger jump and one that
> may potentially cause some issues.
>
> So, we have two questions:
> 1. Do we want to enable indexing for all item properties? Note that if
> you just want to find items with certain statement values, Wikidata
> Query Service matches this use case best. It's only in combination with
> actual fulltext search where on-wiki search is better.
>
> 2. Do we need to index P2860 and P1433 at all, and if so, would it be ok
> if we omit indexing for now?
>
> Would be glad to hear thoughts on the matter.
>
> Thanks,
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> discovery-private mailing list
> discovery-priv...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/discovery-private
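Regarding the list of PIDs that could not be extracted directly: a partial view can be derived from the output of a terms aggregation on statement_keywords by grouping the returned terms by their property prefix. The Python sketch below assumes terms have the form "<PID>=<value>" (suggested by the `P(31|279)=.*` pattern used elsewhere in this thread) and that the aggregation is named `item_usage`. The function name `usage_by_pid` is hypothetical and the sample response is made up for illustration, not real index data. It only covers the terms actually returned (for example the top 1000), not the full lexicon.

```python
from collections import Counter


def usage_by_pid(response, agg_name="item_usage"):
    """Sum doc_count per property ID over the buckets of a terms aggregation."""
    totals = Counter()
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # Assumed term format "<PID>=<value>"; split on the first "=" only,
        # because string values may themselves contain "=".
        pid, sep, _value = bucket["key"].partition("=")
        if sep:  # skip terms that do not follow the assumed format
            totals[pid] += bucket["doc_count"]
    return totals


# Made-up sample response, for illustration only (not real index data).
sample = {
    "aggregations": {
        "item_usage": {
            "buckets": [
                {"key": "P17=Q30", "doc_count": 5},
                {"key": "P17=Q145", "doc_count": 3},
                {"key": "P106=Q82955", "doc_count": 2},
            ]
        }
    }
}

pid_totals = usage_by_pid(sample)
for pid, count in pid_totals.most_common():
    print(pid, count)
```

With a real response saved to a file, `json.load` on that file would give the dictionary to pass in place of `sample`.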