Neunhoef added a comment. > Do you think sparse indexes are going to give us enough performance thousands > of these indexes with similar startup times to what we see now? Is that > something we can hack around by using fewer indexes and searching them in > more interesting ways? Say we just make one index for all the badges and for > every document with a badge we index an entry for its wiki, the badge name, > and its wiki_its badge name. That way I can query all entries with "enwiki" > badges. Or all entries with "featured" badges. Or all entries with > "enwiki_featured" badges. We might be able to play similar clever tricks > with the attributes.
First of all, a sparse index will only consume memory for those documents which have the indexed attribute set. This is what makes it feasible at all to have "thousands of indexes". Secondly, inserting a document in thousands of indexes can take considerable time (see the timings above), however, a sparse index has to do nothing for a document that has its attribute value unset. So also for the insertion time the sparse indexes are needed. I cannot really answer your question, in particular since it will depend on whether you have only "thousands of hash indexes" or even "thousands of skiplist indexes", as the above timings suggest. For 16M documents, the difference between O(1) and O(log(n)) complexity really matters (log(16M) is about 24, after all...). Furthermore, the actual sparsity of the attribute values for your indexes will matter. Therefore, playing clever tricks to reduce the amount of indexes like you describe is definitely a good idea. A database engineer (of any flavour), should at least be a tiny bit scared when he reads "thousands of indexes", because no DB engine I know of is really happy about this prospect. Cassandra for example will, as far as I know, duplicate the data many (thousands of?) times to offer this type of indexing... Furthermore, we have not yet talked about edges. How large is the data about your 100M edges? Do the edges carry substantial amounts of data themselves? Is there a sample of this data available online anywhere? Do you need to index the edges in any way? Please keep in mind that the edge collection will need at least the "edge-index" of its own... Finally, for an informed decision about the database engine one would have to know what kind of queries will hit the database later in production. In particular for graph-like queries and queries mixing graph- with index lookups and possibly joins, one has to look carefully to see how they would perform, in particular with sharding. Do you have any information about the needed queries for your use case? TASK DETAIL https://phabricator.wikimedia.org/T88549 REPLY HANDLER ACTIONS Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign <username>. EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Smalyshev, Neunhoef Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, Wikidata-bugs, aude, GWicke, daniel _______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
