Manybubbles added a comment.

In https://phabricator.wikimedia.org/T88549#1021733, @Neunhoef wrote:

> I cannot really answer your question, in particular since it will depend on
> whether you have only "thousands of hash indexes" or even "thousands of
> skiplist indexes", as the above timings suggest. For 16M documents, the
> difference between O(1) and O(log(n)) complexity really matters (log(16M) is
> about 24, after all...). Furthermore, the actual sparsity of the attribute
> values for your indexes will matter.

We'd need the indexes that can do range queries - skiplist I presume. Reality dictates sparsity here - we'll be pretty sparse. Properties that only make sense on people <https://www.wikidata.org/wiki/Q23> will rarely be on abstract concepts <https://www.wikidata.org/wiki/Q11471>. Nothing prevents it from happening from time to time, but it should be rare.

> Therefore, playing clever tricks to reduce the amount of indexes like you
> describe is definitely a good idea. A database engineer (of any flavour),
> should at least be a tiny bit scared when he reads "thousands of indexes",
> because no DB engine I know of is really happy about this prospect. Cassandra
> for example will, as far as I know, duplicate the data many (thousands of?)
> times to offer this type of indexing...

Lucene handles maintaining lots of indexes quite well. You can't query thousands of indexes at a time (you have to play tricks on that end) but you can maintain thousands of indexes - especially if most documents don't contain the fields.

> Furthermore, we have not yet talked about edges. How large is the data about
> your 100M edges? Do the edges carry substantial amounts of data themselves?
> Is there a sample of this data available online anywhere? Do you need to
> index the edges in any way? Please keep in mind that the edge collection will
> need at least the "edge-index" of its own...

I can't think offhand of any indexes we'll need on edges, but Stas probably knows a few.

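To make the range-query and sparsity points above a bit more concrete, here is a minimal Python sketch. It is only an illustration of an ordered, sparse index, not how ArangoDB's skiplist or Lucene actually work. The class name `SparseSortedIndex`, the `birth_year` field, and the ids `Qa` and `Qb` are made up for the example.

```python
import bisect
import math

# log2(16M) is about 24: roughly the number of comparison steps an ordered
# (skiplist- or tree-like) index needs per lookup at that size.
print(round(math.log2(16_000_000)))


class SparseSortedIndex:
    """Toy ordered index over one attribute; skips documents lacking it."""

    def __init__(self, docs, attr):
        # Sparse: only documents that actually carry the attribute are indexed.
        self._entries = sorted(
            (doc[attr], doc_id) for doc_id, doc in docs.items() if attr in doc
        )
        self._keys = [key for key, _ in self._entries]

    def __len__(self):
        return len(self._entries)

    def range(self, low, high):
        """Return ids of documents with low <= value < high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [doc_id for _, doc_id in self._entries[start:end]]


# Hypothetical sample data: Qa and Qb are invented ids.
docs = {
    "Q23": {"birth_year": 1732},   # a person
    "Q11471": {},                  # an abstract concept, no birth year
    "Qa": {"birth_year": 1879},
    "Qb": {"birth_year": 1950},
}

index = SparseSortedIndex(docs, "birth_year")
print(len(index))                # entries indexed, fewer than len(docs)
print(index.range(1700, 1880))   # ids with a birth year in [1700, 1880)
```

A hash index would answer "birth_year == 1732" in O(1) but could not serve the range lookup at all, which is why the ordered variant is the one that matters here.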
> Finally, for an informed decision about the database engine one would have to
> know what kind of queries will hit the database later in production. In
> particular for graph-like queries and queries mixing graph- with index
> lookups and possibly joins, one has to look carefully to see how they would
> perform, in particular with sharding. Do you have any information about the
> needed queries for your use case?

Lots of stuff. Lots of graph traversal stuff.

"List the 10 cities with the largest population that have female mayors and are in Europe" <-- currently cities are actually listed as being in counties or regions, so we'd either have to flatten that hierarchy or traverse it. We'll still have to traverse to the mayor and check the mayor's gender.
"Find me all of the humans that were born before 1880 and don't have a date of death"
"Find me all of the humans whose father doesn't have that human listed as a child"
"The same for the mother"
"Return the family tree of George Washington (say we know his id, good old Q23)"
"How many humans (instanceOf Q1) do we have data for?"

TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev, Manybubbles
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, Wikidata-bugs, aude, GWicke, daniel
