Smalyshev added a comment.

The thing that worries me the most is the non-persistent indexes. Note that the real data size would be not 3M nodes but about 20M nodes //and// over 100M edges, all of which need to be indexed. Or, if we convert edges to nodes for indexing, that'd be about 150M nodes and a comparable number of edges, with indexing mostly happening on nodes. Which means, if we take about 3-4K per node, the 20M nodes alone need about 60-80G of memory, not counting the edges. So it would probably run on a 64G+ server, but you can pretty much forget about running it on a desktop/non-server machine. Even then, the question is: what if our data size doubles?
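As a quick sanity check on the memory figures above, here is a minimal sketch of the arithmetic. It assumes decimal units (1K = 1000 bytes, 1G = 10^9 bytes) and purely linear scaling; the function name `index_memory_gb` is made up for illustration, and the inputs are just the rough guesses from this comment, not measurements.

```python
# Back-of-envelope memory estimate for the node indexes.
# All inputs are rough guesses from the comment, not measured values.

NODES = 20_000_000          # estimated real node count
BYTES_PER_NODE_LOW = 3_000  # "3K" per node, assuming decimal units
BYTES_PER_NODE_HIGH = 4_000 # "4K" per node, assuming decimal units


def index_memory_gb(nodes: int, bytes_per_node: int) -> float:
    """Memory in decimal gigabytes, assuming linear scaling and ignoring edges."""
    return nodes * bytes_per_node / 1e9


if __name__ == "__main__":
    low = index_memory_gb(NODES, BYTES_PER_NODE_LOW)
    high = index_memory_gb(NODES, BYTES_PER_NODE_HIGH)
    print(f"current data: {low:.0f}-{high:.0f} GB")

    # The "what if our data size doubles?" case.
    low2 = index_memory_gb(2 * NODES, BYTES_PER_NODE_LOW)
    high2 = index_memory_gb(2 * NODES, BYTES_PER_NODE_HIGH)
    print(f"doubled data: {low2:.0f}-{high2:.0f} GB")
```

Under these assumptions, doubling the data lands well beyond a single 64G machine, which is the concern raised above.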
Now, on to the index times: if we take the optimistic time of 3s per 3M nodes, then for a full data index we'd have about 30s. If we have about 2000 properties to index, that brings us to about 16 hours of startup time (2000 x 30s = 60,000s), which doesn't sound good. Of course, that is assuming the times add up linearly. If we take skiplist index times, which may be necessary for some of the properties, as they are numbers/quantities/dates, we'd have 200 to 2000s per index, assuming linear scaling, which for thousands of indexed properties looks like it would take days to finish. Again, this is all under the assumption that all the numbers behave in a linear fashion, for which I have no empirical substantiation. This is why I am worried about non-persistent indexes.

If we want to do the scale test, we could find a 64G server and try to load a synthetic test with nodes, edges and at least the roughly estimated number of indexes. It'd probably require writing a new tool for transforming the data, as the current one uses Tinkerpop3 and probably won't work in this case, but it's probably possible to do in a couple of days.

TASK DETAIL
https://phabricator.wikimedia.org/T88549
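For reference, the startup-time estimates in this comment can be reproduced with a short sketch. It assumes that index build times simply add up linearly across properties, which is exactly the unverified assumption flagged in the comment; the helper name `startup_seconds` is hypothetical, and the per-index times are the rough figures quoted, not benchmarks.

```python
# Back-of-envelope startup time if every index is rebuilt on start.
# Assumes build times add up linearly across indexes (unverified).

INDEXED_PROPERTIES = 2000      # rough number of properties to index
OPTIMISTIC_SECONDS = 30        # optimistic per-index time on the full data
SKIPLIST_SECONDS_LOW = 200     # skiplist per-index time, low estimate
SKIPLIST_SECONDS_HIGH = 2000   # skiplist per-index time, high estimate


def startup_seconds(per_index_seconds: float, index_count: int) -> float:
    """Total rebuild time, assuming indexes are built one after another."""
    return per_index_seconds * index_count


if __name__ == "__main__":
    optimistic = startup_seconds(OPTIMISTIC_SECONDS, INDEXED_PROPERTIES)
    print(f"optimistic: {optimistic / 3600:.1f} hours")

    low = startup_seconds(SKIPLIST_SECONDS_LOW, INDEXED_PROPERTIES)
    high = startup_seconds(SKIPLIST_SECONDS_HIGH, INDEXED_PROPERTIES)
    print(f"skiplist:   {low / 86400:.1f} to {high / 86400:.1f} days")
```

If all 2000 indexes were skiplist indexes, this works out to somewhere between several days and several weeks; a real mix of index types would fall between the optimistic and skiplist cases.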
