Smalyshev added a comment.

What worries me most is the non-persistent indexes. Note that the real data 
size would be not 3M nodes but about 20M nodes //and// over 100M edges, all of 
which need to be indexed. Or, if we convert edges to nodes for indexing, that 
would be about 150M nodes and a comparable number of edges, with indexing 
happening mostly on nodes. So if we assume about 3-4K per node, 20M nodes come 
to about 60-80G of memory, not counting the edges. It would probably run on a 
64G+ server, but you can pretty much forget about running it on a 
desktop/non-server machine. Even then, the question is: what if our data size 
doubles?
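
For reference, the memory arithmetic above can be written out as a small 
sketch. The inputs are only the rough figures quoted in this comment (20M 
nodes, 3-4K per node, taken here as decimal kilobytes); the names are made up 
for illustration and edges are ignored, as in the estimate itself.

```python
# Back-of-the-envelope memory estimate for in-memory node storage.
# All inputs are rough assumptions from the comment, not measurements.
NODES = 20_000_000            # estimated real node count
BYTES_PER_NODE_LOW = 3_000    # "3K" per node, decimal
BYTES_PER_NODE_HIGH = 4_000   # "4K" per node, decimal


def memory_gb(nodes, bytes_per_node):
    """Return the estimated memory footprint in decimal gigabytes."""
    return nodes * bytes_per_node / 1e9


low_gb = memory_gb(NODES, BYTES_PER_NODE_LOW)
high_gb = memory_gb(NODES, BYTES_PER_NODE_HIGH)

# The same estimate if the data size doubles.
doubled_low_gb = memory_gb(2 * NODES, BYTES_PER_NODE_LOW)
doubled_high_gb = memory_gb(2 * NODES, BYTES_PER_NODE_HIGH)

print(f"current: {low_gb:.0f}-{high_gb:.0f} GB")
print(f"doubled: {doubled_low_gb:.0f}-{doubled_high_gb:.0f} GB")
```

With these inputs, doubling the data moves the estimate from 60-80G to 
120-160G, which is well past a 64G machine.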

As for index build times: if we take an optimistic 3s per 3M nodes, then 
indexing the full data set would take about 30s per index. If we have about 
2000 properties to index, that brings us to about 16 hours of startup time, 
which doesn't sound good. Of course, that assumes the times add up linearly. 
If we take skiplist index times, which may be necessary for some of the 
properties since they are numbers/quantities/dates, we'd have 200 to 2000s per 
index, again assuming linear scaling, which for thousands of indexed properties 
looks like it would take days to finish. All of this assumes the numbers scale 
linearly, for which I have no empirical evidence. This is why I am worried 
about non-persistent indexes.
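
The startup-time arithmetic can be sketched the same way. The per-index times 
(30s optimistic, 200-2000s for skiplist) and the count of 2000 indexed 
properties are taken straight from the estimates above; the linear summation 
is exactly the unverified assumption already mentioned, and the names are 
illustrative only.

```python
# Rough startup-time estimate for rebuilding non-persistent indexes,
# assuming build times simply add up across indexes (unverified).
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

INDEX_COUNT = 2000             # estimated number of indexed properties
OPTIMISTIC_SECONDS = 30        # optimistic per-index time on the full data
SKIPLIST_SECONDS_LOW = 200     # lower skiplist estimate per index
SKIPLIST_SECONDS_HIGH = 2000   # upper skiplist estimate per index


def total_seconds(seconds_per_index, index_count):
    """Total build time if every index takes the same time, run serially."""
    return seconds_per_index * index_count


optimistic_hours = total_seconds(OPTIMISTIC_SECONDS, INDEX_COUNT) / SECONDS_PER_HOUR
skiplist_days_low = total_seconds(SKIPLIST_SECONDS_LOW, INDEX_COUNT) / SECONDS_PER_DAY
skiplist_days_high = total_seconds(SKIPLIST_SECONDS_HIGH, INDEX_COUNT) / SECONDS_PER_DAY

print(f"optimistic: {optimistic_hours:.1f} hours")
print(f"skiplist: {skiplist_days_low:.1f} to {skiplist_days_high:.1f} days")
```

In practice only some of the properties would need skiplist indexes, so the 
skiplist figures are an upper bound for the case where all 2000 do.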

If we want to run a scale test, we could find a 64G server and try to load a 
synthetic data set with nodes, edges and at least the roughly estimated number 
of indexes. It would probably require writing a new tool for transforming the 
data, as the current one uses Tinkerpop3 and probably won't work in this case, 
but that could probably be done in a couple of days.


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel


