[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Fceller Sat, 07 Feb 2015 05:53:27 -0800

Fceller added a comment.

Maybe we can take a step back and ignore the ArangoDB specifics for the moment. 
I also organise NoSQL conferences and consult on NoSQL in general.


Still, I must admit that I'm not familiar with the internal data model of 
Wikipedia. I've checked with George Washington (Q23) that he has a lot of 
properties associated with him. However, I fail to see how the traversals you 
mentioned are defined. For example, "Give me the list of countries sorted by 
population". What does the data model look like? Is "population" an attribute 
of "country"? Are all countries connected to a "special" node via an edge? Or 
are countries identified by a special property? If there is a special "world" 
node containing edges to all countries, then there is no need for indexes. If 
there is a "world" node connecting everything and you need to filter the edges, 
indexes might help.
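
To make the two modelling options concrete, here is a minimal sketch in plain 
Python. The item keys, the "type" and "population" fields and the "world" node 
are all made up for illustration; this is not the actual Wikidata model and 
not ArangoDB code.

```python
# Hypothetical toy data; keys and field names are assumptions for illustration.
items = {
    "A": {"type": "country", "population": 5},
    "B": {"type": "country", "population": 12},
    "C": {"type": "city", "population": 1},
}

# Option 1: a special "world" node with one edge per country.
edges = {"world": ["A", "B"]}


def countries_via_world_node(items, edges):
    """Follow the edges of the start node; no index is needed."""
    return sorted(edges["world"],
                  key=lambda key: items[key]["population"],
                  reverse=True)


# Option 2: countries are identified by a property, so an index on it helps.
def build_index(items, field):
    """Map each value of `field` to the keys of the items carrying it."""
    index = {}
    for key, doc in items.items():
        index.setdefault(doc[field], []).append(key)
    return index


def countries_via_index(items, type_index):
    """Look the countries up in the index instead of scanning all items."""
    return sorted(type_index.get("country", []),
                  key=lambda key: items[key]["population"],
                  reverse=True)


type_index = build_index(items, "type")
```

Both functions return the same list; they differ only in how the set of 
countries is found before sorting.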

In general, graph models are very useful if you have paths of different length 
occurring in your query. For example, find a descendant with a given property. 
On the other hand, if your path always has a fixed length, it will be much 
faster to use some sort of index. Graph queries are fast if you have a natural 
start node. If you have to find nodes with a given property, it is much better 
to use a document database (see your example "find a city with a female mayor"). 
Sometimes it is possible to combine both approaches. For example, find cities 
with female mayors and then do a traversal from these cities. That is what I 
call "multi-model": being able to switch between models within a query. It is 
different from multi-personality approaches, where you have a database engine 
that can be used as a document store or as a graph store - but not as both.
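
The combination described above can be sketched as follows in plain Python: a 
document-style lookup selects the start nodes, and a traversal of arbitrary 
depth follows from each of them. The data, the field names ("kind", 
"mayor_gender") and the helper functions are hypothetical and only illustrate 
the idea; they are not the ArangoDB API.

```python
from collections import deque

# Hypothetical documents and edges, made up for illustration.
docs = {
    "city1": {"kind": "city", "mayor_gender": "female"},
    "city2": {"kind": "city", "mayor_gender": "male"},
    "region1": {"kind": "region"},
    "country1": {"kind": "country"},
}
edges = {
    "city1": ["region1"],
    "city2": ["region1"],
    "region1": ["country1"],
}


def find_documents(docs, **conditions):
    """Document-store step: select keys whose fields match all conditions."""
    return [key for key, doc in docs.items()
            if all(doc.get(field) == value
                   for field, value in conditions.items())]


def traverse(edges, start):
    """Graph step: breadth-first traversal of any depth from `start`."""
    seen = {start}
    queue = deque([start])
    reached = []
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                reached.append(neighbour)
                queue.append(neighbour)
    return reached


def multi_model_query(docs, edges):
    """Use the property lookup to pick start nodes, then traverse from each."""
    starts = find_documents(docs, kind="city", mayor_gender="female")
    return {city: traverse(edges, city) for city in starts}
```

In a real system the first step would be served by an index rather than a 
scan; the point here is only that the result of the lookup feeds the traversal 
within one query.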

Having said that, I currently would not know which solution I would recommend 
to you. I'm sure I do not completely understand your data model and where graphs 
are useful and where they are a hindrance. The same is true for the hardware. 
On the one hand you want cheap hardware and spinning disks, preferably even on a 
single node; on the other hand your dataset might require a cluster setup. Some 
of these requirements could be fulfilled by ArangoDB today; for others we would 
need to improve things (like finishing the sparse indexes). On the other hand, 
you might be better off with something like Elasticsearch (if the graph searches 
are mostly fixed paths) or a combination.


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev, Fceller
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel


