[Wikidata-bugs] [Maniphest] [Updated] T88549: Investigate ArangoDB for Wikidata Query

Manybubbles Sat, 07 Feb 2015 06:29:32 -0800

Manybubbles added a comment.

In https://phabricator.wikimedia.org/T88549#1022562, @Fceller wrote:


> Maybe we can take a step back and ignore the ArangoDB specifics for the
> moment. I'm also organising NoSQL conferences and consulting on NoSQL in
> general.
>
> Still, I must admit that I'm not familiar with the internal data model of
> Wikipedia. I've checked with George Washington (Q23) that he has a lot of
> properties associated with him. However, I fail to see how the traversals you
> mentioned are defined. For example, "Give me the list of countries sorted by
> population". What does the data model look like? Is "population" an attribute
> of "country"? Are all countries connected to a "special" node via an edge? Or
> are countries identified by a special property? If there is a special "world"
> node containing edges to all countries, then there is no need for indexes. If
> there is a "world" node connecting everything and you need to filter the
> edges, indexes might help.


Sure.  Firstly, this is Wikidata, not Wikipedia.  From a software perspective
Wikipedia is "just" a fully tricked out MediaWiki instance - its data model
concerns itself with things like pages, revisions of pages, and templates.
It's a wiki, a tool for building pages.

Wikidata is different.  It's a tool for collaboratively editing structured
data.  At its core it's built on MediaWiki as well, and somewhere in its depths
it serializes the structured data to JSON blobs and saves them in the same
table that stores revisions on Wikipedia (different actual database, same
table).  That technical background is only really relevant in that it tells
you that the native storage isn't particularly queryable.

So this tool's job isn't to be the primary data store at all.  Its job is to
make Wikidata searchable as a knowledge graph.

The simplest way to find all countries (assuming you have a graph database) is
to start with country (Q6256) and then trace all incoming links of type
"instance of" (P31, https://phabricator.wikimedia.org/P31).
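
To make the traversal concrete, here is a minimal sketch in plain Python
rather than in any particular graph database.  The triple list and the helper
names (build_incoming_index, instances_of) are made up for illustration; only
the Q and P identifiers are real Wikidata ones.

```python
from collections import defaultdict

# Toy (subject, property, object) triples, hand-written for illustration.
TRIPLES = [
    ("Q30", "P31", "Q6256"),   # United States -> instance of -> country
    ("Q183", "P31", "Q6256"),  # Germany -> instance of -> country
    ("Q23", "P31", "Q5"),      # George Washington -> instance of -> human
]


def build_incoming_index(triples):
    """Map (object, property) to the list of subjects pointing at it."""
    incoming = defaultdict(list)
    for subject, prop, obj in triples:
        incoming[(obj, prop)].append(subject)
    return incoming


def instances_of(incoming, class_id):
    """Follow incoming "instance of" (P31) links back from class_id."""
    return sorted(incoming.get((class_id, "P31"), []))


incoming = build_incoming_index(TRIPLES)
countries = instances_of(incoming, "Q6256")
print(countries)
```

The point is only that the query starts at one known node (Q6256) and walks
edges backwards, so it needs an index on incoming edges and nothing else.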

> In general, graph models are very useful if you have paths of different
> lengths occurring in your query. For example, find a descendant with a given
> property. On the other hand, if your path always has a fixed length, it will
> be much faster to use some sort of index. Graph queries are fast if you
> have a natural start node. If you have to find nodes with a given property, it
> is much better to use document databases (see your example "find a city with a
> female mayor"). Sometimes it is possible to combine both approaches. For
> example, find cities with female mayors and then do a traversal from these
> cities. That is what I call "multi-model": being able to switch between
> models in a query. It is different from multi-personality approaches, where
> you have a database engine that can be used as a document store or as a
> graph store, but not as both.


I think of that as a matter of optimization.  I'm not sure I can justify
collapsing mayor gender into all cities, for example.  I can almost certainly
justify collapsing country into cities.  None of this is unique to a
particular technology.
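
As a rough sketch of what "collapsing country into cities" could mean: walk
each city's containment chain once and store the result, so that later
lookups are a fixed-length read instead of a traversal.  Everything below
(the node names, the PARENT edges, the helper functions) is hypothetical and
not taken from any real schema.

```python
# Hypothetical "located in" edges: child -> parent.
PARENT = {
    "city_a": "region_x",
    "region_x": "country_y",
    "city_b": "country_y",
}
# Hypothetical set of nodes known to be countries.
COUNTRIES = {"country_y"}


def resolve_country(node, parent, countries):
    """Walk up the containment chain until a country is found."""
    seen = set()  # guards against cycles in the edge data
    while node is not None and node not in seen:
        if node in countries:
            return node
        seen.add(node)
        node = parent.get(node)
    return None


def collapse_country(cities, parent, countries):
    """Precompute city -> country so queries no longer need a traversal."""
    return {city: resolve_country(city, parent, countries) for city in cities}


city_country = collapse_country(["city_a", "city_b"], PARENT, COUNTRIES)
print(city_country)
```

The cost is that the stored value has to be recomputed whenever an edge in
the chain changes, which is why it is only worth doing for relationships that
are queried a lot.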

> Having said that, I currently would not know which solution I would recommend
> to you. I'm sure I do not completely understand your data model and where
> graphs are useful and where they are a hindrance. The same is true for the
> hardware. On one hand you want cheap hardware and spinning disks, preferably
> even on a single node; on the other hand your dataset might require a cluster
> setup. Some of these requirements could be fulfilled by ArangoDB; for some we
> would need to improve stuff (like finishing the sparse indexes). On the other
> hand you might be better off with something like Elasticsearch (if the graph
> searches are mostly fixed paths) or a combination.


The thing has to be installed in multiple places - certainly in production
where we can spend money on RAM and SSDs and such.  But we'll also install in
labs, which is VMs on shared storage.  It's acceptable for labs to perform
worse.  It's not acceptable for the install to be impossible.  The same goes
for machines for researchers/tinkerers/bot authors.  Those can be slow so long
as they can use it at all.


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev, Manybubbles
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs