[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Manybubbles Fri, 06 Feb 2015 14:33:12 -0800

Manybubbles added a comment.

In https://phabricator.wikimedia.org/T88549#1021733, @Neunhoef wrote:


> I cannot really answer your question, in particular since it will depend on 
> whether you have only "thousands of hash indexes" or even "thousands of 
> skiplist indexes", as the above timings suggest. For 16M documents, the 
> difference between O(1) and O(log(n)) complexity really matters (log(16M) is 
> about 24, after all...). Furthermore, the actual sparsity of the attribute 
> values for your indexes will matter.


We'd need the indexes that can do range queries - skiplist I presume.

Reality dictates sparsity here - we'll be pretty sparse.  Properties that only 
make sense on people <https://www.wikidata.org/wiki/Q23> will rarely be on 
abstract concepts <https://www.wikidata.org/wiki/Q11471>.  There isn't anything 
preventing it from time to time, but it should be rare.

> Therefore, playing clever tricks to reduce the amount of indexes like you 
> describe is definitely a good idea. A database engineer (of any flavour), 
> should at least be a tiny bit scared when he reads "thousands of indexes", 
> because no DB engine I know of is really happy about this prospect. Cassandra 
> for example will, as far as I know, duplicate the data many (thousands of?) 
> times to offer this type of indexing...


Lucene handles maintaining lots of indexes quite well.  You can't query 
thousands of indexes at a time (you have to play tricks on that end) but you 
can maintain thousands of indexes - especially if most documents don't contain 
the fields.

> Furthermore, we have not yet talked about edges. How large is the data about 
> your 100M edges? Do the edges carry substantial amounts of data themselves? 
> Is there a sample of this data available online anywhere? Do you need to 
> index the edges in any way? Please keep in mind that the edge collection will 
> need at least the "edge-index" of its own...


I can't think offhand of any indexes we'll need on edges but Stas probably 
knows a few.

> Finally, for an informed decision about the database engine one would have to 
> know what kind of queries will hit the database later in production. In 
> particular for graph-like queries and queries mixing graph- with index 
> lookups and possibly joins, one has to look carefully to see how they would 
> perform, in particular with sharding. Do you have any information about the 
> needed queries for your use case?


Lots of stuff.  Lots of graph traversal stuff.

"List the 10 cities with the most population that have female mayors and are in 
Europe" <-- currently cities are actually listed as being in counties or 
regions so we'd either have to flatten that hierarchy or traverse.  We'll still 
have to traverse to the mayor and check its gender.
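
To make the traversal concrete, here is a minimal in-memory sketch of that 
query. Everything in it is an assumption for illustration: the place names, 
the LOCATED_IN / CITIES / GENDER dictionaries and the helper functions are 
hypothetical and are not Wikidata's data model or any database's API.

```python
from typing import Dict, List, Optional

# Hypothetical toy data: each place points at the place that contains it.
LOCATED_IN: Dict[str, Optional[str]] = {
    "city_a": "region_x",
    "region_x": "country_1",
    "country_1": "europe",
    "city_b": "country_2",
    "country_2": "asia",
    "city_c": "country_1",
    "europe": None,
    "asia": None,
}

# Hypothetical city records: a population and a link to the mayor.
CITIES: Dict[str, dict] = {
    "city_a": {"population": 500_000, "mayor": "mayor_1"},
    "city_b": {"population": 900_000, "mayor": "mayor_2"},
    "city_c": {"population": 700_000, "mayor": "mayor_3"},
}

# Hypothetical gender attribute stored on the mayor, not on the city.
GENDER: Dict[str, str] = {
    "mayor_1": "female",
    "mayor_2": "female",
    "mayor_3": "male",
}


def is_within(place: str, target: str) -> bool:
    """Walk up the containment chain until we hit target or run out."""
    seen = set()
    current: Optional[str] = place
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)  # guard against cycles in the hierarchy
        current = LOCATED_IN.get(current)
    return False


def top_cities(target: str, limit: int = 10) -> List[str]:
    """Cities inside target with a female mayor, largest population first."""
    matches = [
        name
        for name, city in CITIES.items()
        if GENDER.get(city["mayor"]) == "female" and is_within(name, target)
    ]
    matches.sort(key=lambda name: CITIES[name]["population"], reverse=True)
    return matches[:limit]
```

Flattening the hierarchy would replace the is_within walk with a single 
lookup; the hop from city to mayor to gender stays either way.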

"Find me all of the humans that were born before 1880 and don't have a date of 
death"
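
This one combines a range condition with a check for a missing attribute. A 
rough sketch over hypothetical records (the "born" and "died" field names and 
the sample data are made up for illustration):

```python
from typing import Dict, List

# Hypothetical records; a missing or None "died" means no date of death.
HUMANS: Dict[str, dict] = {
    "h1": {"born": 1850, "died": 1910},
    "h2": {"born": 1875, "died": None},
    "h3": {"born": 1950, "died": None},
    "h4": {"born": 1790},
}


def missing_death_date(humans: Dict[str, dict], before_year: int = 1880) -> List[str]:
    """Ids of humans born before before_year with no recorded date of death."""
    return sorted(
        hid
        for hid, human in humans.items()
        if human.get("born") is not None
        and human["born"] < before_year
        and human.get("died") is None
    )
```

The born-before part is the kind of condition that wants a range-capable 
index; the no-date-of-death part is a test for absence.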

"Find me all of the humans whose father doesn't have that human listed as a 
child" (and the same check for the mother)
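
As a sketch of what that consistency check involves, assuming hypothetical 
"father", "mother" and "children" fields on made-up records:

```python
from typing import Dict, List

# Hypothetical records: parent links on the child, child links on the parent.
PEOPLE: Dict[str, dict] = {
    "p1": {"father": "p2", "mother": "p3"},
    "p2": {"children": ["p1"]},
    "p3": {"children": []},
    "p4": {"father": "p2"},
}


def unreciprocated(people: Dict[str, dict], parent_field: str) -> List[str]:
    """Ids whose parent (father or mother) does not list them as a child."""
    result = []
    for pid, person in people.items():
        parent = person.get(parent_field)
        if parent is None:
            continue  # no such parent recorded, nothing to check
        children = people.get(parent, {}).get("children", [])
        if pid not in children:
            result.append(pid)
    return sorted(result)
```

Each result needs one hop from the human to the parent and a look back at the 
parent's child list, so it is a join or a one-step traversal per human.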

"Return the family tree of George Washington (say we know his id, good old Q23)"

"How many humans (instanceOf Q5) do we have data for?"

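The family-tree request above is a plain graph traversal from a known start 
node. A minimal breadth-first sketch over a hypothetical child-edge map (the 
node names are placeholders, not real genealogy data, and a real tree would 
also follow parent edges):

```python
from collections import deque
from typing import Dict, List

# Hypothetical child edges; the keys and values are placeholder ids.
CHILDREN: Dict[str, List[str]] = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": [],
}


def descendants_by_depth(start: str) -> Dict[str, int]:
    """Breadth-first walk over child edges, recording each node's depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in CHILDREN.get(node, []):
            if child not in depth:  # skip nodes we have already reached
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth
```
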

TASK DETAIL
  https://phabricator.wikimedia.org/T88549

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
<username>.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Manybubbles
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
