[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Neunhoef Fri, 06 Feb 2015 14:16:52 -0800

Neunhoef added a comment.

> Do you think sparse indexes are going to give us enough performance thousands 
> of these indexes with similar startup times to what we see now?  Is that 
> something we can hack around by using fewer indexes and searching them in 
> more interesting ways?  Say we just make one index for all the badges and for 
> every document with a badge we index an entry for its wiki, the badge name, 
> and its wiki_its badge name.  That way I can query all entries with "enwiki" 
> badges.  Or all entries with "featured" badges.  Or all entries with 
> "enwiki_featured" badges.  We might be able to play similar clever tricks 
> with the attributes.



First of all, a sparse index will only consume memory for those documents which 
have the indexed attribute set. This is what makes it feasible at all to have 
"thousands of indexes". Secondly, inserting a document in thousands of indexes 
can take considerable time (see the timings above), however, a sparse index has 
to do nothing for a document that has its attribute value unset. So also for 
the insertion time the sparse indexes are needed.

I cannot really answer your question, in particular since it will depend on 
whether you have only "thousands of hash indexes" or even "thousands of 
skiplist indexes", as the above timings suggest. For 16M documents, the 
difference between O(1) and O(log(n)) complexity really matters (log(16M) is 
about 24, after all...). Furthermore, the actual sparsity of the attribute 
values for your indexes will matter.

Therefore, playing clever tricks to reduce the amount of indexes like you 
describe is definitely a good idea. A database engineer (of any flavour), 
should at least be a tiny bit scared when he reads "thousands of indexes", 
because no DB engine I know of is really happy about this prospect. Cassandra 
for example will, as far as I know, duplicate the data many (thousands of?) 
times to offer this type of indexing...

Furthermore, we have not yet talked about edges. How large is the data about 
your 100M edges? Do the edges carry substantial amounts of data themselves? Is 
there a sample of this data available online anywhere? Do you need to index the 
edges in any way? Please keep in mind that the edge collection will need at 
least the "edge-index" of its own...

Finally, for an informed decision about the database engine one would have to 
know what kind of queries will hit the database later in production. In 
particular for graph-like queries and queries mixing graph- with index lookups 
and possibly joins, one has to look carefully to see how they would perform, in 
particular with sharding. Do you have any information about the needed queries 
for your use case?


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
<username>.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Neunhoef
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Reply via email to