[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Fceller Thu, 05 Feb 2015 08:30:06 -0800

Fceller added a subscriber: Fceller.
Fceller added a comment.

Hi, I'm the CTO of ArangoDB, so my comments are most certainly biased. I would 
still like to share our views on the issues raised, namely full-text indexes 
and TinkerPop Blueprints.


(1) We do not believe that TinkerPop (TP) is helpful in a sharded environment. 
Gremlin is a nice language, but it requires you to move a lot of data into the 
client. This works very well if you can embed the database and keep it in the 
same process space. As soon as you need to shard the data and spread it across 
many servers, you will move a lot of data between Gremlin and the DB servers. 
We therefore decided to create a JavaScript version of Gremlin that runs 
directly on the shards, minimising the amount of data moved. So it is indeed 
true that we did not add support for TP3, because we believe it will be of 
limited use.
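
To illustrate the data-movement argument in (1), here is a minimal Python 
sketch. It is a toy model, not ArangoDB or TinkerPop code: the SHARDS layout 
and both function names are made up for illustration. It counts all edges in a 
sharded graph twice, once by pulling every edge to the client and once by 
letting each shard count locally and return a single number.

```python
# Toy model (hypothetical): each shard maps a vertex to its outgoing edges.
SHARDS = [
    {"a": ["b", "c"], "b": ["d"]},
    {"c": ["d", "e"], "d": [], "e": ["a"]},
]

def count_edges_client_side(shards):
    """Pull every edge to the client, then count there."""
    moved = 0   # number of items that cross the network
    total = 0
    for shard in shards:
        for targets in shard.values():
            moved += len(targets)   # each edge is shipped to the client
            total += len(targets)
    return total, moved

def count_edges_on_shards(shards):
    """Each shard counts locally and returns one partial result."""
    partials = [sum(len(t) for t in shard.values()) for shard in shards]
    return sum(partials), len(partials)   # one value moved per shard

print(count_edges_client_side(SHARDS))
print(count_edges_on_shards(SHARDS))
```

Both calls return the same total, but the second moves one value per shard 
instead of one item per edge, which is the effect we are after when running 
the traversal on the shards.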

(2) Full-text indexes are not our main expertise. We think that search engines 
like ElasticSearch and Solr are much better at this, especially when it comes 
to stemming, different languages, and phonetic searches. There is an 
ElasticSearch plugin that lets you use ElasticSearch as the full-text search 
engine for ArangoDB. The full-text index is indeed very slow to build. We want 
to speed up the process and hope to improve it over time (see also the next 
point). I assume that you are using a full-text index in your example, right?

(3) We decided to keep the indexes only in memory. The reasons are as follows.

There are various possibilities:

(1) use memory only indexes (this is currently implemented in ArangoDB)
(2) use disk-based indexes (this is currently implemented in CouchDB)
(3) use disk-backed indexes with a file-system-like clean flag
(4) other solutions, such as keeping only parts in memory or using memory as a 
cache, are also possible

There is a trade-off:

Runtime behaviour:

(1) this is the fastest solution
(2) this is the slowest solution because you need to ensure that there are no 
inconsistencies even in case of a server crash. If you have a look at what 
CouchDB does you will see what I mean. You need to do much more syncing than 
in (1).
(3) could be nearly as fast as (1)

Startup behaviour:

(1) this is the slowest solution
(2) this is the fastest solution
(3) depends: with a clean shutdown as fast as (2), with a crash as slow as (1)

So if you expect your server to crash often, then (1) might not be a good idea. 
If you expect your server to run stably, then (1) might be much faster during 
normal operations. The best of all worlds would be (3). ArangoDB currently 
uses (1), but we want to switch to (3).
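
As a rough sketch of what option (3) could look like, consider the following 
Python example. It is not ArangoDB's implementation: the class name, the file 
names "index.json" and "CLEAN", and the document layout are assumptions made 
for illustration only. The index lives in memory while the server runs, is 
written to disk on a clean shutdown, and is rebuilt from the documents if the 
clean flag is missing at startup.

```python
import json
import os

class CleanFlagIndex:
    """In-memory index, saved on clean shutdown, rebuilt after a crash."""

    def __init__(self, directory, documents):
        # File names are hypothetical, chosen only for this sketch.
        self.index_path = os.path.join(directory, "index.json")
        self.flag_path = os.path.join(directory, "CLEAN")
        self.rebuilt = False
        if os.path.exists(self.flag_path) and os.path.exists(self.index_path):
            # Clean shutdown last time: load the saved index (fast start).
            with open(self.index_path) as f:
                self.index = json.load(f)
        else:
            # First start or crash: rebuild from the documents (slow start).
            self.index = {}
            for doc_id, doc in documents.items():
                self.index.setdefault(doc["key"], []).append(doc_id)
            self.rebuilt = True
        # Drop the flag while running, so a crash leaves the index dirty.
        if os.path.exists(self.flag_path):
            os.remove(self.flag_path)

    def insert(self, doc_id, key):
        # Memory-only update during normal operation, as in option (1).
        self.index.setdefault(key, []).append(doc_id)

    def shutdown(self):
        # Write the index first and the flag last, so that the flag
        # implies a complete index file.
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)
        with open(self.flag_path, "w") as f:
            f.write("clean")
```

A real implementation would also have to fsync the index file before writing 
the flag. The sketch leaves that out to keep the startup and shutdown logic 
visible.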


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev, Fceller
Cc: Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel
