[Wikidata-bugs] [Maniphest] [Commented On] T88549: Investigate ArangoDB for Wikidata Query

Manybubbles Fri, 06 Feb 2015 08:02:30 -0800

Manybubbles added a comment.

In https://phabricator.wikimedia.org/T88549#1018178, @Fceller wrote:


> Hi, I'm the CTO of ArangoDB, so my comments are most certainly biased. I 
> still would like to tell you about our opinions on the raised issues, namely 
> full-text indexes and blueprint.


Thanks for replying!

> (1) We do not believe that TP is helpful in a shared environment. Gremlin is 
> a nice language, but it requires you to move a lot of data into the client. 
> This works very well if you can embed the database and keep it in the same 
> process space. As soon as you need to shard the data and spread it to many 
> servers you will move a lot of data between Gremlin and the DBservers. 
> Therefore we decided to create a Javascript version of Gremlin which runs 
> directly on the shards thus minimising the amount of moved data. Therefore it 
> is indeed true, that we did not add support for TP3 because we believe it 
> will be of limited use.


Have a look at what they are working on now in their master branch - I think 
they've struck on a good notion: they are _heavily_ deprecating predicates in 
favor of anonymous filters.  And those should be possible to optimize.
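
To illustrate why that matters for data movement, here is a toy sketch in 
Python.  It is not TinkerPop's or ArangoDB's API: the shard layout and both 
function names are made up for the example.  The point is only that an opaque 
predicate has to see every vertex on the client, while a filter the engine can 
inspect can be evaluated on each shard.

```python
# Hypothetical shards: each one holds a list of vertex dicts.
SHARDS = [
    [{"id": 1, "type": "human"}, {"id": 2, "type": "city"}],
    [{"id": 3, "type": "human"}, {"id": 4, "type": "river"}],
]


def client_side(predicate):
    """Opaque predicate: ship every vertex to the client, then filter there."""
    moved = [v for shard in SHARDS for v in shard]
    return [v for v in moved if predicate(v)], len(moved)


def shard_side(field, value):
    """Inspectable filter: each shard filters locally and ships only matches."""
    moved = [v for shard in SHARDS for v in shard if v.get(field) == value]
    return moved, len(moved)


humans_a, moved_a = client_side(lambda v: v["type"] == "human")
humans_b, moved_b = shard_side("type", "human")
```

Both calls return the same vertices, but the second moves only the matching 
ones between the shards and the client.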

> (2) Fulltext indexes are not our main expertise. We think that search engines 
> like ElasticSearch, Solr are much better in this - especially when it comes 
> to stemming, different languages, phonetic searches. There is an elastic 
> search plugin to use ElasticSearch as fulltext search engine for ArangoDB. 
> The fulltext index is indeed very slow when building. We want to speed up the 
> process and hopefully can improve there over time (see also the next bullet 
> point). I assume that you are using a fulltext index in your example, right?


That makes sense.  I don't plan on using full text indexes in this project at 
all unless something unexpected comes up.  Even so, we have much more 
experience with Elasticsearch and Lucene so it'd make sense to go there.

> (3) We decided to keep the indexes only in memory. The reasons are as follows.
> 
> There are various possibilities:
> 
> (1) use memory-only indexes (this is currently implemented in ArangoDB)
> (2) use disk-based indexes (this is currently implemented in CouchDB)
> (3) disk-backed with a file-system-like clean flag
> (4) other solutions like keeping only parts in memory, using memory as a 
> cache, and so on are also possible
> 
> There is a trade-off:
> 
> Runtime behaviour:
> 
> (1) this is the fastest solution
> (2) this is the slowest solution because you need to ensure that there are 
> no inconsistencies even in case of a server crash. If you have a look at what 
> CouchDB does you will see what I mean. You need to do much more syncing than 
> in (1).
> (3) could be nearly as fast as (1)
> 
> Startup behaviour:
> 
> (1) this is the slowest solution
> (2) this is the fastest solution
> (3) depends: with a clean shutdown as fast as (2), with a crash as slow as 
> (1)
> 
> So if you expect your server to crash often, then (1) might not be a good 
> idea. If you expect your server to run stably, then (1) might be much faster 
> during normal operations. The best of all worlds would be (3). ArangoDB 
> currently uses (1), but we want to switch to (3).
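
Just to check that I'm reading option (3) right, here is a rough sketch of the 
clean-flag idea in Python.  This is only my understanding, not how ArangoDB 
does or would implement it: the `CLEAN` marker file, the `index.json` dump and 
all function names are hypothetical.

```python
import json
import os


def rebuild_index(documents):
    """Slow path: scan every document and rebuild the index in memory."""
    index = {}
    for doc_id, doc in documents.items():
        index.setdefault(doc["key"], []).append(doc_id)
    return index


def load_index(data_dir, documents):
    """Return (index, how), where how is 'loaded' or 'rebuilt'."""
    flag = os.path.join(data_dir, "CLEAN")       # hypothetical marker file
    dump = os.path.join(data_dir, "index.json")  # hypothetical index dump
    if os.path.exists(flag) and os.path.exists(dump):
        # Last shutdown was clean: trust the dump (fast, like option 2).
        with open(dump) as f:
            index = json.load(f)
        how = "loaded"
    else:
        # Crash or first start: rebuild from the data (slow, like option 1).
        index = rebuild_index(documents)
        how = "rebuilt"
    # Clear the flag while running, so a crash leaves the dump untrusted.
    if os.path.exists(flag):
        os.remove(flag)
    return index, how


def shutdown(data_dir, index):
    """Clean shutdown: write the dump first, then set the flag."""
    with open(os.path.join(data_dir, "index.json"), "w") as f:
        json.dump(index, f)
        f.flush()
        os.fsync(f.fileno())
    with open(os.path.join(data_dir, "CLEAN"), "w"):
        pass
```

At runtime the index lives purely in memory, so normal operation costs the 
same as (1); the dump is only written on a clean shutdown.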


You could also go with a Lucene-like write-once behavior.  I don't know that 
it'd be a good match at all though.  It matches well with the infrequently 
updated, asynchronous nature of full text search but it feels like it'd be more 
troubling for something like ArangoDB.  Also probably more work to implement 
than clean shutdown.  Anyway, I'm sure you've spent more time thinking about it 
than I have.
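
For what it's worth, this is roughly what I mean by write-once, as a toy Python 
sketch.  It is not Lucene's actual API or on-disk format; the class and method 
names are made up.  Writes are buffered, frozen into immutable segments on 
flush, and only flushed segments are searchable.

```python
class SegmentedIndex:
    """Toy write-once index: segments are never modified after flush."""

    def __init__(self):
        self.segments = []  # immutable segments, oldest first
        self.buffer = {}    # pending writes, not yet searchable

    def add(self, term, doc_id):
        self.buffer.setdefault(term, set()).add(doc_id)

    def flush(self):
        """Freeze the buffered writes into a new immutable segment."""
        if self.buffer:
            frozen = {t: frozenset(ids) for t, ids in self.buffer.items()}
            self.segments.append(frozen)
            self.buffer = {}

    def search(self, term):
        """Consult every flushed segment; buffered writes stay invisible."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return hits

    def merge(self):
        """Compact all segments into one new segment (old ones are dropped)."""
        merged = {}
        for segment in self.segments:
            for term, ids in segment.items():
                merged.setdefault(term, set()).update(ids)
        self.segments = [{t: frozenset(ids) for t, ids in merged.items()}]
```

The delay between `add` and the write becoming visible after `flush` is fine 
for asynchronous search indexing, which is why I suspect it would be awkward 
for a database that is expected to see its own writes immediately.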


TASK DETAIL
  https://phabricator.wikimedia.org/T88549

To: Smalyshev, Manybubbles
Cc: Neunhoef, Fceller, JanZerebecki, Aklapper, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke, daniel
