We are evaluating ArangoDB performance in space of facets calculations.
There are number of other products capable of doing the same, either via 
special API  or query language:

   - MarkLogic Facets <https://docs.marklogic.com/jsearch.facet>
   - ElasticSearch Aggregations 
   
<https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>
   - Solr Faceting 
   <https://cwiki.apache.org/confluence/display/solr/Faceting>
   - etc

We understand, there is no special API in Arango to calculate factes 
explicitly.
But in reality, it is not needed, thanks for a comprehensive AQL it can be 
easily achieved via simple query, like:

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

This query calculate a facet on *attribute1* and yields frequency in the 
form of:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]


It is saying, that across my entire collection *attribute1* took three 
forms (test-attr1-1, test-attr1-2 and test-attr1-3) with related counts 
provided.
Pretty much we run a DISTINCT query and aggregated counts.

Looks simple and clean. With only one, but really big issue - *performance*.

Provided query above runs for !*31 seconds*! on top of the test collection 
with only *8M* documents.
We have experimented with different index types, storage engines (with 
rocksdb and without), investigating explanation plans at no avail.
Test documents we use in this test are very concise with only three short 
attributes.

We would appreciate any input at this point.
Either we doing something wrong. Or ArangoDB simply is not designed to 
perform in this particular area.

btw, ultimate goal would be to run something like the following in 
under-second time:

LET docs = (FOR a IN Asset 

  FILTER a.name like 'test-asset-%'

  SORT a.name

 RETURN a)

LET attribute1 = (

 FOR a in docs 

  COLLECT attr = a.attribute1 INTO g

 RETURN { value: attr, count: length(g[*])}

)

LET attribute2 = (

 FOR a in docs 

  COLLECT attr = a.attribute2 INTO g

 RETURN { value: attr, count: length(g[*])}

)

LET attribute3 = (

 FOR a in docs 

  COLLECT attr = a.attribute3 INTO g

 RETURN { value: attr, count: length(g[*])}

)

LET attribute4 = (

 FOR a in docs 

  COLLECT attr = a.attribute4 INTO g

 RETURN { value: attr, count: length(g[*])}

)

RETURN {

  counts: (RETURN {

    total: LENGTH(docs), 

    offset: 2, 

    to: 4, 

    facets: {

      attribute1: {

        from: 0, 

        to: 5,

        total: LENGTH(attribute1)

      },

      attribute2: {

        from: 5, 

        to: 10,

        total: LENGTH(attribute2)

      },

      attribute3: {

        from: 0, 

        to: 1000,

        total: LENGTH(attribute3)

      },

      attribute4: {

        from: 0, 

        to: 1000,

        total: LENGTH(attribute4)

      }

    }

  }),

  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),

  facets: {

    attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),

    attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),

    attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),

    attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 
return a)

   }

}

Thanks!



-- 
You received this message because you are subscribed to the Google Groups 
"ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to arangodb+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to