Tom D wrote:
Hi,

Have been doing a little reading about different distributed (text)
indexing techniques and picked up on the Document Partitioned Index
approach on Accumulo.

I am interested in the use-cases people would have for indexing data in
this way over using a distributed search service (Elastic or SolrCloud).

I can think of a few reasons, but wondered if there's something more
obvious that I'm missing?

- cell (field level) access controls

If you have this as a requirement, you're in the right place :)

- scale - I understand Accumulo will scale to thousands of nodes. I
believe there are some limitations in Elastic / Solr at about 100 nodes.

High speed ingest and random point-lookups are big architectural features that Accumulo provides. I don't know enough about ES/Solr to say how they compare, but I can say that these fundamentals will work well from one to many nodes with Accumulo.

- integration with an existing schema or index in Accumulo (not sure
about this one and what benefits it would have over calling out to a
search service)

- you want to take advantage of other features in Accumulo, e.g.
Combining iterators to perform some aggregation alongside your document
partitioned index (again, can't imagine use cases here, but maybe there
are some)

Being able to leverage some of the "native" filtering aspects that Accumulo provides (e.g. locality groups/column-family filtering, server-side filters/iterators and combiners) results in a lightweight client. Accumulo does the I/O-heavy operations and passes back a reduced, filtered view of just the data you need. That cuts the CPU cycles your client spends and the amount of data sent over the wire, which improves the performance of your application.
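To make the server-side filtering point concrete for a document-partitioned index, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Accumulo API: the names `Shard`, `DocPartitionedIndex` and `NUM_SHARDS` are made up for illustration, and hashing the document ID to pick a shard is just one assumed partitioning choice. The idea it shows is that each shard holds both the documents and the term index for those documents, so a multi-term query can be intersected inside each shard and only the matching document IDs travel back to the client.

```python
from collections import defaultdict
from zlib import crc32

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this example


class Shard:
    """One partition: holds documents plus the term index for those documents."""

    def __init__(self):
        self.docs = {}                 # doc_id -> original text
        self.index = defaultdict(set)  # term -> set of doc_ids in this shard

    def ingest(self, doc_id, text):
        self.docs[doc_id] = text
        for term in set(text.lower().split()):
            self.index[term].add(doc_id)

    def intersect(self, terms):
        # Stand-in for the work a server-side iterator would do:
        # intersect the posting lists locally, return only the matches.
        postings = [self.index.get(term, set()) for term in terms]
        return set.intersection(*postings) if postings else set()


class DocPartitionedIndex:
    """Routes each document to exactly one shard; queries fan out to all shards."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # Assumed partitioning rule: stable hash of the document ID.
        return self.shards[crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def ingest(self, doc_id, text):
        self._shard_for(doc_id).ingest(doc_id, text)

    def query(self, *terms):
        wanted = [term.lower() for term in terms]
        hits = set()
        for shard in self.shards:
            hits |= shard.intersect(wanted)  # only matching IDs leave the shard
        return sorted(hits)
```

Because a document and all of its index entries live in the same shard, the intersection never needs data from another shard, which is what lets the client stay thin.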

- more control over 'messy data', e.g. partial duplicates that need
merging at ingest

Maybe? Not requiring a fixed schema on each row is definitely a perk of Accumulo, but data cleansing isn't necessarily solved by Accumulo. You still need to know what you put into it.

However, being able to aggregate multiple updates to a Cell/Value via Accumulo Combiners can be a very powerful tool that simplifies your ingest logic.
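As a rough illustration of why that simplifies ingest, here is a small pure-Python model of the combiner idea. It does not use the Accumulo API; `SummingTable` is a hypothetical stand-in. The assumption it models is a summing combiner configured on a column: writers just append partial values for a key without reading first, and the versions are folded into one value when the data is read.

```python
from collections import defaultdict


class SummingTable:
    """Toy table: every write is kept as its own version and summed on read."""

    def __init__(self):
        # (row, column) -> list of integer values written so far
        self._versions = defaultdict(list)

    def put(self, row, column, value):
        # No read-modify-write: the writer only appends its partial value.
        self._versions[(row, column)].append(value)

    def scan(self):
        # Model of the combiner step: collapse all versions of a key into one.
        for key in sorted(self._versions):
            yield key, sum(self._versions[key])


if __name__ == "__main__":
    table = SummingTable()
    # Three independent ingest clients each saw the term some number of times.
    table.put("term:accumulo", "count", 1)
    table.put("term:accumulo", "count", 4)
    table.put("term:solr", "count", 2)
    for key, total in table.scan():
        print(key, total)
```

The ingest side never has to look up the current count, so concurrent writers do not need to coordinate with each other.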

Are there others? Be interesting to hear if people use this indexing
strategy.

It's definitely a common indexing strategy, and you've identified a lot of the perks that Accumulo provides. The specific requirements of your application will determine exactly how you leverage these features. Let us know what you're building and we can give some pointers on how to go about it :)

Many thanks.

