Re: Document Partitioned Indexing

Josh Elser Wed, 30 Sep 2015 07:39:03 -0700

Tom D wrote:

Hi,


Have been doing a little reading about different distributed (text)
indexing techniques and picked up on the Document Partitioned Index
approach on Accumulo.

I am interested in the use-cases people would have for indexing data in
this way over using a distributed search service (Elastic or SolrCloud).

I can think of a few reasons, but wondered if there's something more
obvious that I'm missing?

- cell (field level) access controls


If you have this as a requirement, you're in the right place :)

- scale - I understand Accumulo will scale to thousands of nodes. I
believe there are some limitations in Elastic / Solr at about 100 nodes.

High speed ingest and random point-lookups are big architectural
features that Accumulo provides. I don't know enough about ES/Solr to
say how they compare, but I can say that these fundamentals will work
well from one to many nodes with Accumulo.

- integration with an existing schema or index in Accumulo (not sure
about this one and what benefits it would have over calling out to a
search service)

- you want to take advantage of other features in Accumulo, e.g.
Combining iterators to perform some aggregation alongside your document
partitioned index (again, can't imagine use cases here, but maybe there
are some)

Being able to leverage some of the "native" filtering aspects that
Accumulo provides (e.g. locality groups/column-family filtering,
server-side filters/iterators and combiners) results in a light-weight
client. Accumulo does the I/O-heavy work and passes back a
reduced/filtered view of just the data you need, which cuts the CPU
cycles your client spends and the amount of data sent over the wire
(improving the performance of your application).
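To make the "filter on the server, ship only the matches" idea concrete, here is a minimal in-memory sketch of a document-partitioned index. It is plain Python, not the Accumulo API: the `Shard` and `PartitionedIndex` classes, the CRC-based shard assignment, and the layout (row = shard id, column family = term, column qualifier = document id) are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


class Shard:
    """Hypothetical stand-in for one partition (one row) of the index."""

    def __init__(self):
        # term -> ids of the documents in this shard that contain the term
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def intersect(self, terms):
        # Plays the role of a server-side iterator: the intersection runs
        # next to the data, so only matching doc ids leave the shard.
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        if not sets:
            return set()
        return set.intersection(*sets)


class PartitionedIndex:
    """Hypothetical client-side view over all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # All terms of one document land in the same shard, which is what
        # makes the per-shard intersection valid.
        n = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[n].index(doc_id, text)

    def search(self, *terms):
        # The client only unions the already-filtered per-shard results.
        hits = set()
        for shard in self.shards:
            hits |= shard.intersect(terms)
        return sorted(hits)
```

A query such as `PartitionedIndex(4).search("accumulo", "nodes")` touches every shard, but each shard returns only the ids of documents containing both terms, so the client never sees the full posting lists.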

- more control over 'messy data', e.g. partial duplicates that need
merging at ingest

Maybe? Not requiring a fixed schema on each row is definitely a perk of
Accumulo, but data cleansing isn't necessarily solved by Accumulo. You
still need to know what you put into it.

However, being able to aggregate multiple updates to a Cell/Value via
Accumulo Combiners can be a very powerful tool that simplifies your
ingest logic.
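As a rough illustration of why that simplifies ingest, the sketch below models a table that keeps every write to a cell and folds them together when the cell is read. It is a toy in plain Python, not Accumulo code: `CombiningTable` and its methods are hypothetical names, and passing `sum` only mimics what a summing combiner would do.

```python
from collections import defaultdict


class CombiningTable:
    """Toy table: stores every write, combines per cell at scan time."""

    def __init__(self, combine):
        # combine: function that reduces a list of values to one value
        self.combine = combine
        # (row, family, qualifier) -> every value written to that cell
        self.cells = defaultdict(list)

    def put(self, row, family, qualifier, value):
        # Ingest is a blind write: no read-modify-write round trip.
        self.cells[(row, family, qualifier)].append(value)

    def scan(self):
        # Readers see a single combined value per cell, in key order.
        for key in sorted(self.cells):
            yield key, self.combine(self.cells[key])


table = CombiningTable(sum)
for _ in range(3):
    table.put("doc1", "stats", "views", 1)
table.put("doc2", "stats", "views", 5)
results = list(table.scan())
```

The ingest side only ever writes a delta (here `1` or `5`); the merge of partial updates happens inside the table, which is the part a combiner would take off the client's hands.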

Are there others? Be interesting to hear if people use this indexing
strategy.

It's definitely a common indexing strategy and you've identified a lot
of the perks that Accumulo provides. The specific requirements of your
application will determine how exactly you will leverage the features.
Let us know, we can help give some pointers on how to go about this :)

Many thanks.
