Tom D wrote:
Hi,
Have been doing a little reading about different distributed (text)
indexing techniques and picked up on the Document Partitioned Index
approach on Accumulo.
I am interested in the use-cases people would have for indexing data in
this way over using a distributed search service (Elastic or SolrCloud).
I can think of a few reasons, but wondered if there's something more
obvious that I'm missing?
- cell (field level) access controls
If you have this as a requirement, you're in the right place :)
- scale - I understand Accumulo will scale to thousands of nodes. I
believe there are some limitations in Elastic / Solr at about 100 nodes.
High-speed ingest and random point lookups are big architectural
features that Accumulo provides. I don't know enough about ES/Solr to
say how they compare, but I can say that these fundamentals will work
well from one to many nodes with Accumulo.
- integration with an existing schema or index in Accumulo (not sure
about this one and what benefits it would have over calling out to a
search service)
- you want to take advantage of other features in Accumulo, e.g.
Combining iterators to perform some aggregation alongside your document
partitioned index (again, can't imagine use cases here, but maybe there
are some)
Being able to leverage some of the "native" filtering aspects that
Accumulo provides (e.g. locality groups/column-family filtering,
server-side filters/iterators and combiners) results in a lightweight
client. The I/O-heavy operations are done by Accumulo, which passes back
a reduced/filtered view of just the data you need. That cuts both the
CPU cycles spent by your client and the amount of data sent over the
wire, which improves the performance of your application.
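
To make the idea concrete, here is a rough in-memory sketch of a
document-partitioned layout and a per-shard term intersection. It does
not use the Accumulo client API. The key layout (row = shard, column
family = term, column qualifier = document ID) is the commonly described
one, and NUM_SHARDS, shard_for, build_index and query_all_terms are
hypothetical names chosen for illustration only.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; a real table would size this to the cluster


def shard_for(doc_id: str) -> str:
    """Assign a document to a shard (the row) by hashing its ID."""
    return f"shard_{zlib.crc32(doc_id.encode()) % NUM_SHARDS:02d}"


def index_entries(doc_id: str, text: str):
    """Yield (row, column_family, column_qualifier) keys for one document."""
    row = shard_for(doc_id)
    for term in sorted(set(text.lower().split())):
        yield (row, term, doc_id)


def build_index(docs: dict) -> dict:
    """Group keys as shard -> term -> set of document IDs."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for row, term, qualifier in index_entries(doc_id, text):
            index[row][term].add(qualifier)
    return index


def query_all_terms(index: dict, terms: list) -> list:
    """Intersect the terms inside each shard; return only matching doc IDs.

    Every term of a document lives in that document's shard, so the
    intersection never has to cross shards. This stands in for the work
    a server-side iterator would do before anything reaches the client.
    """
    if not terms:
        return []
    hits = []
    for row in sorted(index):
        postings = [index[row].get(t.lower(), set()) for t in terms]
        hits.extend(set.intersection(*postings))
    return sorted(hits)


docs = {
    "doc1": "accumulo scales to many nodes",
    "doc2": "accumulo iterators filter data server side",
    "doc3": "solr and elastic are search services",
}
index = build_index(docs)
matches = query_all_terms(index, ["accumulo", "nodes"])
```

Here matches holds only the IDs of documents containing every query
term, so the client receives a short list of document IDs rather than
the full posting lists.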
- more control over 'messy data', e.g. partial duplicates that need
merging at ingest
Maybe? Not requiring a fixed schema on each row is definitely a perk of
Accumulo, but data cleansing isn't necessarily solved by Accumulo. You
still need to know what you put into it.
However, being able to aggregate multiple updates to a Cell/Value via
Accumulo Combiners can be a very powerful tool that simplifies your
ingest logic.
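
As a rough illustration of why this simplifies ingest, the sketch below
mimics the behaviour in plain Python: writers only append values, and a
combining function folds all values for a key into one at read time. It
is a toy model, not the Accumulo API (Accumulo ships real combiners such
as SummingCombiner); CombiningTable, summing_combiner and the row and
column names are hypothetical.

```python
from collections import defaultdict


def summing_combiner(values):
    """Reduce every value written for one key into a single value."""
    return sum(values)


class CombiningTable:
    """Toy table: writes are appended, a combiner folds them on read."""

    def __init__(self, combiner):
        self._versions = defaultdict(list)
        self._combiner = combiner

    def put(self, row, column, value):
        # Ingest only appends; the writer never reads the current value.
        self._versions[(row, column)].append(value)

    def scan(self):
        # Yield one combined value per key, in sorted key order.
        for key in sorted(self._versions):
            yield key, self._combiner(self._versions[key])


table = CombiningTable(summing_combiner)
table.put("doc42", "count:views", 1)
table.put("doc42", "count:views", 1)
table.put("doc42", "count:views", 1)
table.put("doc7", "count:views", 5)
result = dict(table.scan())
```

The three separate updates to doc42 come back as a single summed value,
without the ingest code ever doing a read-modify-write.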
Are there others? It would be interesting to hear if people use this
indexing strategy.
It's definitely a common indexing strategy and you've identified a lot
of the perks that Accumulo provides. The specific requirements of your
application will determine how exactly you will leverage the features.
Let us know and we can give some pointers on how to go about this :)
Many thanks.