Hi,
I am trying to find the best way to index document data by time in
Accumulo. The core objective is to make time-based queries fast and
efficient. I have two types of dates (there may be more later) and I
want to index and query data on both.
Since I have two timestamps (dates), I am considering two index
layouts. For the first timestamp, I create an index that stores the
time in the row ID. This way I can build partial start and end row IDs
and pass them as a range to the scanner. For the other, I group the
document index entries by time, say per hour or per minute (one minute
of data, around 2500 docs, goes into a single row). In that layout the
row ID contains the time, the CF contains the field/value and the CQ
contains the DocId.
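To make the two layouts concrete, here is a minimal Python sketch that
models them with plain tuples and a sorted list instead of a real
Accumulo table. The key formats (zero-padded epoch seconds, a "_"
separator, a "~" suffix on the end row) and all helper names are
assumptions for illustration, not something fixed by the design above.

```python
from bisect import bisect_left, bisect_right

def doc_row(ts: int, doc_id: str) -> str:
    """Layout 1: one row per document, row ID = timestamp + doc ID."""
    return f"{ts:010d}_{doc_id}"

def minute_row(ts: int) -> str:
    """Layout 2: one row per minute, row ID = timestamp truncated to the minute."""
    return f"{ts - ts % 60:010d}"

def layout1_key(ts, field_value, doc_id):
    # (row, CF, CQ); the CQ is left empty in this sketch
    return (doc_row(ts, doc_id), field_value, "")

def layout2_key(ts, field_value, doc_id):
    # (row, CF, CQ) as described: time, field/value, DocId
    return (minute_row(ts), field_value, doc_id)

def scan_rows(sorted_keys, start_row, end_row):
    """Return keys whose row ID lies in [start_row, end_row], like a row range scan."""
    rows = [k[0] for k in sorted_keys]
    return sorted_keys[bisect_left(rows, start_row):bisect_right(rows, end_row)]

# Hypothetical sample data: (timestamp in seconds, doc ID, field/value)
docs = [(100, "d1", "city=Delhi"), (125, "d2", "city=Pune"),
        (130, "d3", "city=Delhi"), (185, "d4", "city=Delhi")]
index1 = sorted(layout1_key(ts, fv, d) for ts, d, fv in docs)
index2 = sorted(layout2_key(ts, fv, d) for ts, d, fv in docs)

# Query the minute 120..179 on both layouts.
# Layout 1 uses partial row IDs; "~" sorts after "_" and ASCII letters/digits.
hits1 = scan_rows(index1, f"{120:010d}", f"{179:010d}~")
hits2 = scan_rows(index2, minute_row(120), minute_row(179))
```

Both scans return the two documents stamped 125 and 130. Layout 1
spreads them over two rows, while layout 2 keeps them in one row
ordered by field/value.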
(1) If I fetch a field/value as CF over the same time range from both
indexes, which one would be faster?
(2) If I create locality groups dynamically for every value in the CF
(field/value), and there are around 10000 distinct field/values in
total (say an index over location/city, with 100000 or more documents
indexed per city on average), that means 10000 locality groups. How
will this affect query performance?
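Question (1) can be explored on paper before benchmarking. The sketch
below is a toy model, not a measurement of Accumulo: it sorts
(row, CF, CQ) tuples the way a sorted key-value store would and counts
how many separate contiguous runs the entries for one field/value form.
The key formats and the count_runs helper are hypothetical.

```python
def count_runs(sorted_keys, cf):
    """Count contiguous runs of keys whose column family equals cf."""
    runs, in_run = 0, False
    for _row, family, _cq in sorted_keys:
        match = family == cf
        if match and not in_run:
            runs += 1
        in_run = match
    return runs

# Six hypothetical documents inside one minute (timestamps 120..125),
# alternating between two cities.
docs = [(120 + i, f"d{i}", "city=Delhi" if i % 2 == 0 else "city=Pune")
        for i in range(6)]

# Layout 1: one row per document (timestamp + doc ID in the row ID).
per_doc = sorted((f"{ts:010d}_{d}", fv, "") for ts, d, fv in docs)
# Layout 2: one row per minute, CF = field/value, CQ = DocId.
per_minute = sorted((f"{ts - ts % 60:010d}", fv, d) for ts, d, fv in docs)

runs_per_doc = count_runs(per_doc, "city=Delhi")
runs_per_minute = count_runs(per_minute, "city=Delhi")
```

In this toy data the per-minute layout keeps the matching entries
adjacent within the row, while the per-document layout interleaves
them with other values. Whether that translates into faster scans on a
real cluster depends on factors this model ignores, so it is only a
starting point for a benchmark.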
Thanks
Mohit Kaushik
On 09/30/2015 08:57 PM, Adam Fuchs wrote:
Hi Tom,
Sqrrl uses a document-distributed indexing strategy extensively. On
top of the reasons you mentioned, we also like the ability to
explicitly structure our index entries in both information content and
sort order. This gives us the ability to do interesting things like
build custom indexes and do joins between graph indexes and term indexes.
Eventually, I'd like to see Accumulo build out explicit support for
this type of indexing in the core as an embedded secondary indexing
capability. That would solve several of the challenges around
compatibility with other Accumulo features and usage patterns.
Cheers,
Adam
On Wed, Sep 30, 2015 at 3:48 AM, Tom D <[email protected]> wrote:
Hi,
Have been doing a little reading about different distributed
(text) indexing techniques and picked up on the Document
Partitioned Index approach on Accumulo.
I am interested in the use-cases people would have for indexing
data in this way over using a distributed search service (Elastic
or SolrCloud).
I can think of a few reasons, but wondered if there's something
more obvious that I'm missing?
- cell (field level) access controls
- scale - I understand Accumulo will scale to thousands of nodes.
I believe there are some limitations in Elastic / Solr at about
100 nodes.
- integration with an existing schema or index in Accumulo (not
sure about this one and what benefits it would have over calling
out to a search service)
- you want to take advantage of other features in Accumulo, e.g.
Combining iterators to perform some aggregation alongside your
document partitioned index (again, can't imagine use cases here,
but maybe there are some)
- more control over 'messy data', e.g. partial duplicates that need
merging at ingest
Are there others? Be interesting to hear if people use this
indexing strategy.
Many thanks.
--
*Mohit Kaushik*
Software Engineer
A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
*Tel:*+91 (124) 4969352 | *Fax:*+91 (124) 4033553