Hi,

I am trying to find the best way to index document data by time in Accumulo. The core objective is to make time-based queries fast and efficient. I have two types of dates (there may be more) and I want to index and query data on both.

Since I have two timestamps (dates), I am considering a different layout for each. For the first index, I store the time of one timestamp in the row ID. This way I can build partial start and end row IDs and pass them as a range to the scanner. For the other, I group the index entries by time, say per hour or per minute (one minute of data, around 2500 docs, goes into a single row). The row ID therefore contains the "time", the CF contains the "field/value" and the CQ contains the "DocId".
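To make the two layouts concrete, here is a minimal Python sketch of how the keys could be built. It does not use the Accumulo API. The fixed-width UTC timestamp format, the "_" separator, the assumption that the first index appends the doc ID to the timestamp, and all helper names are illustrative assumptions, not details from my actual tables.

```python
from datetime import datetime, timezone

MINUTE_FMT = "%Y%m%d%H%M"      # fixed width, so string order == time order
SECOND_FMT = "%Y%m%d%H%M%S"

def per_doc_row(ts, doc_id):
    """Layout 1 (assumed): one row per document, row ID = timestamp + doc ID."""
    return ts.strftime(SECOND_FMT) + "_" + doc_id

def bucket_entry(ts, field, value, doc_id):
    """Layout 2: row = minute bucket, CF = field/value, CQ = doc ID."""
    return (ts.strftime(MINUTE_FMT), field + "/" + value, doc_id)

def minute_range(start, end):
    """Partial start/end row IDs for a half-open [start, end) row range."""
    return (start.strftime(MINUTE_FMT), end.strftime(MINUTE_FMT))

ts = datetime(2015, 10, 1, 9, 30, 15, tzinfo=timezone.utc)
row1 = per_doc_row(ts, "doc42")
entry2 = bucket_entry(ts, "city", "Gurgaon", "doc42")
lo, hi = minute_range(
    datetime(2015, 10, 1, 9, 30, tzinfo=timezone.utc),
    datetime(2015, 10, 1, 9, 31, tzinfo=timezone.utc),
)

# Both layouts fall inside the same lexicographic row range.
in_range_1 = lo <= row1 < hi
in_range_2 = lo <= entry2[0] < hi
```

With these formats the same pair of partial row IDs can serve as the scan range for either index; the layouts differ only in how many rows the range covers and in where the doc ID lives.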

(1) If I fetch a "field/value" as the CF for the same time range from both indexes, which one would be faster? (2) If I create locality groups dynamically for every CF value (field/value), and there are around 10000 distinct field/values in total (say an index over location/city, with 100000 or more documents indexed per city on average), that means 10000 locality groups. How will that affect query performance?

Thanks
Mohit Kaushik

On 09/30/2015 08:57 PM, Adam Fuchs wrote:
Hi Tom,

Sqrrl uses a document-distributed indexing strategy extensively. On top of the reasons you mentioned, we also like the ability to explicitly structure our index entries in both information content and sort order. This gives us the ability to do interesting things like build custom indexes and do joins between graph indexes and term indexes.

Eventually, I'd like to see Accumulo build out explicit support for this type of indexing in the core as an embedded secondary indexing capability. That would solve several of the challenges around compatibility with other Accumulo features and usage patterns.

Cheers,
Adam


On Wed, Sep 30, 2015 at 3:48 AM, Tom D wrote:

    Hi,

    Have been doing a little reading about different distributed
    (text) indexing techniques and picked up on the Document
    Partitioned Index approach on Accumulo.

    I am interested in the use-cases people would have for indexing
    data in this way over using a distributed search service (Elastic
    or SolrCloud).

    I can think of a few reasons, but wondered if there's something
    more obvious that I'm missing?

    - cell (field level) access controls

    - scale - I understand Accumulo will scale to thousands of nodes.
    I believe there are some limitations in Elastic / Solr at about
    100 nodes.

    - integration with an existing schema or index in Accumulo (not
    sure about this one and what benefits it would have over calling
    out to a search service)

    - you want to take advantage of other features in Accumulo, e.g.
    Combining iterators to perform some aggregation alongside your
    document partitioned index (again, can't imagine use cases here,
    but maybe there are some)

    - more control over 'messy data', e.g. partial duplicates that need
    merging at ingest

    Are there others? Be interesting to hear if people use this
    indexing strategy.

    Many thanks.




