Re: Document Partitioned Indexing

mohit.kaushik Thu, 01 Oct 2015 02:56:16 -0700

Hi,

I am trying to find the best way to index document data on the basis of time in Accumulo. The core objective is to make time-based queries fast and efficient. I have two types of date (there may be more) and I want to index and query data on both.

As I have two timestamps (dates), for the first index I store the time in the row ID for one timestamp. This way I can create partial start and end IDs and pass them as a range to the scanner. For the other, say I group the document index on the basis of time, per hour or per minute (one minute of data, around 2500 docs, goes into a single row). The row ID therefore contains the "time", the CF contains the "field/value", and the CQ contains the "DocId".
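
A minimal sketch of the two key layouts described above, in plain Python with no Accumulo client involved. The fixed-width timestamp formats, the `_` separator, and all function names are assumptions made for illustration; the only property the sketch relies on is that Accumulo sorts rows lexicographically, so a fixed-width time prefix makes string order match time order.

```python
import bisect
from datetime import datetime

# Assumed formats: fixed width, so lexicographic order equals time order.
TS_FMT = "%Y%m%d%H%M%S"    # layout 1: full timestamp in the row ID
MINUTE_FMT = "%Y%m%d%H%M"  # layout 2: one row per minute bucket


def per_doc_row(ts, doc_id):
    """Layout 1: row ID = timestamp + separator + document ID."""
    return f"{ts.strftime(TS_FMT)}_{doc_id}"


def bucketed_key(ts, field, value, doc_id):
    """Layout 2: (row, CF, CQ) = (minute bucket, field/value, DocId)."""
    return (ts.strftime(MINUTE_FMT), f"{field}/{value}", doc_id)


def partial_range(start, end, fmt):
    """Partial start/end row IDs for a time window (end is exclusive)."""
    return start.strftime(fmt), end.strftime(fmt)


def scan(sorted_rows, start_row, end_row):
    """Stand-in for a range scan over lexicographically sorted row IDs."""
    lo = bisect.bisect_left(sorted_rows, start_row)
    hi = bisect.bisect_left(sorted_rows, end_row)
    return sorted_rows[lo:hi]


rows = sorted([
    per_doc_row(datetime(2015, 10, 1, 2, 56, 16), "doc1"),
    per_doc_row(datetime(2015, 10, 1, 2, 56, 40), "doc2"),
    per_doc_row(datetime(2015, 10, 1, 2, 57, 1), "doc3"),
])
start_row, end_row = partial_range(
    datetime(2015, 10, 1, 2, 56), datetime(2015, 10, 1, 2, 57), TS_FMT
)
hits = scan(rows, start_row, end_row)
key = bucketed_key(datetime(2015, 10, 1, 2, 56, 16), "city", "Gurgaon", "doc1")
```

In layout 1 the time window alone selects the rows, and any field/value restriction has to be applied to what comes back. In layout 2 the same window maps to a handful of minute rows, and the field/value is selected by fetching one column family within them.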

(1) If I fetch a "field/value" as CF for the same time range from both indexes, which one would be faster?

(2) If I create locality groups dynamically for every value in the CF (field/value) and there are around 10000 distinct field/values in total (say an index over location/city, with 100000 or more documents indexed per city on average), that means 10000 locality groups. How will this affect query performance?


Thanks
Mohit Kaushik

On 09/30/2015 08:57 PM, Adam Fuchs wrote:

Hi Tom,
Sqrrl uses a document-distributed indexing strategy extensively. On top of the reasons you mentioned, we also like the ability to explicitly structure our index entries in both information content and sort order. This gives us the ability to do interesting things like build custom indexes and do joins between graph indexes and term indexes.
Eventually, I'd like to see Accumulo build out explicit support for this type of indexing in the core as an embedded secondary indexing capability. That would solve several of the challenges around compatibility with other Accumulo features and usage patterns.
Cheers,
Adam
On Wed, Sep 30, 2015 at 3:48 AM, Tom D <[email protected]> wrote:
    Hi,

    Have been doing a little reading about different distributed
    (text) indexing techniques and picked up on the Document
    Partitioned Index approach on Accumulo.

    I am interested in the use-cases people would have for indexing
    data in this way over using a distributed search service (Elastic
    or SolrCloud).

    I can think of a few reasons, but wondered if there's something
    more obvious that I'm missing?

    - cell (field level) access controls

    - scale - I understand Accumulo will scale to thousands of nodes.
    I believe there are some limitations in Elastic / Solr at about
    100 nodes.

    - integration with an existing schema or index in Accumulo (not
    sure about this one and what benefits it would have over calling
    out to a search service)

    - you want to take advantage of other features in Accumulo, e.g.
    Combining iterators to perform some aggregation alongside your
    document partitioned index (again, can't imagine use cases here,
    but maybe there are some)

    - more control over 'messy data', e.g. partial duplicates that need
    merging at ingest

    Are there others? Be interesting to hear if people use this
    indexing strategy.

    Many thanks.



