Accumulo indexing social media data

mohit.kaushik Wed, 01 Jul 2015 00:17:51 -0700

Hi,

We have a social media application that currently uses MongoDB to serve documents. We have decided to move it to Accumulo. I am designing the schema and indexing approach, but I am having some difficulty managing indexes and have a few concerns about generating UUIDs in Accumulo.

UUID: The data is being indexed in MongoDB 24 hours a day. MongoDB generates a 12-byte UUID that sorts on the current time and works well in a multi-user, multi-process environment (<time> <MAC address> <process id> <client counter>), which is perfect. But if I concatenate the time, MAC address, process id, and client counter, they come to around 28 to 30 characters, which means around 60 bytes. And if I store the ID in reverse order so that the latest document shows on top, the size would double (more than 120 bytes), as described by David Medinets. Is there any way to store this UUID in a smaller size, or any other efficient way to generate a UUID that is reverse sorted on the current time?
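
To make the size concern concrete, here is a minimal sketch of one possible approach, not a confirmed Accumulo recipe. It assumes the ID is stored as raw bytes (Accumulo key components are byte arrays sorted lexicographically) and that subtracting the timestamp from the largest 32-bit value is an acceptable way to reverse the sort order. The function name `reverse_time_id`, the 4/3/2/3 byte split, and the `machine` argument are hypothetical choices that mirror the MongoDB layout described above.

```python
import itertools
import os
import struct
import time
from typing import Optional

_MAX_U32 = 0xFFFFFFFF
# Per-process counter; a hypothetical stand-in for the client counter.
_counter = itertools.count()


def reverse_time_id(machine: bytes, now: Optional[int] = None) -> bytes:
    """Return a 12-byte ID whose byte order sorts newer timestamps first."""
    seconds = int(time.time()) if now is None else now
    inverted = _MAX_U32 - seconds        # later time -> smaller prefix
    pid = os.getpid() & 0xFFFF           # low 2 bytes of the process id
    count = next(_counter) & 0xFFFFFF    # low 3 bytes of the counter
    return (
        struct.pack(">I", inverted)      # 4 bytes, big-endian
        + machine[:3].ljust(3, b"\x00")  # 3 bytes of machine identifier
        + struct.pack(">H", pid)         # 2 bytes
        + count.to_bytes(3, "big")       # 3 bytes
    )
```

With this layout the ID stays at 12 bytes, and an ID generated one second later compares lower than an earlier one. IDs created within the same second are not strictly reverse ordered, because only the timestamp prefix is inverted.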

Indexing: I need to retrieve documents from an index based on queries on fields. I found two approaches to indexing documents in Accumulo:

(1) Term-based reverse indexing, and
(2) Document-partitioned indexing

as Adam described in this video: https://www.youtube.com/watch?v=Ck70G6OuGT4. If I use document-partitioned indexing, the layout would be as follows.


Row                 <partition id>
                    /            \
CF              <doc>            <index>
                  |                 |
CQ             <UUID>            <term>
                  |                 |
               <field>           <UUID>
                  |                 |
                  |              <field>
                  |
Value          <value>
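
The layout above can be sketched as plain (row, CF, CQ, value) tuples. This is only an illustration under assumptions that are not part of the design itself: a null byte separates the components of the column qualifier, the partition id is a CRC32 hash of the document id modulo a partition count, and terms are produced by lowercasing and splitting on whitespace. The names `partition_id` and `entries_for` are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

# (row, column family, column qualifier, value)
Entry = Tuple[bytes, bytes, bytes, bytes]
SEP = b"\x00"  # assumed separator between qualifier components


def partition_id(doc_id: bytes, num_partitions: int) -> bytes:
    """Hash the document id into a fixed-width partition id (assumed scheme)."""
    bucket = zlib.crc32(doc_id) % num_partitions
    return str(bucket).zfill(4).encode()


def entries_for(doc_id: bytes, fields: Dict[str, str],
                num_partitions: int) -> List[Entry]:
    """Build the doc and index entries for one document, in key order."""
    row = partition_id(doc_id, num_partitions)
    out: List[Entry] = []
    for field, value in fields.items():
        name = field.encode()
        # doc branch: CF=doc, CQ=<UUID>\0<field>, Value=<value>
        out.append((row, b"doc", doc_id + SEP + name, value.encode()))
        # index branch: CF=index, CQ=<term>\0<UUID>\0<field>, empty value
        for term in set(value.lower().split()):
            cq = term.encode() + SEP + doc_id + SEP + name
            out.append((row, b"index", cq, b""))
    return sorted(out)
```

Because the document and all of its index entries share one row, a term lookup and the document fetch for that partition hit the same row. A binary document id that can itself contain a null byte would need a fixed-length or escaped encoding instead of the separator assumed here.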

If I just want to serve documents based on a single-term query, would it be better to store <term> in the column family, so that I can restrict a scan to a single term in the CF? That would reduce the data scanned by a good factor. What other pros and cons does this approach have? And how should I decide on the partition id if I am storing tweets on a 3-node cluster?
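
To illustrate the alternative being asked about, here is a rough sketch that moves the term into the column family and simulates a scan restricted to that family. It is a model in plain Python, not Accumulo client code; in the Java client the equivalent restriction would go through `Scanner.fetchColumnFamily`. The `t:` prefix (used to keep term families from colliding with the `doc` family), the null-byte separator, and both function names are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# (row, column family, column qualifier, value)
Entry = Tuple[bytes, bytes, bytes, bytes]
SEP = b"\x00"  # assumed separator between qualifier components


def index_entry_term_in_cf(row: bytes, term: bytes,
                           doc_id: bytes, field: bytes) -> Entry:
    """Index entry with the term in the CF: CF=t:<term>, CQ=<UUID>\\0<field>."""
    return (row, b"t:" + term, doc_id + SEP + field, b"")


def scan_family(entries: Iterable[Entry], family: bytes) -> List[bytes]:
    """Simulate a scan limited to one column family; return document ids."""
    return [
        cq.split(SEP, 1)[0]
        for _row, cf, cq, _value in sorted(entries)
        if cf == family
    ]


entries = [
    index_entry_term_in_cf(b"0001", b"accumulo", b"id2", b"text"),
    index_entry_term_in_cf(b"0001", b"mongodb", b"id1", b"text"),
    index_entry_term_in_cf(b"0001", b"accumulo", b"id1", b"text"),
]
matches = scan_family(entries, b"t:accumulo")
```

In this model, `matches` holds only the ids indexed under the single requested term, in key order within the row.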


Regards
Mohit Kaushik
