Hi,
We have a social media application that currently uses MongoDB to serve
documents. We have decided to move it to Accumulo. I am designing the
schema and indexing approach, but I am having some difficulty managing
indexes and have a few concerns about generating UUIDs in Accumulo.
UUID: Data is being indexed into MongoDB 24 hours a day. MongoDB generates
a 12-byte UUID that sorts by current time and works well in a multi-user,
multi-process environment (<time> <MAC address> <process id> <client
counter>), which is perfect. But if I concatenate the time, MAC address,
process id and client counter as text, the result is around 28 to 30
characters, which means around 60 bytes. And if I store it in reverse
order so that the latest document shows on top, the size would double
(more than 120 bytes), as described by David Medinets. Is there any way to
store this UUID in less space, or any other efficient way to generate a
UUID that is reverse sorted on current time?
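
One option I have been considering is to keep the ID as raw bytes instead of text, and to invert only the timestamp so that newer IDs sort first. Below is a minimal sketch of that idea, not a tested design. The function name reverse_time_id and the 4/3/2/3 byte split are my own assumptions, loosely modelled on the MongoDB layout described above.

```python
import os
import struct
import threading
import time
import uuid

_MAX_U32 = 0xFFFFFFFF
_counter_lock = threading.Lock()
_counter = int.from_bytes(os.urandom(3), "big")  # random start, as an assumption


def reverse_time_id(now=None):
    """Return a 12-byte ID whose byte order sorts newest first.

    Hypothetical layout: 4 bytes inverted time, 3 bytes machine,
    2 bytes process id, 3 bytes counter.
    """
    global _counter
    seconds = int(time.time() if now is None else now)
    # Subtracting from the maximum makes a later time a smaller prefix.
    inverted = _MAX_U32 - (seconds & _MAX_U32)
    machine = uuid.getnode() & 0xFFFFFF   # low 3 bytes of the MAC address
    pid = os.getpid() & 0xFFFF            # low 2 bytes of the process id
    with _counter_lock:
        _counter = (_counter + 1) & 0xFFFFFF  # 3-byte wrapping counter
        count = _counter
    return (struct.pack(">I", inverted)
            + machine.to_bytes(3, "big")
            + pid.to_bytes(2, "big")
            + count.to_bytes(3, "big"))
```

With this layout the ID stays at 12 bytes (24 characters if hex-encoded), and only the time prefix is reversed. IDs created within the same second are ordered by machine, process and counter rather than strictly newest first.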
Indexing: I need to retrieve documents from the index based on queries
on fields. I found two approaches to indexing documents in Accumulo:
(1) Term-based reverse indexing, and
(2) Document-partitioned indexing,
as Adam described in this video:
https://www.youtube.com/watch?v=Ck70G6OuGT4. If I use document-partitioned
indexing, the layout is:
Row          <partition id>
               /        \
CF         <doc>      <index>
             |           |
CQ        <UUID>      <Term>
             |           |
          <field>     <UUID>
             |           |
             |        <Field>
Value     <value>
If I only want to serve documents based on single-term queries, would it
be better to store <term> in the column family, so that I can restrict a
scan to a single term via the CF? It should reduce the data scanned by a
good factor. What are the other pros and cons of this approach?
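
To make the two index layouts concrete, here is a rough sketch of the entries each would produce for one document. It uses plain tuples of (row, CF, CQ, value) rather than the Accumulo API. The function names, the NUL separator and the whitespace tokenizer are assumptions for illustration only.

```python
SEP = "\x00"  # assumed separator between the parts of a column qualifier


def doc_entries(partition_id, doc_id, fields):
    """Document entries: CF is 'doc', CQ is UUID + field, value is the field value."""
    return [(partition_id, "doc", doc_id + SEP + field, value)
            for field, value in fields.items()]


def index_entries_cq(partition_id, doc_id, fields):
    """Layout from the diagram: CF is 'index', CQ is term + UUID + field."""
    return [(partition_id, "index", term + SEP + doc_id + SEP + field, "")
            for field, value in fields.items()
            for term in value.lower().split()]


def index_entries_cf(partition_id, doc_id, fields):
    """Variant I am asking about: the term is the CF, CQ is UUID + field."""
    return [(partition_id, term, doc_id + SEP + field, "")
            for field, value in fields.items()
            for term in value.lower().split()]
```

For example, index_entries_cf("p0", "id1", {"text": "Hello Accumulo"}) yields one entry with CF "hello" and one with CF "accumulo". One con I can already see in this variant is that a term equal to "doc" would collide with the document column family unless terms get a prefix.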
And how should I decide on the partition id if I am storing tweets on a
3-node cluster?
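
My current guess is to hash the document ID into a fixed number of buckets and use the zero-padded bucket number as the row. The sketch below shows only that guess. The name partition_id and the default of 12 partitions (a few per node on 3 nodes) are assumptions, not recommendations I have seen anywhere.

```python
import zlib


def partition_id(doc_id: bytes, num_partitions: int = 12) -> str:
    """Map a document ID to a zero-padded bucket number used as the row."""
    bucket = zlib.crc32(doc_id) % num_partitions
    width = len(str(num_partitions - 1))  # pad so rows sort consistently
    return str(bucket).zfill(width)
```

The same document ID always lands in the same partition, so the document entries and its index entries stay in one row.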
Regards
Mohit Kaushik