Christopher,
What I understood from Medinets' explanation of reverse sorting is that
he first subtracts every character from 255 to produce a reverse index,
and then appends the original UUID to that string. When I checked the D4M
schema, it prints ??????? in front of the UUID, which I suppose are the
characters subtracted from 255, if I have not misunderstood.
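To make the byte-complement idea concrete, here is a minimal sketch. It assumes fixed-length keys (the inversion is only exact when all keys have the same length), and the name `reverse_key` is hypothetical, not taken from Medinets' code or from D4M.

```python
def reverse_key(key: bytes) -> bytes:
    """Complement every byte (255 - b) so that lexicographic order is inverted."""
    return bytes(255 - b for b in key)

# Equal-length, time-ordered ids (illustrative values only).
ids = [b"20150629", b"20150630", b"20150701"]

# Sorting the complemented keys puts the newest id first.
reversed_ids = sorted(reverse_key(i) for i in ids)

# Applying the same function again recovers the original key.
newest_first = [reverse_key(r) for r in reversed_ids]
```

The complemented key has the same length as the input, so the size only doubles if the original is appended as well. The complemented bytes of ASCII digits fall outside the printable range, which may be why a tool displays them as question marks.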
And 60 bytes or 120 bytes is not a problem in itself. I just don't want
to waste space for no benefit when it can be done in 12 or 13
bytes. And thanks, I looked at the MongoDriver code. I suspect that its
encoding may not suit lexicographical sorting.
Can you please provide some input on how to decide the partition id?
-Mohit Kaushik
On 07/01/2015 09:55 PM, Christopher wrote:
I'm not sure I understand why the size would be doubled.... if you
store it in reverse order, it's not going to take up more bytes. Are
you storing it forward *and* reverse? If so, why?
Also, forgive me for asking, but 60 bytes doesn't seem problematic to
me... that's going to be compressed on disk anyway. Why is 60 bytes
too large for your use case?
Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
a UUID that is in that same format?
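As a rough illustration of a 12-byte id that sorts newest first, the sketch below packs an inverted timestamp in front of the other components. The field widths (4-byte time, 3-byte machine, 2-byte process id, 3-byte counter) and the name `reverse_time_id` are assumptions made for this example, not the MongoDB driver's actual encoding.

```python
import struct

MAX_U32 = 0xFFFFFFFF

def reverse_time_id(epoch_seconds: int, machine: bytes, pid: int, counter: int) -> bytes:
    """Pack a 12-byte id: inverted time, machine, process id, counter."""
    if len(machine) != 3:
        raise ValueError("machine must be exactly 3 bytes")
    inverted = MAX_U32 - epoch_seconds  # a newer time gives a smaller value
    return (
        struct.pack(">I", inverted)     # big-endian: byte order matches numeric order
        + machine
        + struct.pack(">H", pid & 0xFFFF)
        + (counter & 0xFFFFFF).to_bytes(3, "big")
    )

older = reverse_time_id(1435708800, b"\x01\x02\x03", 4242, 1)
newer = reverse_time_id(1435795200, b"\x01\x02\x03", 4242, 2)
```

Here `newer` compares lower than `older` as raw bytes, so it would sort first. Only the timestamp is inverted in this sketch, so ids created within the same second still sort by machine, process id, and ascending counter.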
Regarding serving documents based on a single term query... it seems
to me that if that is your only requirement, then a row which looks
like "<term> <UUID>" would be more appropriate, since the best way to
support single-term query is to index on that term (UUID added only to
enable rows to split).
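A small sketch of how such "<term> <UUID>" rows behave under sorted order is given here. It simulates the sorted key space with a plain Python list rather than calling Accumulo, and the null-byte separator and the function names are assumptions for illustration.

```python
from bisect import bisect_left

SEP = b"\x00"  # assumed separator; the row above is only written as "<term> <UUID>"

def index_row(term: str, doc_id: bytes) -> bytes:
    """Build a term-index row key: term, separator, document id."""
    return term.encode("utf-8") + SEP + doc_id

def docs_for_term(sorted_rows, term):
    """Return the document ids of all rows whose key starts with term + separator."""
    prefix = term.encode("utf-8") + SEP
    start = bisect_left(sorted_rows, prefix)  # first row >= prefix
    hits = []
    for row in sorted_rows[start:]:
        if not row.startswith(prefix):
            break                             # left the contiguous range for this term
        hits.append(row[len(prefix):])
    return hits

rows = sorted([
    index_row("accumulo", b"id-003"),
    index_row("mongodb", b"id-001"),
    index_row("accumulo", b"id-001"),
    index_row("accumulodb", b"id-002"),
])
matches = docs_for_term(rows, "accumulo")
```

Because all rows for one term are contiguous, a single-term query reduces to one range scan, and the separator keeps "accumulo" from matching "accumulodb".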
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
On Wed, Jul 1, 2015 at 3:16 AM, mohit.kaushik <[email protected]> wrote:
Hi,
We have a social media application that currently uses MongoDB to serve
documents. We decided to move it to Accumulo. I am designing the schema
and indexing approach, but I am having some difficulty managing indexes and
have a few concerns about generating UUIDs in Accumulo.
UUID: The data is being indexed in MongoDB 24 hours a day. MongoDB generates
a 12-byte UUID that is sorted on the current time and works well in a
multi-user, multi-process environment (<time> <MAC address> <process id>
<client counter>), which is perfect. But if I concatenate the time, MAC
address, process id, and client counter, that comes to around 28 to 30
characters, which means around 60 bytes. And if I store it in reverse order
so that the latest document shows on top, the size would be doubled (more
than 120 bytes), as described by David Medinets. Is there any way to store
this UUID in a smaller size, or any other efficient way to generate a UUID
that is reverse sorted on the current time?
Indexing: I need to retrieve documents from the index based on queries on
fields. I found two approaches to indexing documents in Accumulo,
(1) Term-based reverse indexing and
(2) Document-partitioned indexing,
as Adam described in this video: https://www.youtube.com/watch?v=Ck70G6OuGT4
If I use document-partitioned indexing, the layout is:
Row             <partition id>
                /            \
CF          <doc>            <index>
              |                 |
CQ         <UUID>            <Term>
              |                 |
           <field>           <UUID>
              |                 |
              |              <Field>
Value      <value>
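To show what this layout produces for one document, here is a minimal sketch that emits (row, column family, column qualifier, value) tuples. Several things are assumptions for illustration only: the partition id is a CRC32 hash of the UUID modulo a hypothetical `num_partitions`, qualifier parts are joined with a null byte, terms are lower-cased whitespace tokens, and index entries carry an empty value.

```python
import zlib

def partition_id(doc_id: str, num_partitions: int) -> str:
    """Assumed scheme: hash the document id into a fixed number of partitions."""
    return "%04d" % (zlib.crc32(doc_id.encode("utf-8")) % num_partitions)

def entries_for(doc_id: str, fields: dict, num_partitions: int) -> list:
    """Build the doc and index entries for one document, all in the same row."""
    row = partition_id(doc_id, num_partitions)
    out = []
    for field, value in fields.items():
        # doc side: CF "doc", CQ <UUID>\0<field>, Value <value>
        out.append((row, "doc", doc_id + "\x00" + field, value))
        for term in value.lower().split():
            # index side: CF "index", CQ <term>\0<UUID>\0<field>, empty value
            out.append((row, "index", term + "\x00" + doc_id + "\x00" + field, ""))
    return out

entries = entries_for("id-001", {"text": "Hello Accumulo"}, 8)
```

Every entry for a document lands in one row, so the document and its index entries stay in the same partition. Hashing spreads documents evenly but gives up any time ordering across partitions.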
If I just want to serve documents based on a single-term query, would it be
better to store <term> in the column family so that I can restrict the scan
to a single term in the CF? It would reduce the data by a good factor. What
would be the other pros and cons of this approach?
And how should I decide on the partition id if I am storing tweets on a
3-node cluster?
Regards
Mohit Kaushik