Sorry, I'm not that familiar with the D4M schema. Regarding partitioning, I agree with Josh's response.
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Thu, Jul 2, 2015 at 8:34 AM, mohit.kaushik <[email protected]> wrote:
> Christopher,
>
> What I understood from Medinets' explanation of reverse sorting is that he
> first subtracts every character from 255 to make a reverse index, and
> also appends the original UUID to that string. When I checked the D4M
> schema, it prints ??????? in front of the UUID, which I suppose are the
> characters subtracted from 255, if I have not misunderstood.
>
> The 60 or 120 bytes are not a problem in themselves. I just don't want
> to waste space for no benefit when it can be done in 12 or 13
> bytes. And thanks, I looked at the MongoDriver code. I supposed that the
> encoding may not fit lexicographical sorting.
>
> Can you please provide some input on deciding the partition ID?
>
> -Mohit Kaushik
>
>
> On 07/01/2015 09:55 PM, Christopher wrote:
>
> I'm not sure I understand why the size would be doubled.... if you
> store it in reverse order, it's not going to take up more bytes. Are
> you storing it forward *and* reverse? If so, why?
>
> Also, forgive me for asking, but 60 bytes doesn't seem problematic to
> me... that's going to be compressed on disk anyway. Why is 60 bytes
> too large for your use case?
>
> Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
> a UUID that is in that same format?
>
> Regarding serving documents based on a single term query... it seems
> to me that if that is your only requirement, then a row which looks
> like "<term> <UUID>" would be more appropriate, since the best way to
> support single-term query is to index on that term (UUID added only to
> enable rows to split).
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Jul 1, 2015 at 3:16 AM, mohit.kaushik <[email protected]> wrote:
>
> Hi,
>
> We have a social media application currently using MongoDB to serve
> documents. We decided to shift it to Accumulo. I am designing the schema
> and indexing approach but am having some difficulties in managing indexes,
> and I have a few concerns about generating UUIDs in Accumulo.
>
> UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates a
> 12-byte UUID sorted on current time and good for a multi-user, multi-process
> environment (<time> <Mac add> <process id> <client counter>), which
> is perfect. But if I concatenate the time, MAC address, process ID, and
> client counter, these are around 28 to 30 characters, which means around
> 60 bytes. And if I store it in reverse order so that the latest document
> shows on top, the size would be doubled (more than 120 bytes), as described
> by David Medinets. Is there any way to store this UUID in a smaller size,
> or any other efficient way to generate a UUID reverse sorted on current
> time?
>
> Indexing: I need to retrieve documents from the index based on some query
> on fields. I found two approaches to index documents in Accumulo:
> (1) Term-based reverse indexing and
> (2) Document-partitioned indexing
>
> This is as Adam described in this video:
> https://www.youtube.com/watch?v=Ck70G6OuGT4
> Suppose I use document-partitioned indexing:
>
> Row          <partition id>
>                /        \
> CF          <doc>      <index>
>               |           |
> CQ         <UUID>      <Term>
>               |           |
>            <field>     <UUID>
>               |           |
>               |        <Field>
> Value      <value>
>
> If I just want to serve documents based on a single-term query, would it
> be better to store <term> in the column family so that I can limit the scan
> to a single term in the CF? It will reduce the data by a good factor. What
> other pros and cons does this approach have?
> And how should I decide on the partition ID if I am storing tweets on a
> 3-node cluster?
>
> Regards
> Mohit Kaushik
>
> --
> Mohit Kaushik
> Software Engineer
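As an illustration of the compact reverse-sorted ID discussed in the thread, here is a minimal sketch and not anyone's actual implementation. It assumes a 12-byte layout modeled on the <time> <Mac add> <process id> <client counter> format described above (4 + 3 + 2 + 3 bytes), and it complements only the timestamp field, so the ID stays at 12 bytes while newer IDs sort first under bytewise lexicographic comparison. The function names `reverse_time_id` and `original_timestamp` are hypothetical.

```python
import struct


def reverse_time_id(timestamp_s, machine, pid, counter):
    """Build a 12-byte ID whose bytewise sort order is newest-first.

    Assumed layout (not taken from any driver source):
    4-byte inverted timestamp, 3-byte machine id, 2-byte pid, 3-byte counter.
    """
    # Complement the 32-bit timestamp: a larger (newer) time yields a
    # smaller value, so it sorts earlier in lexicographic byte order.
    inverted = 0xFFFFFFFF - (timestamp_s & 0xFFFFFFFF)
    return (
        struct.pack(">I", inverted)                   # big-endian keeps numeric order
        + machine[:3].ljust(3, b"\0")                 # pad or truncate to 3 bytes
        + struct.pack(">H", pid & 0xFFFF)             # 2-byte process id
        + struct.pack(">I", counter & 0xFFFFFF)[1:]   # low 3 bytes of the counter
    )


def original_timestamp(id_bytes):
    """Recover the original timestamp (seconds) from the first 4 bytes."""
    (inverted,) = struct.unpack(">I", id_bytes[:4])
    return 0xFFFFFFFF - inverted


if __name__ == "__main__":
    older = reverse_time_id(1435000000, b"\x01\x02\x03", 4242, 7)
    newer = reverse_time_id(1435000060, b"\x01\x02\x03", 4242, 8)
    # The newer ID compares lower, so a sorted scan would return it first.
    print(len(older), newer < older)
```

Because the complement is applied to the binary timestamp rather than to a character string, nothing has to be appended to recover the original value. In this sketch, IDs generated within the same second still sort by machine, pid, and ascending counter; inverting the counter as well would be one way to make that order newest-first too.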
