Thanks Jim, do you mean the least significant bits of the timestamp?
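
If the LSBs are meant as a bucket prefix on the row, a minimal sketch of the key layout from the quoted message might look like the following. The class name, the bucket count of 16, the millisecond resolution, and the zero-padding widths are all assumptions for illustration; none of them come from Accumulo or from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sketches <timestamp-LSBs>-<string>-<reverse timestamp>.
final class SaltedRowKeys {
    // Assumed bucket count; a power of two so the low bits can be masked off.
    // One pre-split point per bucket would spread ingest across tablets.
    static final int BUCKETS = 16;

    // Low-order bits of the timestamp, used as a cheap hash.
    // Only roughly uniform if the timestamp resolution is fine enough.
    static int bucketOf(long timestampMillis) {
        return (int) (timestampMillis & (BUCKETS - 1));
    }

    // Reverse timestamp: newer events get smaller values and so sort first.
    // Assumes a non-negative timestamp.
    static long reverse(long timestampMillis) {
        return Long.MAX_VALUE - timestampMillis;
    }

    // Zero-padded so lexicographic row order matches numeric order.
    static String rowKey(String name, long timestampMillis) {
        return String.format(Locale.ROOT, "%02d-%s-%019d",
                bucketOf(timestampMillis), name, reverse(timestampMillis));
    }

    // The cost of salting: a scan for one <string> must visit every bucket.
    static List<String> scanPrefixes(String name) {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            prefixes.add(String.format(Locale.ROOT, "%02d-%s-", b, name));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(rowKey("foo1", now));
        System.out.println(scanPrefixes("foo1"));
    }
}
```

With this layout, consecutive timestamps land in different buckets, so ingest is spread over up to 16 tablets, while each lookup for a given string turns into 16 range scans, one per prefix.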

On Tue, Nov 27, 2012 at 4:45 PM, Jim Klucar <[email protected]> wrote:
> Roshan,
>
> Depending on what your cluster setup is and what the resolution of the
> time stamp is you could do something like this to spread the data around:
>
> <timestamp-LSBs>-<string>-<reverse timestamp>
>
> Using the LSBs of the timestamp as a uniform hash, then splitting on all
> possible hashes would spread things around a bit. If you do this, then all
> scans must check all hashes for data.
>
> On Tue, Nov 27, 2012 at 1:25 PM, Keith Turner <[email protected]> wrote:
>
>> On Tue, Nov 27, 2012 at 1:22 PM, Roshan Punnoose <[email protected]> wrote:
>>
>>> Thanks!
>>>
>>> The fact that you are using a binary tree behind the scenes makes
>>> perfect sense. Btw, what do you use in the standalone (non native)
>>> implementation? Does it use a TreeMap?
>>
>> When not using native code, ConcurrentSkipListMap is used.
>>
>>> On Tue, Nov 27, 2012 at 12:57 PM, Keith Turner <[email protected]> wrote:
>>>
>>>> On Tue, Nov 27, 2012 at 12:21 PM, Roshan Punnoose <[email protected]> wrote:
>>>>
>>>>> The <string> would most likely be a fixed set of strings that do not
>>>>> change over time.
>>>>>
>>>>> My question is if it is bad to use a reverse index timestamp in the
>>>>> row id? Will it cause problems with the tablet splitting, compaction, and
>>>>> performance if the data is always being sent to the top of the tablet? If I
>>>>> define a split as everything prefixed with <string>, then the ingest will
>>>>> go to one tablet, but then I add a reverse timestamp in the row, and that
>>>>> would mean I am always copying data to the top of the tablet. Will this
>>>>> cause performance issues? Or is it better to append to a tablet?
>>>>
>>>> I do not think it should matter. Inserts go into a C++ STL map on the
>>>> tablet server if using the nativemap. I think the implementation of that
>>>> is a balanced binary tree. So I do not think inserting at the beginning vs
>>>> the end would make a difference. That being said, I do not think I have
>>>> tried this so I do not know if there would be any surprises. I would be
>>>> interested in hearing about your experiences.
>>>>
>>>>> On Tue, Nov 27, 2012 at 11:51 AM, Keith Turner <[email protected]> wrote:
>>>>>
>>>>>> Keith
>>>>>>
>>>>>> On Tue, Nov 27, 2012 at 10:41 AM, Roshan Punnoose <[email protected]> wrote:
>>>>>>
>>>>>>> I want to have a table where the row will consist of
>>>>>>> "<string>-<reverse index timestamp>". But this means that the data is
>>>>>>> always being prefixed to the beginning of the row (or tablet if the row is
>>>>>>> large). Will this be a problem for compaction or performance?
>>>>>>
>>>>>> Can you tell me more about what <string> is? For example is it a
>>>>>> hash or does it come from the set "foo1","foo2","foo3". How does it
>>>>>> change over time? I think the answer to your question depends on what
>>>>>> <string> is.
>>>>>>
>>>>>>> I don't know if I heard this correctly, but someone once mentioned
>>>>>>> that making the row id the direct timestamp could cause performance issues
>>>>>>> because data is always going to one tablet, but also because there is
>>>>>>> trouble splitting since it always appends to the tablet. Is this true, is
>>>>>>> it similar to what could happen if I am always prefixing to a tablet?
>>>>>>
>>>>>> Yes using a timestamp for a row could cause data from many clients to
>>>>>> always go to the same tablet, which would be bad for performance on a
>>>>>> cluster.
>>>>>>
>>>>>>> Thanks!
>>>>>>> Roshan
