Sorry for the late reply. Yes, I meant the least significant bits (LSBs)
of the timestamp. They are probably fairly random, but perhaps not uniform
enough, depending on what is setting the timestamp. If you prefer, you can
send the row info through a hash function instead.
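A minimal sketch of the hash-function variant, in case it helps. The class name, the bucket count of 16, and the choice of CRC32 are all assumptions for illustration; any hash with a reasonably uniform spread would do.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashedBucket {
    // Assumed bucket count; in practice it would match the number of splits.
    static final int NUM_BUCKETS = 16;

    // Maps arbitrary row info to a bucket in [0, NUM_BUCKETS).
    static int bucket(String rowInfo) {
        CRC32 crc = new CRC32();
        crc.update(rowInfo.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative.
        return (int) (crc.getValue() % NUM_BUCKETS);
    }
}
```

The bucket number would then be used as the leading part of the row, in place of the timestamp LSBs.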
On Tue, Nov 27, 2012 at 5:53 PM, Roshan Punnoose rosh...@gmail.com wrote:
Thanks Jim, do you mean the least significant bits of the timestamp?
On Tue, Nov 27, 2012 at 4:45 PM, Jim Klucar klu...@gmail.com wrote:
Roshan,
Depending on what your cluster setup is and what the resolution of the
timestamp is, you could do something like this to spread the data around:
timestamp-LSBs-string-reverse timestamp
Using the LSBs of the timestamp as a roughly uniform hash, and then
splitting the table on all possible hash values, would spread the writes
around a bit. If you do this, then every scan must check every hash value
for data.
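Here is a rough sketch of that row layout, assuming 16 buckets (the low 4 bits of the timestamp), a zero-padded reverse timestamp computed as Long.MAX_VALUE minus the timestamp, and "-" as the separator. The class and method names are made up for illustration and are not part of Accumulo.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketedRow {
    // Assumed bucket count; must be a power of two for the bit mask below.
    static final int NUM_BUCKETS = 16;

    // Builds a row of the form <timestamp LSBs>-<string>-<reverse timestamp>.
    static String rowId(String prefix, long timestampMillis) {
        long bucket = timestampMillis & (NUM_BUCKETS - 1);
        // Zero-padded to 19 digits so that newer entries sort first as strings.
        long reverse = Long.MAX_VALUE - timestampMillis;
        return String.format("%02d-%s-%019d", bucket, prefix, reverse);
    }

    // A scan for one string has to visit every bucket; these are the
    // row prefixes it would need to cover.
    static List<String> scanPrefixes(String prefix) {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            prefixes.add(String.format("%02d-%s-", b, prefix));
        }
        return prefixes;
    }
}
```

The table would be split on the bucket values ("01", "02", and so on), and a reader would run one range per entry returned by scanPrefixes, for example with a batch scanner.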
On Tue, Nov 27, 2012 at 1:25 PM, Keith Turner ke...@deenlo.com wrote:
On Tue, Nov 27, 2012 at 1:22 PM, Roshan Punnoose rosh...@gmail.com wrote:
Thanks!
The fact that you are using a binary tree behind the scenes makes
perfect sense. Btw, what do you use in the standalone (non native)
implementation? Does it use a TreeMap?
When not using native code, ConcurrentSkipListMap is used.
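As a standalone illustration (this is not Accumulo's in-memory map code, and the class name and row format are assumptions), the snippet below inserts rows with a reverse timestamp into a ConcurrentSkipListMap. Every new row sorts before all existing rows, and the map simply keeps them in sorted order.

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class HeadInsertDemo {
    // Inserts n rows whose keys decrease as the timestamp increases,
    // so each insert lands at the head of the sorted map.
    static ConcurrentSkipListMap<String, String> fill(int n) {
        ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();
        for (long ts = 0; ts < n; ts++) {
            String row = String.format("foo-%019d", Long.MAX_VALUE - ts);
            map.put(row, "v" + ts);
        }
        return map;
    }
}
```

A skip list has expected logarithmic insert cost wherever the key falls, so inserting at the head is not a special case.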
On Tue, Nov 27, 2012 at 12:57 PM, Keith Turner ke...@deenlo.com wrote:
On Tue, Nov 27, 2012 at 12:21 PM, Roshan Punnoose rosh...@gmail.com wrote:
The string would most likely be a fixed set of strings that do not
change over time.
My question is whether it is bad to use a reverse index timestamp in the
row id. Will it cause problems with tablet splitting, compaction, and
performance if the data is always being sent to the top of the tablet?
If I define a split as everything prefixed with string, then the ingest
will go to one tablet. But if I then add a reverse timestamp to the row,
I am always writing data to the top of the tablet. Will this cause
performance issues? Or is it better to append to a tablet?
I do not think it should matter. Inserts go into a C++ STL map on the
tablet server if using the native map. I think the implementation of that
is a balanced binary tree, so I do not think inserting at the beginning
vs the end would make a difference. That being said, I do not think I have
tried this, so I do not know if there would be any surprises. I would be
interested in hearing about your experiences.
On Tue, Nov 27, 2012 at 11:51 AM, Keith Turner ke...@deenlo.com wrote:
Keith
On Tue, Nov 27, 2012 at 10:41 AM, Roshan Punnoose rosh...@gmail.com wrote:
I want to have a table where the row will consist of
string-reverse index timestamp. But this means that the data is
always being prefixed to the beginning of the row (or tablet if the
row is large). Will this be a problem for compaction or performance?
Can you tell me more about what string is? For example, is it a
hash, or does it come from the set foo1, foo2, foo3? How does it
change over time? I think the answer to your question depends on what
string is.
I don't know if I heard this correctly, but someone once mentioned
that making the row id the direct timestamp could cause performance
issues, both because data is always going to one tablet and because there
is trouble splitting since it always appends to the tablet. Is this true,
and is it similar to what could happen if I am always prefixing to a tablet?
Yes, using a timestamp for a row could cause data from many clients
to always go to the same tablet, which would be bad for performance on a
cluster.
Thanks!
Roshan