I thought about our issue again, and I wonder whether describeOwnership 
should take into account whether a token lies outside the partitioner's 
maximum token range.

To recap our problem: our tokens were spaced 12.5% of the token range 
(2**127) apart, but we had added an offset to each token, which pushed the 
cluster's token range above 2**127. As a result, two nodes received few or 
no primary replicas.
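
To put concrete numbers on it (taken from the nodetool ring output further 
down this thread): 2**127 is 170141183460469231731687303715884105728, while 
our last two tokens were 184381295469813378912913533284366947020 and 
205648943402372032879374446248852460236, i.e. roughly 1.08 and 1.21 times 
2**127, so both lay outside the RP's range.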

AFAIK, the partitioner itself describes the key ownership in the ring, but 
it doesn't take into account that we left its maximum key range.

Of course, it is a silly mistake and not one users are very likely to make. 
However, we made it, and it took me quite some time to figure out (maybe 
also because I wasn't the one who set up the cluster).

To carry it to the extreme: you could construct a cluster of n nodes with 
all tokens greater than 2**127. The ownership description would show an 
ownership of 1/n for each node, but all data would go to the node with the 
lowest token (given RP and RF=1).

I think it is wrong to calculate the ownership by subtracting the previous 
token from the current token and dividing by the maximum token without 
acknowledging that we might already be "out of bounds".
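
To illustrate what I mean, here is a rough sketch in Java. It is not the 
actual RandomPartitioner/describeOwnership code, and the 4-node token list 
is made up, but it shows how the naive gap/MAX calculation still reports 
1/n per node even when every token is out of range:

import java.math.BigInteger;
import java.util.*;

public class OwnershipSketch {
    static final BigInteger MAX = BigInteger.valueOf(2).pow(127);

    // Naive ownership: gap to the previous token divided by the full range,
    // wrapping around the ring. No bounds check on the tokens themselves.
    static Map<BigInteger, Double> naiveOwnership(List<BigInteger> sortedTokens) {
        Map<BigInteger, Double> owns = new LinkedHashMap<>();
        BigInteger prev = sortedTokens.get(sortedTokens.size() - 1);
        for (BigInteger t : sortedTokens) {
            BigInteger gap = t.subtract(prev).mod(MAX);
            owns.put(t, gap.doubleValue() / MAX.doubleValue());
            prev = t;
        }
        return owns;
    }

    public static void main(String[] args) {
        // Hypothetical 4-node cluster whose tokens were all shifted past 2**127.
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            tokens.add(MAX.add(MAX.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(4))));
        }
        // Prints ~25% for every node, although all tokens are out of range and
        // (with RP and RF=1) every key would be routed to the lowest token.
        naiveOwnership(tokens).forEach((t, o) ->
            System.out.printf("%s -> %.2f%%%n", t, o * 100));
        // A simple sanity check like this would have flagged our setup:
        for (BigInteger t : tokens)
            if (t.signum() < 0 || t.compareTo(MAX) >= 0)
                System.out.println("token outside [0, 2**127): " + t);
    }
}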

Cheers 
Marcel

On 20.01.2012, at 16:28, Marcel Steinbach wrote:

> Thanks for all the responses!
> 
> I found our problem:
> Using the Random Partitioner, the token range is 0..2**127. When we added 
> nodes, we generated the tokens and, out of convenience, added an offset to 
> them because that made the move easier.
> 
> However, we did not apply the modulo 2**127 to the last two tokens, so they 
> ended up outside the RP's token range. Moving the last two tokens to their 
> values mod 2**127 will resolve the problem.
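> 
> In case it helps anyone who runs into the same thing, this is the arithmetic 
> for computing the corrected token value (a quick Java/BigInteger sketch; the 
> token is one of our two out-of-range values from the nodetool ring output 
> further down the thread, and the actual move is then done with nodetool move):
> 
> import java.math.BigInteger;
> 
> public class FixToken {
>     public static void main(String[] args) {
>         BigInteger max = BigInteger.valueOf(2).pow(127);
>         // one of the two tokens that ended up beyond 2**127
>         BigInteger token = new BigInteger("205648943402372032879374446248852460236");
>         // wrap it back into the RP's range [0, 2**127);
>         // prints 35507759941902801147687142532968354508
>         System.out.println(token.mod(max));
>     }
> }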
> 
> Cheers,
> Marcel
> 
> On 20.01.2012, at 10:32, Marcel Steinbach wrote:
> 
>> On 19.01.2012, at 20:15, Narendra Sharma wrote:
>>> I believe you need to move the nodes on the ring. What was the load on the 
>>> nodes before you added the 5 new nodes? It's just that you are getting more 
>>> data in certain token ranges than in others.
>> With three nodes, it was also imbalanced. 
>> 
>> What I don't understand is why the MD5 sums would generate such massive hot 
>> spots.
>> 
>> Most of our keys look like this: 
>> 00013270494972450001234567
>> with the first 16 digits being a timestamp of one of our application 
>> server's startup times, and the last 10 digits being sequentially generated 
>> per user. 
>> 
>> There may be a lot of keys that start with e.g. "0001327049497245" (or some 
>> other timestamp), but I was under the impression that MD5 doesn't care and 
>> generates a uniform distribution anyway. Then again, I know next to nothing 
>> about MD5. Maybe someone else has better insight into the algorithm?
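>> 
>> A quick way to check would be something like this (my own sketch, not 
>> Cassandra code, and only an approximation of how the RP derives tokens): 
>> hash 100k keys shaped like ours, a fixed 16-digit timestamp prefix plus a 
>> sequential 10-digit suffix, and count how many land in each eighth of 
>> [0, 2**127). If MD5 spreads them uniformly, each bucket should hold roughly 
>> 12500 keys, so the key structure alone should not explain the hot spots:
>> 
>> import java.math.BigInteger;
>> import java.security.MessageDigest;
>> 
>> public class Md5Spread {
>>     public static void main(String[] args) throws Exception {
>>         BigInteger max = BigInteger.valueOf(2).pow(127);
>>         MessageDigest md5 = MessageDigest.getInstance("MD5");
>>         long[] buckets = new long[8];
>>         for (int i = 0; i < 100000; i++) {
>>             String key = "0001327049497245" + String.format("%010d", i);
>>             // unsigned 128-bit MD5 value, folded into the RP token range
>>             BigInteger token = new BigInteger(1, md5.digest(key.getBytes("UTF-8"))).mod(max);
>>             buckets[token.multiply(BigInteger.valueOf(8)).divide(max).intValue()]++;
>>         }
>>         for (long b : buckets) System.out.println(b);
>>     }
>> }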
>> 
>> However, we also use CFs with a date ("yyyymmdd") as key, as well as CFs 
>> with UUIDs as keys, and those CFs are not balanced either. E.g. node 5 has 
>> 12 GB of live space used in the CF with the UUID key, while node 8 has only 
>> 428 MB.
>> 
>> Cheers,
>> Marcel
>> 
>>> 
>>> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach 
>>> <marcel.steinb...@chors.de> wrote:
>>> On 18.01.2012, at 02:19, Maki Watanabe wrote:
>>>> Are there any significant difference of number of sstables on each nodes?
>>> No, no significant difference there. Actually, node 8 is among those with 
>>> more sstables, yet it has the least load (20 GB).
>>> 
>>> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
>>>> Are you deleting data or using TTLs?  Expired/deleted data won't go away 
>>>> until the sstable holding it is compacted.  So if compaction has happened 
>>>> on some nodes but not on others, you will see this.  The disparity is 
>>>> pretty big, 400 GB to 20 GB, so this probably isn't the issue, but with 
>>>> our data, which uses TTLs, running major compactions a couple of times on 
>>>> that column family can shrink it by ~30-40%.
>>> Yes, we do delete data. But I agree, the disparity is too big to blame only 
>>> the deletions. 
>>> 
>>> Also, we initially started out with 3 nodes and upgraded to 8 a few weeks 
>>> ago. After adding the nodes, we ran compactions and cleanups and still 
>>> didn't have a balanced cluster. That should have removed outdated data, 
>>> right?
>>> 
>>>> 2012/1/18 Marcel Steinbach <marcel.steinb...@chors.de>:
>>>>> We are running regular repairs, so I don't think that's the problem.
>>>>> And the data dir sizes match approx. the load from the nodetool.
>>>>> Thanks for the advice, though.
>>>>> 
>>>>> Our keys are digits only, and all contain a few zeros at the same
>>>>> offsets. I'm not that familiar with the MD5 algorithm, but I doubt that it
>>>>> would generate 'hotspots' for that kind of key, right?
>>>>> 
>>>>> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
>>>>> 
>>>>> Have you tried running repair first on each node? Also, verify using
>>>>> df -h on the data dirs
>>>>> 
>>>>> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
>>>>> <marcel.steinb...@chors.de> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> we're using RP and have assigned each node the same amount of the token
>>>>> space. The cluster looks like this:
>>>>> 
>>>>> 
>>>>> Address  Status  State   Load       Owns    Token
>>>>>                                             205648943402372032879374446248852460236
>>>>> 1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
>>>>> 2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
>>>>> 3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
>>>>> 4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
>>>>> 5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
>>>>> 6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
>>>>> 7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
>>>>> 8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236
>>>>> 
>>>>> 
>>>>> I was under the impression that the RP would distribute the load more evenly.
>>>>> 
>>>>> Our row sizes are 0.5-1 KB, so we don't store huge rows on a single
>>>>> node. Should we just move the nodes so that the load is distributed more
>>>>> evenly, or is there something off that needs to be fixed first?
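>>>>> 
>>>>> For reference, evenly spaced RP tokens for 8 nodes would simply be
>>>>> i * 2**127 / 8 for i = 0..7; a quick sketch of that calculation, just so
>>>>> it is explicit what "even" would mean here:
>>>>> 
>>>>> import java.math.BigInteger;
>>>>> 
>>>>> public class TokenGen {
>>>>>     public static void main(String[] args) {
>>>>>         BigInteger max = BigInteger.valueOf(2).pow(127);
>>>>>         int n = 8;
>>>>>         // print one evenly spaced token per node
>>>>>         for (int i = 0; i < n; i++)
>>>>>             System.out.println(max.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
>>>>>     }
>>>>> }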
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Marcel
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> w3m
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Narendra Sharma
>>> Software Engineer
>>> http://www.aeris.com
>>> http://narendrasharma.blogspot.com/
>>> 
>>> 
> 
