Setting a token outside of the partitioner's range sounds like a bug. It's mostly an issue with the RP, but I guess a custom partitioner may also want to validate that tokens are within its range.
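As an aside, a minimal sketch of the kind of bounds check being suggested here, in standalone Python for illustration only (not Cassandra's actual Java partitioner code); it assumes the 0..2**127 RandomPartitioner range discussed later in the thread:

    # Illustrative only: reject a token that falls outside the RandomPartitioner's
    # 0..2**127 range instead of silently accepting it and skewing ownership.
    RP_MAX_TOKEN = 2**127

    def validate_token(token: int) -> int:
        if not (0 <= token <= RP_MAX_TOKEN):
            raise ValueError(
                f"token {token} is outside the partitioner range [0, 2**127]"
            )
        return token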
Can you report it to https://issues.apache.org/jira/browse/CASSANDRA

Thanks

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 5:58 AM, Marcel Steinbach wrote:

> I thought about our issue again and was thinking: maybe describeOwnership
> should take into account whether a token is outside the partitioner's maximum
> token range?
>
> To recap our problem: we had tokens that were spaced 12.5% of the token
> range (2**127) apart; however, we had an offset on each token, which moved the
> cluster's token range above 2**127. That resulted in two nodes getting almost
> no or no primary replicas.
>
> AFAIK, the partitioner itself describes the key ownership in the ring, but it
> didn't take into account that we left its maximum key range.
>
> Of course, it is silly and not very likely that users make that mistake;
> however, we did, and it took me quite some time to figure that out (maybe
> also because it wasn't me who set up the cluster).
>
> To carry it to the extreme, you could construct a cluster of n nodes with
> all tokens greater than 2**127: the ownership description would show an
> ownership of 1/n each, but all data would go to the node with the lowest token
> (given RP and RF=1).
>
> I think it is wrong to calculate the ownership by subtracting the previous
> token from the current token and dividing it by the maximum token without
> acknowledging that we might already be "out of bounds".
>
> Cheers
> Marcel
>
> On 20.01.2012, at 16:28, Marcel Steinbach wrote:
>
>> Thanks for all the responses!
>>
>> I found our problem:
>> Using the Random Partitioner, the key range is from 0..2**127. When we added
>> nodes, we generated the tokens and, out of convenience, added an offset to
>> them because the move was easier that way.
>>
>> However, we did not apply the modulo 2**127 to the last two tokens, so
>> they were outside the RP's key range. Moving the last two tokens to their
>> values mod 2**127 will resolve the problem.
>>
>> Cheers,
>> Marcel
>>
>> On 20.01.2012, at 10:32, Marcel Steinbach wrote:
>>
>>> On 19.01.2012, at 20:15, Narendra Sharma wrote:
>>>> I believe you need to move the nodes on the ring. What was the load on the
>>>> nodes before you added 5 new nodes? It's just that you are getting more data
>>>> in certain token ranges than in others.
>>> With three nodes, it was also imbalanced.
>>>
>>> What I don't understand is why the md5 sums would generate such massive
>>> hot spots.
>>>
>>> Most of our keys look like this:
>>> 00013270494972450001234567
>>> with the first 16 digits being a timestamp of one of our application
>>> servers' startup times, and the last 10 digits being sequentially generated
>>> per user.
>>>
>>> There may be a lot of keys that start with e.g. "0001327049497245" (or
>>> some other timestamp). But I was under the impression that md5 doesn't
>>> care and generates a uniform distribution? Then again, I know next to
>>> nothing about md5. Maybe someone else has better insight into the algorithm?
>>>
>>> However, we also use cfs with a date ("yyyymmdd") as key, as well as cfs
>>> with uuids as keys. And those cfs are not balanced in themselves either. E.g.
>>> node 5 has 12 GB live space used in the cf with the uuid as key, and node 8
>>> only 428MB.
>>>
>>> Cheers,
>>> Marcel
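To make the token mistake concrete, a small Python sketch of generating evenly spaced RandomPartitioner tokens plus an offset, with and without the modulo step; the offset value is made up, chosen only so that, as described above, the last two of eight tokens land beyond 2**127 unless the modulo is applied:

    # Sketch of the token-generation mistake described above: evenly spaced
    # RandomPartitioner tokens plus an offset, with and without wrapping the
    # result back into the 0..2**127 range. The offset is hypothetical.
    RP_RANGE = 2**127
    NUM_NODES = 8
    OFFSET = 35_000_000_000_000_000_000_000_000_000_000_000_000  # made up, ~3.5e37

    def tokens(n=NUM_NODES, offset=OFFSET, apply_mod=True):
        toks = [i * RP_RANGE // n + offset for i in range(1, n + 1)]
        # The fix: wrap every token back into the partitioner's range.
        return [t % RP_RANGE for t in toks] if apply_mod else toks

    if __name__ == "__main__":
        for t in tokens(apply_mod=False):
            print(t, "<-- outside 0..2**127" if t > RP_RANGE else "")
        print("---")
        for t in sorted(tokens()):
            print(t)

Run without the modulo, the last two tokens print above 2**127; with it, all eight fall back inside the range, which is the fix described in the message above.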
>>>> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach
>>>> <marcel.steinb...@chors.de> wrote:
>>>> On 18.01.2012, at 02:19, Maki Watanabe wrote:
>>>>> Is there any significant difference in the number of sstables on each node?
>>>> No, no significant difference there. Actually, node 8 is among those with
>>>> more sstables but with the least load (20GB).
>>>>
>>>> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
>>>>> Are you deleting data or using TTLs? Expired/deleted data won't go away
>>>>> until the sstable holding it is compacted. So if compaction has happened
>>>>> on some nodes, but not on others, you will see this. The disparity is
>>>>> pretty big, 400GB to 20GB, so this probably isn't the issue, but with our
>>>>> data using TTLs, if I run major compactions a couple of times on that column
>>>>> family it can shrink ~30%-40%.
>>>> Yes, we do delete data. But I agree, the disparity is too big to blame
>>>> only on the deletions.
>>>>
>>>> Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks
>>>> ago. After adding the nodes, we did compactions and cleanups and didn't
>>>> have a balanced cluster. So that should have removed outdated data, right?
>>>>
>>>>> 2012/1/18 Marcel Steinbach <marcel.steinb...@chors.de>:
>>>>>> We are running regular repairs, so I don't think that's the problem.
>>>>>> And the data dir sizes match approx. the load from nodetool.
>>>>>> Thanks for the advice, though.
>>>>>>
>>>>>> Our keys are digits only, and all contain a few zeros at the same
>>>>>> offsets. I'm not that familiar with the md5 algorithm, but I doubt that
>>>>>> it would generate 'hotspots' for those kinds of keys, right?
>>>>>>
>>>>>> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
>>>>>>
>>>>>> Have you tried running repair first on each node? Also, verify using
>>>>>> df -h on the data dirs.
>>>>>>
>>>>>> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
>>>>>> <marcel.steinb...@chors.de> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we're using RP and have each node assigned the same amount of the token
>>>>>> space. The cluster looks like this:
>>>>>>
>>>>>> Address  Status  State   Load       Owns    Token
>>>>>>                                             205648943402372032879374446248852460236
>>>>>> 1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
>>>>>> 2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
>>>>>> 3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
>>>>>> 4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
>>>>>> 5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
>>>>>> 6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
>>>>>> 7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
>>>>>> 8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236
>>>>>>
>>>>>> I was under the impression that the RP would distribute the load more evenly.
>>>>>>
>>>>>> Our row sizes are 0.5-1 KB, hence we don't store huge rows on a single
>>>>>> node. Should we just move the nodes so that the load is distributed more
>>>>>> evenly, or is there something off that needs to be fixed first?
>>>>>>
>>>>>> Thanks
>>>>>> Marcel
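As a rough sanity check on the md5 question, here is a Python sketch that approximates how the RandomPartitioner turns keys into tokens (md5 digest taken as a non-negative integer; the real Java implementation differs in detail) and counts how many synthetic keys with the shared-timestamp-prefix format fall into each token range of the ring above; the keys themselves are made up:

    # Approximate RandomPartitioner token assignment and count how many
    # synthetic keys fall into each node's token range from the ring above.
    import hashlib

    RING_TOKENS = [
        56775407874461455114148055497453867724,
        78043055807020109080608968461939380940,
        99310703739578763047069881426424894156,
        120578351672137417013530794390910407372,
        141845999604696070979991707355395920588,
        163113647537254724946452620319881433804,
        184381295469813378912913533284366947020,
        205648943402372032879374446248852460236,
    ]

    def token(key: str) -> int:
        # Approximation of RP's hash: md5 digest as a big integer, folded
        # into the 0..2**127 range. Not the exact Java implementation.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % 2**127

    def owner(tok: int) -> int:
        # A node owns the range from the previous token (exclusive) up to
        # its own token (inclusive); anything beyond the highest token wraps
        # around to the node with the lowest token.
        for i, t in enumerate(RING_TOKENS):
            if tok <= t:
                return i
        return 0

    counts = [0] * len(RING_TOKENS)
    for seq in range(100_000):
        key = f"0001327049497245{seq:010d}"   # made-up timestamp prefix + sequence
        counts[owner(token(key))] += 1

    for i, c in enumerate(counts):
        print(f"node {i + 1}: {c} keys")

The md5-derived tokens come out roughly uniform over 0..2**127, so the shared key prefix is not the problem; but because the two highest ring tokens lie above 2**127, node 8 gets essentially nothing and node 1's range is oversized, matching the imbalance visible in the ring output.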
>>>>>
>>>>> --
>>>>> w3m
>>>>
>>>> --
>>>> Narendra Sharma
>>>> Software Engineer
>>>> http://www.aeris.com
>>>> http://narendrasharma.blogspot.com/