Setting a token outside of the partitioner's range sounds like a bug. It's mostly an issue with the RP, but I guess a custom partitioner may also want to validate that tokens are within its range.
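As an aside, a minimal sketch of the kind of bounds check being suggested here, in standalone Python for illustration only (not Cassandra's actual Java partitioner code); it assumes the 0..2**127 RandomPartitioner range discussed later in the thread:

    # Illustrative only: reject a token that falls outside the RandomPartitioner's
    # 0..2**127 range instead of silently accepting it and skewing ownership.
    RP_MAX_TOKEN = 2**127

    def validate_token(token: int) -> int:
        if not (0 <= token <= RP_MAX_TOKEN):
            raise ValueError(
                f"token {token} is outside the partitioner range [0, 2**127]"
            )
        return token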
Can you report it to https://issues.apache.org/jira/browse/CASSANDRA

Thanks

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 5:58 AM, Marcel Steinbach wrote:

> I thought about our issue again and was thinking: maybe describeOwnership
> should take into account whether a token is outside the partitioner's maximum
> token range?
>
> To recap our problem: we had tokens that were spaced 12.5% of the token
> range (2**127) apart; however, we had an offset on each token, which moved the
> cluster's token range above 2**127. That resulted in two nodes getting almost
> no or no primary replicas.
>
> AFAIK, the partitioner itself describes the key ownership in the ring, but it
> didn't take into account that we left its maximum key range.
>
> Of course, it is silly and not very likely that users make that mistake;
> however, we did, and it took me quite some time to figure that out (maybe
> also because it wasn't me who set up the cluster).
>
> To carry it to the extreme, you could construct a cluster of n nodes with
> all tokens greater than 2**127: the ownership description would show an
> ownership of 1/n each, but all data would go to the node with the lowest token
> (given RP and RF=1).
>
> I think it is wrong to calculate the ownership by subtracting the previous
> token from the current token and dividing it by the maximum token without
> acknowledging that we might already be "out of bounds".
>
> Cheers
> Marcel
>
> On 20.01.2012, at 16:28, Marcel Steinbach wrote:
>
>> Thanks for all the responses!
>>
>> I found our problem:
>> Using the Random Partitioner, the key range is from 0..2**127. When we added
>> nodes, we generated the tokens and, out of convenience, added an offset to
>> them because the move was easier that way.
>>
>> However, we did not apply the modulo 2**127 to the last two tokens, so
>> they were outside the RP's key range. Moving the last two tokens to their
>> values mod 2**127 will resolve the problem.
>>
>> Cheers,
>> Marcel
>>
>> On 20.01.2012, at 10:32, Marcel Steinbach wrote:
>>
>>> On 19.01.2012, at 20:15, Narendra Sharma wrote:
>>>> I believe you need to move the nodes on the ring. What was the load on the
>>>> nodes before you added 5 new nodes? It's just that you are getting more data
>>>> in certain token ranges than in others.
>>> With three nodes, it was also imbalanced.
>>>
>>> What I don't understand is why the md5 sums would generate such massive
>>> hot spots.
>>>
>>> Most of our keys look like this:
>>> 00013270494972450001234567
>>> with the first 16 digits being a timestamp of one of our application
>>> servers' startup times, and the last 10 digits being sequentially generated
>>> per user.
>>>
>>> There may be a lot of keys that start with e.g. "0001327049497245" (or
>>> some other timestamp). But I was under the impression that md5 doesn't
>>> care and generates a uniform distribution? Then again, I know next to
>>> nothing about md5. Maybe someone else has better insight into the algorithm?
>>>
>>> However, we also use cfs with a date ("yyyymmdd") as key, as well as cfs
>>> with uuids as keys. And those cfs are not balanced in themselves either. E.g.
>>> node 5 has 12 GB live space used in the cf with the uuid as key, and node 8
>>> only 428MB.
>>>
>>> Cheers,
>>> Marcel
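To make the token mistake concrete, a small Python sketch of generating evenly spaced RandomPartitioner tokens plus an offset, with and without the modulo step; the offset value is made up, chosen only so that, as described above, the last two of eight tokens land beyond 2**127 unless the modulo is applied:

    # Sketch of the token-generation mistake described above: evenly spaced
    # RandomPartitioner tokens plus an offset, with and without wrapping the
    # result back into the 0..2**127 range. The offset is hypothetical.
    RP_RANGE = 2**127
    NUM_NODES = 8
    OFFSET = 35_000_000_000_000_000_000_000_000_000_000_000_000  # made up, ~3.5e37

    def tokens(n=NUM_NODES, offset=OFFSET, apply_mod=True):
        toks = [i * RP_RANGE // n + offset for i in range(1, n + 1)]
        # The fix: wrap every token back into the partitioner's range.
        return [t % RP_RANGE for t in toks] if apply_mod else toks

    if __name__ == "__main__":
        for t in tokens(apply_mod=False):
            print(t, "<-- outside 0..2**127" if t > RP_RANGE else "")
        print("---")
        for t in sorted(tokens()):
            print(t)

Run without the modulo, the last two tokens print above 2**127; with it, all eight fall back inside the range, which is the fix described in the message above.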
>>>> On Thu, Jan 19, 2012 at 3:22 AM, Marcel Steinbach
>>>> <marcel.steinb...@chors.de> wrote:
>>>> On 18.01.2012, at 02:19, Maki Watanabe wrote:
>>>>> Is there any significant difference in the number of sstables on each node?
>>>> No, no significant difference there. Actually, node 8 is among those with
>>>> more sstables but with the least load (20GB).
>>>>
>>>> On 17.01.2012, at 20:14, Jeremiah Jordan wrote:
>>>>> Are you deleting data or using TTLs? Expired/deleted data won't go away
>>>>> until the sstable holding it is compacted. So if compaction has happened
>>>>> on some nodes, but not on others, you will see this. The disparity is
>>>>> pretty big, 400GB to 20GB, so this probably isn't the issue, but with our
>>>>> data using TTLs, if I run major compactions a couple of times on that column
>>>>> family it can shrink ~30%-40%.
>>>> Yes, we do delete data. But I agree, the disparity is too big to blame
>>>> only on the deletions.
>>>>
>>>> Also, initially, we started out with 3 nodes and upgraded to 8 a few weeks
>>>> ago. After adding the nodes, we did compactions and cleanups and didn't
>>>> have a balanced cluster. So that should have removed outdated data, right?
>>>>
>>>>> 2012/1/18 Marcel Steinbach <marcel.steinb...@chors.de>:
>>>>>> We are running regular repairs, so I don't think that's the problem.
>>>>>> And the data dir sizes match approx. the load from nodetool.
>>>>>> Thanks for the advice, though.
>>>>>>
>>>>>> Our keys are digits only, and all contain a few zeros at the same
>>>>>> offsets. I'm not that familiar with the md5 algorithm, but I doubt that
>>>>>> it would generate 'hotspots' for those kinds of keys, right?
>>>>>>
>>>>>> On 17.01.2012, at 17:34, Mohit Anchlia wrote:
>>>>>>
>>>>>> Have you tried running repair first on each node? Also, verify using
>>>>>> df -h on the data dirs.
>>>>>>
>>>>>> On Tue, Jan 17, 2012 at 7:34 AM, Marcel Steinbach
>>>>>> <marcel.steinb...@chors.de> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we're using RP and have each node assigned the same amount of the token
>>>>>> space. The cluster looks like this:
>>>>>>
>>>>>> Address  Status  State   Load       Owns    Token
>>>>>>                                             205648943402372032879374446248852460236
>>>>>> 1        Up      Normal  310.83 GB  12.50%  56775407874461455114148055497453867724
>>>>>> 2        Up      Normal  470.24 GB  12.50%  78043055807020109080608968461939380940
>>>>>> 3        Up      Normal  271.57 GB  12.50%  99310703739578763047069881426424894156
>>>>>> 4        Up      Normal  282.61 GB  12.50%  120578351672137417013530794390910407372
>>>>>> 5        Up      Normal  248.76 GB  12.50%  141845999604696070979991707355395920588
>>>>>> 6        Up      Normal  164.12 GB  12.50%  163113647537254724946452620319881433804
>>>>>> 7        Up      Normal  76.23 GB   12.50%  184381295469813378912913533284366947020
>>>>>> 8        Up      Normal  19.79 GB   12.50%  205648943402372032879374446248852460236
>>>>>>
>>>>>> I was under the impression that the RP would distribute the load more evenly.
>>>>>>
>>>>>> Our row sizes are 0.5-1 KB, hence we don't store huge rows on a single
>>>>>> node. Should we just move the nodes so that the load is distributed more
>>>>>> evenly, or is there something off that needs to be fixed first?
>>>>>>
>>>>>> Thanks
>>>>>> Marcel
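As a rough sanity check on the md5 question, here is a Python sketch that approximates how the RandomPartitioner turns keys into tokens (md5 digest taken as a non-negative integer; the real Java implementation differs in detail) and counts how many synthetic keys with the shared-timestamp-prefix format fall into each token range of the ring above; the keys themselves are made up:

    # Approximate RandomPartitioner token assignment and count how many
    # synthetic keys fall into each node's token range from the ring above.
    import hashlib

    RING_TOKENS = [
        56775407874461455114148055497453867724,
        78043055807020109080608968461939380940,
        99310703739578763047069881426424894156,
        120578351672137417013530794390910407372,
        141845999604696070979991707355395920588,
        163113647537254724946452620319881433804,
        184381295469813378912913533284366947020,
        205648943402372032879374446248852460236,
    ]

    def token(key: str) -> int:
        # Approximation of RP's hash: md5 digest as a big integer, folded
        # into the 0..2**127 range. Not the exact Java implementation.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % 2**127

    def owner(tok: int) -> int:
        # A node owns the range from the previous token (exclusive) up to
        # its own token (inclusive); anything beyond the highest token wraps
        # around to the node with the lowest token.
        for i, t in enumerate(RING_TOKENS):
            if tok <= t:
                return i
        return 0

    counts = [0] * len(RING_TOKENS)
    for seq in range(100_000):
        key = f"0001327049497245{seq:010d}"   # made-up timestamp prefix + sequence
        counts[owner(token(key))] += 1

    for i, c in enumerate(counts):
        print(f"node {i + 1}: {c} keys")

The md5-derived tokens come out roughly uniform over 0..2**127, so the shared key prefix is not the problem; but because the two highest ring tokens lie above 2**127, node 8 gets essentially nothing and node 1's range is oversized, matching the imbalance visible in the ring output.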
>>>>>
>>>>> --
>>>>> w3m
>>>>
>>>> --
>>>> Narendra Sharma
>>>> Software Engineer
>>>> http://www.aeris.com
>>>> http://narendrasharma.blogspot.com/