Re: Repair Issues

2019-10-26 Thread Ben Mills

Thanks Ghiyasi.

On Sat, Oct 26, 2019 at 9:17 AM Hossein Ghiyasi Mehr 
wrote:

> If the problem still exists and all nodes are up, reboot them one by one.
> Then try to repair one node. After that, repair the other nodes one by one.
>
> On Fri, Oct 25, 2019 at 12:56 AM Ben Mills  wrote:
>
>>
>> Thanks Jon!
>>
>> This is very helpful - allow me to follow-up and ask a question.
>>
>> (1) Yes, incremental repairs will never be used (unless it becomes viable
>> in Cassandra 4.x someday).
>> (2) I hear you on the JVM - will look into that.
>> (3) Been looking at Cassandra version 3.11.x though was unaware that 3.7
>> is considered non-viable for production use.
>>
>> For (4) - Question/Request:
>>
>> Note that with:
>>
>> -XX:MaxRAMFraction=2
>>
>> the actual amount of memory allocated for heap space is effectively 2Gi
>> (i.e. half of the 4Gi allocated on the machine type). We can definitely
>> increase memory (for heap and nonheap), though can you expand a bit on your
>> heap comment to help my understanding (as this is such a small cluster with
>> such a small amount of data at rest)?
>>
>> Thanks again.
>>
>> On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad  wrote:
>>
>>> There's some major warning signs for me with your environment.  4GB heap
>>> is too low, and Cassandra 3.7 isn't something I would put into production.
>>>
>>> Your surface area for problems is massive right now.  Things I'd do:
>>>
>>> 1. Never use incremental repair.  Seems like you've already stopped
>>> doing them, but it's worth mentioning.
>>> 2. Upgrade to the latest JVM, that version's way out of date.
>>> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
>>> 4. Increase memory to 8GB minimum, preferably 12.
>>>
>>> I usually don't like making a bunch of changes without knowing the root
>>> cause of a problem, but in your case there's so many potential problems I
>>> don't think it's worth digging into, especially since the problem might be
>>> one of the 500 or so bugs that were fixed since this release.
>>>
>>> Once you've done those things it'll be easier to narrow down the problem.
>>>
>>> Jon
>>>
>>>
>>> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:
>>>
>>>> Hi Sergio,
>>>>
>>>> No, not at this time.
>>>>
>>>> It was in use with this cluster previously, and while there were no
>>>> reaper-specific issues, it was removed to help simplify investigation of
>>>> the underlying repair issues I've described.
>>>>
>>>> Thanks.
>>>>
>>>> On Thu, Oct 24, 2019 at 4:21 PM Sergio 
>>>> wrote:
>>>>
>>>>> Are you using Cassandra reaper?
>>>>>
>>>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>>>> some advice on recommended next steps. Apologies in advance for a long
>>>>>> email.
>>>>>>
>>>>>> Issue:
>>>>>>
>>>>>> Intermittent repair failures on two non-system keyspaces.
>>>>>>
>>>>>> - platform_users
>>>>>> - platform_management
>>>>>>
>>>>>> Repair Type:
>>>>>>
>>>>>> Full, parallel repairs are run on each of the three nodes every five
>>>>>> days.
>>>>>>
>>>>>> Repair command output for a typical failure:
>>>>>>
>>>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>>>> keyspace platform_users with repair options (parallelism: parallel,
>>>>>> primary range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>>>>> dataCenters: [], hosts: [], # of ranges: 12)
>>>>>> [2019-10-18 00:22:09,242] Repair session
>>>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>>>> [(-1890954128429545684,2847510199483651721],
>>>>>> (8249813014782655320,-8746483007209345011],
>>>>>> (4299912178579297893,6811748355903297393],
>>>>>> (-8746483007209345011,-8628999431140554276],
>>>>>> (-5865769407232506956,-4746990901966533744],

Re: Repair Issues

2019-10-26 Thread Hossein Ghiyasi Mehr

If the problem still exists and all nodes are up, reboot them one by one.
Then try to repair one node. After that, repair the other nodes one by one.
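
A minimal sketch of the one-node-at-a-time sequence described above, in
Python. The node addresses are hypothetical, and it assumes nodetool can reach
each node's JMX port with -h; in a Kubernetes deployment the same commands
would more likely be run inside each pod. It only builds the command lines and
does not execute them.

```python
# Hypothetical node addresses; substitute the real ones for your cluster.
NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# The two keyspaces named in the original report.
KEYSPACES = ["platform_users", "platform_management"]


def repair_commands(nodes, keyspaces):
    """Return nodetool invocations that repair one node at a time.

    Each command is a full (non-incremental) repair of one keyspace on one
    node. The caller is expected to run them strictly in order and to stop
    at the first failure.
    """
    commands = []
    for node in nodes:
        for keyspace in keyspaces:
            commands.append(["nodetool", "-h", node, "repair", "-full", keyspace])
    return commands


if __name__ == "__main__":
    for cmd in repair_commands(NODES, KEYSPACES):
        print(" ".join(cmd))
```

Running the commands strictly in sequence keeps only one repair coordinator
active at a time, which makes it easier to tell which node a failure belongs
to.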

On Fri, Oct 25, 2019 at 12:56 AM Ben Mills  wrote:

>
> Thanks Jon!
>
> This is very helpful - allow me to follow-up and ask a question.
>
> (1) Yes, incremental repairs will never be used (unless it becomes viable
> in Cassandra 4.x someday).
> (2) I hear you on the JVM - will look into that.
> (3) Been looking at Cassandra version 3.11.x though was unaware that 3.7
> is considered non-viable for production use.
>
> For (4) - Question/Request:
>
> Note that with:
>
> -XX:MaxRAMFraction=2
>
> the actual amount of memory allocated for heap space is effectively 2Gi
> (i.e. half of the 4Gi allocated on the machine type). We can definitely
> increase memory (for heap and nonheap), though can you expand a bit on your
> heap comment to help my understanding (as this is such a small cluster with
> such a small amount of data at rest)?
>
> Thanks again.
>
> On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad  wrote:
>
>> There's some major warning signs for me with your environment.  4GB heap
>> is too low, and Cassandra 3.7 isn't something I would put into production.
>>
>> Your surface area for problems is massive right now.  Things I'd do:
>>
>> 1. Never use incremental repair.  Seems like you've already stopped doing
>> them, but it's worth mentioning.
>> 2. Upgrade to the latest JVM, that version's way out of date.
>> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
>> 4. Increase memory to 8GB minimum, preferably 12.
>>
>> I usually don't like making a bunch of changes without knowing the root
>> cause of a problem, but in your case there's so many potential problems I
>> don't think it's worth digging into, especially since the problem might be
>> one of the 500 or so bugs that were fixed since this release.
>>
>> Once you've done those things it'll be easier to narrow down the problem.
>>
>> Jon
>>
>>
>> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:
>>
>>> Hi Sergio,
>>>
>>> No, not at this time.
>>>
>>> It was in use with this cluster previously, and while there were no
>>> reaper-specific issues, it was removed to help simplify investigation of
>>> the underlying repair issues I've described.
>>>
>>> Thanks.
>>>
>>> On Thu, Oct 24, 2019 at 4:21 PM Sergio 
>>> wrote:
>>>
>>>> Are you using Cassandra reaper?
>>>>
>>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>>> some advice on recommended next steps. Apologies in advance for a long
>>>>> email.
>>>>>
>>>>> Issue:
>>>>>
>>>>> Intermittent repair failures on two non-system keyspaces.
>>>>>
>>>>> - platform_users
>>>>> - platform_management
>>>>>
>>>>> Repair Type:
>>>>>
>>>>> Full, parallel repairs are run on each of the three nodes every five
>>>>> days.
>>>>>
>>>>> Repair command output for a typical failure:
>>>>>
>>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>>> keyspace platform_users with repair options (parallelism: parallel,
>>>>> primary range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>>>> dataCenters: [], hosts: [], # of ranges: 12)
>>>>> [2019-10-18 00:22:09,242] Repair session
>>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>>> [(-1890954128429545684,2847510199483651721],
>>>>> (8249813014782655320,-8746483007209345011],
>>>>> (4299912178579297893,6811748355903297393],
>>>>> (-8746483007209345011,-8628999431140554276],
>>>>> (-5865769407232506956,-4746990901966533744],
>>>>> (-4470950459111056725,-1890954128429545684],
>>>>> (4001531392883953257,4299912178579297893],
>>>>> (6811748355903297393,6878104809564599690],
>>>>> (6878104809564599690,8249813014782655320],
>>>>> (-4746990901966533744,-4470950459111056725],
>>>>> (-8628999431140554276,-5865769407232506956],
>>>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>>>>

Re: Repair Issues

2019-10-24 Thread Ben Mills

Thanks Jon!

This is very helpful - allow me to follow-up and ask a question.

(1) Yes, incremental repairs will never be used (unless they become viable
in Cassandra 4.x someday).
(2) I hear you on the JVM - will look into that.
(3) Been looking at Cassandra version 3.11.x though was unaware that 3.7 is
considered non-viable for production use.

For (4) - Question/Request:

Note that with:

-XX:MaxRAMFraction=2

the actual amount of memory allocated for heap space is effectively 2Gi
(i.e. half of the 4Gi allocated on the machine type). We can definitely
increase memory (for heap and nonheap), though can you expand a bit on your
heap comment to help my understanding (as this is such a small cluster with
such a small amount of data at rest)?
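
The arithmetic behind that figure can be sketched in Python as below. This
assumes no explicit -Xmx is set and that the JVM sees the 4Gi container limit
as its available memory; the function name is hypothetical and the result is
an approximation of the JVM's default sizing, not its exact ergonomics.

```python
GIB = 1024 ** 3


def default_max_heap_bytes(memory_limit_bytes, max_ram_fraction):
    """Approximate the default max heap as memory limit / MaxRAMFraction."""
    return memory_limit_bytes // max_ram_fraction


if __name__ == "__main__":
    # Compare the current 4Gi limit with the 8GB and 12GB suggested in the thread.
    for limit_gib in (4, 8, 12):
        heap = default_max_heap_bytes(limit_gib * GIB, 2)
        print(f"{limit_gib}Gi limit -> {heap / GIB:.0f}Gi heap")
```

With the fraction left at 2, raising the memory limit to 8Gi or 12Gi would
give roughly 4Gi or 6Gi of heap, with the remainder left for off-heap use.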

Thanks again.

On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad  wrote:

> There's some major warning signs for me with your environment.  4GB heap
> is too low, and Cassandra 3.7 isn't something I would put into production.
>
> Your surface area for problems is massive right now.  Things I'd do:
>
> 1. Never use incremental repair.  Seems like you've already stopped doing
> them, but it's worth mentioning.
> 2. Upgrade to the latest JVM, that version's way out of date.
> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
> 4. Increase memory to 8GB minimum, preferably 12.
>
> I usually don't like making a bunch of changes without knowing the root
> cause of a problem, but in your case there's so many potential problems I
> don't think it's worth digging into, especially since the problem might be
> one of the 500 or so bugs that were fixed since this release.
>
> Once you've done those things it'll be easier to narrow down the problem.
>
> Jon
>
>
> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:
>
>> Hi Sergio,
>>
>> No, not at this time.
>>
>> It was in use with this cluster previously, and while there were no
>> reaper-specific issues, it was removed to help simplify investigation of
>> the underlying repair issues I've described.
>>
>> Thanks.
>>
>> On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:
>>
>>> Are you using Cassandra reaper?
>>>
>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>>
>>>> Greetings,
>>>>
>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>> some advice on recommended next steps. Apologies in advance for a long
>>>> email.
>>>>
>>>> Issue:
>>>>
>>>> Intermittent repair failures on two non-system keyspaces.
>>>>
>>>> - platform_users
>>>> - platform_management
>>>>
>>>> Repair Type:
>>>>
>>>> Full, parallel repairs are run on each of the three nodes every five
>>>> days.
>>>>
>>>> Repair command output for a typical failure:
>>>>
>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>> keyspace platform_users with repair options (parallelism: parallel, primary
>>>> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>>> dataCenters: [], hosts: [], # of ranges: 12)
>>>> [2019-10-18 00:22:09,242] Repair session
>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>> [(-1890954128429545684,2847510199483651721],
>>>> (8249813014782655320,-8746483007209345011],
>>>> (4299912178579297893,6811748355903297393],
>>>> (-8746483007209345011,-8628999431140554276],
>>>> (-5865769407232506956,-4746990901966533744],
>>>> (-4470950459111056725,-1890954128429545684],
>>>> (4001531392883953257,4299912178579297893],
>>>> (6811748355903297393,6878104809564599690],
>>>> (6878104809564599690,8249813014782655320],
>>>> (-4746990901966533744,-4470950459111056725],
>>>> (-8628999431140554276,-5865769407232506956],
>>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>>>> [(-1890954128429545684,2847510199483651721],
>>>> (8249813014782655320,-8746483007209345011],
>>>> (4299912178579297893,6811748355903297393],
>>>> (-8746483007209345011,-8628999431140554276],
>>>> (-5865769407232506956,-4746990901966533744],
>>>> (-4470950459111056725,-1890954128429545684],
>>>> (4001531392883953257,4299912178579297893],
>>>> (6811748355903297393,6878104809564599690],
>>>> (6878104809564599690,8249813014782655320],
>>>> (-4746990901966533744,-44

Re: Repair Issues

2019-10-24 Thread Ben Mills

Hi Reid,

Many thanks - I have seen that article though will definitely give it
another read.

Note that nodetool scrub has been tried (no effect), and sstablescrub cannot
currently be run with the Cassandra image in use. A new image that allows the
server to be stopped while keeping the operating environment available for
this utility can certainly be built; I just haven't done so yet. Note also
that none of the logs indicate that a corrupt data file (or files) is in play
here. I mention this because the article includes a solution where a specific
data file is manually deleted and repairs are then run to restore the file
from a different node in the cluster. Also, the way persistent volumes are
mounted onto [Kubernetes] nodes prevents this solution (manual deletion of an
offending data file) from being viable, because the PV mount on the node's
filesystem is detached when the pods are down. This is a subtlety of running
Cassandra in Kubernetes.

On Thu, Oct 24, 2019 at 4:24 PM Reid Pinchback 
wrote:

> Ben, you may find this helpful:
>
> https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/
>
> *From: *Ben Mills 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, October 24, 2019 at 3:31 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Repair Issues
>
>
>
> Greetings,
>
> Inherited a small Cassandra cluster with some repair issues and need some
> advice on recommended next steps. Apologies in advance for a long email.
>
> Issue:
>
> Intermittent repair failures on two non-system keyspaces.
>
> - platform_users
> - platform_management
>
> Repair Type:
>
> Full, parallel repairs are run on each of the three nodes every five days.
>
> Repair command output for a typical failure:
>
> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
> platform_users with repair options (parallelism: parallel, primary range:
> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
> [], hosts: [], # of ranges: 12)
> [2019-10-18 00:22:09,242] Repair session
> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]] failed with error [repair
> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
> (progress: 26%)
> [2019-10-18 00:22:09,246] Some repair failed
> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>
> Additional Notes:
>
> Repairs encounter above failures more often than not. Sometimes on one
> node only, though occasionally on two. Sometimes just one of the two
> keyspaces, sometimes both. Apparently the previous repair schedule for
> this cluster included incremental repairs (script alternated between
> incremental and full repairs). After reading this TLP article:
>
>
> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>
> the repair script was replaced with cassandra-reaper (v1.4.0), which was
> run with its default configs. Reaper was fine but only obscured the ongoing
> issues (it did not resolve them) and complicated the debugging process and
> so was then removed. The current repair schedule is as described above
> under Repair Type.
>
> Attempts at Resolution:
>
> (1) nodetool

Re: Repair Issues

2019-10-24 Thread Jon Haddad

There are some major warning signs for me with your environment.  A 4GB heap
is too low, and Cassandra 3.7 isn't something I would put into production.

Your surface area for problems is massive right now.  Things I'd do:

1. Never use incremental repair.  Seems like you've already stopped doing
them, but it's worth mentioning.
2. Upgrade to the latest JVM; that version's way out of date.
3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
4. Increase memory to 8GB minimum, preferably 12.

I usually don't like making a bunch of changes without knowing the root
cause of a problem, but in your case there are so many potential problems I
don't think it's worth digging into, especially since the problem might be
one of the 500 or so bugs that were fixed since this release.

Once you've done those things it'll be easier to narrow down the problem.

Jon


On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:

> Hi Sergio,
>
> No, not at this time.
>
> It was in use with this cluster previously, and while there were no
> reaper-specific issues, it was removed to help simplify investigation of
> the underlying repair issues I've described.
>
> Thanks.
>
> On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:
>
>> Are you using Cassandra reaper?
>>
>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>
>>> Greetings,
>>>
>>> Inherited a small Cassandra cluster with some repair issues and need
>>> some advice on recommended next steps. Apologies in advance for a long
>>> email.
>>>
>>> Issue:
>>>
>>> Intermittent repair failures on two non-system keyspaces.
>>>
>>> - platform_users
>>> - platform_management
>>>
>>> Repair Type:
>>>
>>> Full, parallel repairs are run on each of the three nodes every five
>>> days.
>>>
>>> Repair command output for a typical failure:
>>>
>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>> keyspace platform_users with repair options (parallelism: parallel, primary
>>> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>> dataCenters: [], hosts: [], # of ranges: 12)
>>> [2019-10-18 00:22:09,242] Repair session
>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>> [(-1890954128429545684,2847510199483651721],
>>> (8249813014782655320,-8746483007209345011],
>>> (4299912178579297893,6811748355903297393],
>>> (-8746483007209345011,-8628999431140554276],
>>> (-5865769407232506956,-4746990901966533744],
>>> (-4470950459111056725,-1890954128429545684],
>>> (4001531392883953257,4299912178579297893],
>>> (6811748355903297393,6878104809564599690],
>>> (6878104809564599690,8249813014782655320],
>>> (-4746990901966533744,-4470950459111056725],
>>> (-8628999431140554276,-5865769407232506956],
>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>>> [(-1890954128429545684,2847510199483651721],
>>> (8249813014782655320,-8746483007209345011],
>>> (4299912178579297893,6811748355903297393],
>>> (-8746483007209345011,-8628999431140554276],
>>> (-5865769407232506956,-4746990901966533744],
>>> (-4470950459111056725,-1890954128429545684],
>>> (4001531392883953257,4299912178579297893],
>>> (6811748355903297393,6878104809564599690],
>>> (6878104809564599690,8249813014782655320],
>>> (-4746990901966533744,-4470950459111056725],
>>> (-8628999431140554276,-5865769407232506956],
>>> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
>>> (progress: 26%)
>>> [2019-10-18 00:22:09,246] Some repair failed
>>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>>
>>> Additional Notes:
>>>
>>> Repairs encounter above failures more often than not. Sometimes on one
>>> node only, though occasionally on two. Sometimes just one of the two
>>> keyspaces, sometimes both. Apparently the previous repair schedule for
>>> this cluster included incremental repairs (script alternated between
>>> incremental and full repairs). After reading this TLP article:
>>>
>>>
>>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>>
>>> the repair script was replaced with cassandra-reaper (v1.4.0), which was
>>> run with its default configs. Reaper was fine but only obscured the ongoing
>>> issues (it did not resolve them) and complicated the debugging process

Re: Repair Issues

2019-10-24 Thread Ben Mills

Hi Sergio,

No, not at this time.

It was in use with this cluster previously, and while there were no
reaper-specific issues, it was removed to help simplify investigation of
the underlying repair issues I've described.

Thanks.

On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:

> Are you using Cassandra reaper?
>
> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>
>> Greetings,
>>
>> Inherited a small Cassandra cluster with some repair issues and need some
>> advice on recommended next steps. Apologies in advance for a long email.
>>
>> Issue:
>>
>> Intermittent repair failures on two non-system keyspaces.
>>
>> - platform_users
>> - platform_management
>>
>> Repair Type:
>>
>> Full, parallel repairs are run on each of the three nodes every five days.
>>
>> Repair command output for a typical failure:
>>
>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
>> platform_users with repair options (parallelism: parallel, primary range:
>> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
>> [], hosts: [], # of ranges: 12)
>> [2019-10-18 00:22:09,242] Repair session
>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]] failed with error [repair
>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>> [(-1890954128429545684,2847510199483651721],
>> (8249813014782655320,-8746483007209345011],
>> (4299912178579297893,6811748355903297393],
>> (-8746483007209345011,-8628999431140554276],
>> (-5865769407232506956,-4746990901966533744],
>> (-4470950459111056725,-1890954128429545684],
>> (4001531392883953257,4299912178579297893],
>> (6811748355903297393,6878104809564599690],
>> (6878104809564599690,8249813014782655320],
>> (-4746990901966533744,-4470950459111056725],
>> (-8628999431140554276,-5865769407232506956],
>> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
>> (progress: 26%)
>> [2019-10-18 00:22:09,246] Some repair failed
>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>
>> Additional Notes:
>>
>> Repairs encounter above failures more often than not. Sometimes on one
>> node only, though occasionally on two. Sometimes just one of the two
>> keyspaces, sometimes both. Apparently the previous repair schedule for
>> this cluster included incremental repairs (script alternated between
>> incremental and full repairs). After reading this TLP article:
>>
>>
>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>
>> the repair script was replaced with cassandra-reaper (v1.4.0), which was
>> run with its default configs. Reaper was fine but only obscured the ongoing
>> issues (it did not resolve them) and complicated the debugging process and
>> so was then removed. The current repair schedule is as described above
>> under Repair Type.
>>
>> Attempts at Resolution:
>>
>> (1) nodetool scrub was attempted on the offending keyspaces/tables to no
>> effect.
>>
>> (2) sstablescrub has not been attempted due to the current design of the
>> Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
>> way to stop the server to run this utility without killing the only pid
>> running in the container.
>>
>> Related Error:
>>
>> Not sure if this is related, though sometimes, when either:
>>
>> (a) Running nodetool snapshot, or
>> (b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
>> prior to shutdown,
>>
>> the following error is thrown:
>>
>> -- StackTrace --
>> java.lang.RuntimeException: Last written key
>> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
>> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
>> DecoratedKey(----,
>> 17343121887f480c9ba87c0e32206b74) writing into
>> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693

Re: Repair Issues

2019-10-24 Thread Reid Pinchback

Ben, you may find this helpful:

https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/

From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, October 24, 2019 at 3:31 PM
To: "user@cassandra.apache.org" 
Subject: Repair Issues

Greetings,

Inherited a small Cassandra cluster with some repair issues and need some 
advice on recommended next steps. Apologies in advance for a long email.

Issue:

Intermittent repair failures on two non-system keyspaces.

- platform_users
- platform_management

Repair Type:

Full, parallel repairs are run on each of the three nodes every five days.

Repair command output for a typical failure:

[2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace 
platform_users with repair options (parallelism: parallel, primary range: 
false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], 
hosts: [], # of ranges: 12)
[2019-10-18 00:22:09,242] Repair session 5282be70-f13d-11e9-9b4e-7f6db768ba9a 
for range [(-1890954128429545684,2847510199483651721], 
(8249813014782655320,-8746483007209345011], 
(4299912178579297893,6811748355903297393], 
(-8746483007209345011,-8628999431140554276], 
(-5865769407232506956,-4746990901966533744], 
(-4470950459111056725,-1890954128429545684], 
(4001531392883953257,4299912178579297893], 
(6811748355903297393,6878104809564599690], 
(6878104809564599690,8249813014782655320], 
(-4746990901966533744,-4470950459111056725], 
(-8628999431140554276,-5865769407232506956], 
(2847510199483651721,4001531392883953257]] failed with error [repair 
#5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2, 
[(-1890954128429545684,2847510199483651721], 
(8249813014782655320,-8746483007209345011], 
(4299912178579297893,6811748355903297393], 
(-8746483007209345011,-8628999431140554276], 
(-5865769407232506956,-4746990901966533744], 
(-4470950459111056725,-1890954128429545684], 
(4001531392883953257,4299912178579297893], 
(6811748355903297393,6878104809564599690], 
(6878104809564599690,8249813014782655320], 
(-4746990901966533744,-4470950459111056725], 
(-8628999431140554276,-5865769407232506956], 
(2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x 
(progress: 26%)
[2019-10-18 00:22:09,246] Some repair failed
[2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
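
For tracking which table and host fail across runs, output like the above can
be summarized with a small Python sketch. The regular expressions are
assumptions based only on the output quoted here; other Cassandra versions may
word these messages differently. The sample text is an abbreviated copy of
that output with most token ranges omitted.

```python
import re

# Patterns derived from the repair output quoted in this thread (assumption:
# the wording is stable for this Cassandra version).
FAILED_TABLE = re.compile(
    r"repair\s+#(?P<session>[0-9a-f-]+)\s+on\s+(?P<keyspace>\w+)/(?P<table>\w+)"
)
FAILED_HOST = re.compile(r"Validation failed in\s+/(?P<host>\S+)")


def summarize_failure(output):
    """Return session, keyspace, table and host for a failed repair, or None."""
    # Mail clients and terminals re-wrap long lines; collapse whitespace first.
    flat = " ".join(output.split())
    table = FAILED_TABLE.search(flat)
    host = FAILED_HOST.search(flat)
    if table is None or host is None:
        return None
    return {**table.groupdict(), "host": host.group("host")}


# Abbreviated sample: only one of the twelve token ranges is kept.
SAMPLE = """\
[2019-10-18 00:22:09,242] Repair session
5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
[(2847510199483651721,4001531392883953257]] failed with error [repair
#5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
[(2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
(progress: 26%)
[2019-10-18 00:22:09,246] Some repair failed
"""

if __name__ == "__main__":
    print(summarize_failure(SAMPLE))
```

Collecting these summaries over several repair cycles would show whether the
failures cluster on one table or one node, which the thread describes as
varying from run to run.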

Additional Notes:

Repairs encounter above failures more often than not. Sometimes on one node 
only, though occasionally on two. Sometimes just one of the two keyspaces, 
sometimes both. Apparently the previous repair schedule for this cluster 
included incremental repairs (script alternated between incremental and full 
repairs). After reading this TLP article:

https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html

the repair script was replaced with cassandra-reaper (v1.4.0), which was run 
with its default configs. Reaper was fine but only obscured the ongoing issues 
(it did not resolve them) and complicated the debugging process and so was then 
removed. The current repair schedule is as described above under Repair Type.

Attempts at Resolution:

(1) nodetool scrub was attempted on the offending keyspaces/tables to no effect.

(2) sstablescrub has not been attempted due to the current design of the Docker 
image that runs Cassandra in each Kubernetes pod - i.e. there is no way to stop 
the server to run this utility without killing the only pid running in the 
container.

Related Error:

Not sure if this is related, though sometimes, when either:

(a) Running nodetool snapshot, or
(b) Rolling a pod that runs a Cassandra node, which calls nodetool drain prior
to shutdown,

the following error is thrown:

-- StackTrace --
java.lang.RuntimeException: Last written key 
DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda, 
10df3ba16eb24c8ebdddc0c7af586bda) >= current key 
DecoratedKey(----, 
17343121887f480c9ba87c0e32206b74) writing into 
/cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
at 
org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at 
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
at 
org.apache.cassandra.db.Memtable$FlushRu

Re: Repair Issues

2019-10-24 Thread Sergio

Are you using Cassandra reaper?

On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:

> Greetings,
>
> Inherited a small Cassandra cluster with some repair issues and need some
> advice on recommended next steps. Apologies in advance for a long email.
>
> Issue:
>
> Intermittent repair failures on two non-system keyspaces.
>
> - platform_users
> - platform_management
>
> Repair Type:
>
> Full, parallel repairs are run on each of the three nodes every five days.
>
> Repair command output for a typical failure:
>
> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
> platform_users with repair options (parallelism: parallel, primary range:
> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
> [], hosts: [], # of ranges: 12)
> [2019-10-18 00:22:09,242] Repair session
> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]] failed with error [repair
> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
> (progress: 26%)
> [2019-10-18 00:22:09,246] Some repair failed
> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>
> Additional Notes:
>
> Repairs encounter the above failures more often than not. Sometimes on one
> node only, though occasionally on two. Sometimes just one of the two
> keyspaces, sometimes both. Apparently the previous repair schedule for
> this cluster included incremental repairs (script alternated between
> incremental and full repairs). After reading this TLP article:
>
>
> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>
> the repair script was replaced with cassandra-reaper (v1.4.0), which was
> run with its default configs. Reaper was fine but only obscured the ongoing
> issues (it did not resolve them) and complicated the debugging process and
> so was then removed. The current repair schedule is as described above
> under Repair Type.
>
> Attempts at Resolution:
>
> (1) nodetool scrub was attempted on the offending keyspaces/tables to no
> effect.
>
> (2) sstablescrub has not been attempted due to the current design of the
> Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
> way to stop the server to run this utility without killing the only pid
> running in the container.
>
> Related Error:
>
> Not sure if this is related, though sometimes, when either:
>
> (a) Running nodetool snapshot, or
> (b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
> prior to shutdown,
>
> the following error is thrown:
>
> -- StackTrace --
> java.lang.RuntimeException: Last written key
> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
> DecoratedKey(----,
> 17343121887f480c9ba87c0e32206b74) writing into
> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
> at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
> at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
> at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
> at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
> at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
> at org.apache.

Repair Issues

2019-10-24 Thread Ben Mills

Greetings,

Inherited a small Cassandra cluster with some repair issues and need some
advice on recommended next steps. Apologies in advance for a long email.

Issue:

Intermittent repair failures on two non-system keyspaces.

- platform_users
- platform_management

Repair Type:

Full, parallel repairs are run on each of the three nodes every five days.
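
For reference, a rough sketch of how such a run might be invoked. The actual repair script is not shown in this thread, so the function name and its parameters are hypothetical; it only assembles the command line. It assumes the Cassandra 3.x nodetool flags -full (non-incremental), -seq (sequential instead of the default parallel) and -pr (primary range only), which line up with the "incremental: false", "parallelism: parallel" and "primary range: false" options echoed in the repair output.

```python
def build_repair_argv(keyspace, full=True, sequential=False, primary_range=False):
    """Assemble a nodetool repair command line (hypothetical helper, not the real script)."""
    argv = ["nodetool", "repair"]
    if full:
        # Without -full, Cassandra 3.x defaults to incremental repair.
        argv.append("-full")
    if sequential:
        # Parallel is the default in 3.x; -seq switches to sequential.
        argv.append("-seq")
    if primary_range:
        # -pr limits the repair to the node's primary token ranges.
        argv.append("-pr")
    argv.append(keyspace)
    return argv


if __name__ == "__main__":
    for ks in ("platform_users", "platform_management"):
        print(" ".join(build_repair_argv(ks)))
```

The resulting list could be handed to subprocess.run on each node in turn; scheduling (every five days) is left to whatever runs the script.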

Repair command output for a typical failure:

[2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
platform_users with repair options (parallelism: parallel, primary range:
false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
[], hosts: [], # of ranges: 12)
[2019-10-18 00:22:09,242] Repair session
5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
[(-1890954128429545684,2847510199483651721],
(8249813014782655320,-8746483007209345011],
(4299912178579297893,6811748355903297393],
(-8746483007209345011,-8628999431140554276],
(-5865769407232506956,-4746990901966533744],
(-4470950459111056725,-1890954128429545684],
(4001531392883953257,4299912178579297893],
(6811748355903297393,6878104809564599690],
(6878104809564599690,8249813014782655320],
(-4746990901966533744,-4470950459111056725],
(-8628999431140554276,-5865769407232506956],
(2847510199483651721,4001531392883953257]] failed with error [repair
#5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
[(-1890954128429545684,2847510199483651721],
(8249813014782655320,-8746483007209345011],
(4299912178579297893,6811748355903297393],
(-8746483007209345011,-8628999431140554276],
(-5865769407232506956,-4746990901966533744],
(-4470950459111056725,-1890954128429545684],
(4001531392883953257,4299912178579297893],
(6811748355903297393,6878104809564599690],
(6878104809564599690,8249813014782655320],
(-4746990901966533744,-4470950459111056725],
(-8628999431140554276,-5865769407232506956],
(2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
(progress: 26%)
[2019-10-18 00:22:09,246] Some repair failed
[2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
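
Since the output above is long and wraps badly, here is a small sketch for pulling out the parts that matter when comparing failures across runs: the session id, the keyspace/table, the endpoint that failed validation, and the number of distinct ranges. The function name and the regular expressions are my own and are fitted only to the output quoted above; other Cassandra versions may word these lines differently.

```python
import re

# Patterns fitted to the output quoted above; treat them as assumptions.
SESSION_RE = re.compile(r"Repair session\s+([0-9a-f-]{36})")
TABLE_RE = re.compile(r"repair\s+#[0-9a-f-]{36}\s+on\s+(\w+)/(\w+)")
ENDPOINT_RE = re.compile(r"Validation failed in\s+/(\S+)")
RANGE_RE = re.compile(r"\((-?\d+),(-?\d+)\]")


def summarize_repair_failure(output):
    """Summarize one failed repair session from (possibly line-wrapped) nodetool output."""
    # Collapse whitespace so that wrapped lines are matched as one string.
    text = " ".join(output.split())
    session = SESSION_RE.search(text)
    table = TABLE_RE.search(text)
    endpoint = ENDPOINT_RE.search(text)
    # The range list is printed twice (session header and error detail), so dedupe.
    ranges = set(RANGE_RE.findall(text))
    return {
        "session": session.group(1) if session else None,
        "keyspace": table.group(1) if table else None,
        "table": table.group(2) if table else None,
        "failed_endpoint": endpoint.group(1) if endpoint else None,
        "range_count": len(ranges),
    }
```

Feeding it the saved output of each run makes it easier to see whether the same table or the same endpoint keeps failing.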

Additional Notes:

Repairs encounter the above failures more often than not. Sometimes on one node
only, though occasionally on two. Sometimes just one of the two keyspaces,
sometimes both. Apparently the previous repair schedule for this cluster
included incremental repairs (script alternated between incremental and
full repairs). After reading this TLP article:

https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html

the repair script was replaced with cassandra-reaper (v1.4.0), which was
run with its default configs. Reaper was fine but only obscured the ongoing
issues (it did not resolve them) and complicated the debugging process and
so was then removed. The current repair schedule is as described above
under Repair Type.

Attempts at Resolution:

(1) nodetool scrub was attempted on the offending keyspaces/tables to no
effect.

(2) sstablescrub has not been attempted due to the current design of the
Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
way to stop the server to run this utility without killing the only pid
running in the container.

Related Error:

Not sure if this is related, though sometimes, when either:

(a) Running nodetool snapshot, or
(b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
prior to shutdown,

the following error is thrown:

-- StackTrace --
java.lang.RuntimeException: Last written key
DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
10df3ba16eb24c8ebdddc0c7af586bda) >= current key
DecoratedKey(----,
17343121887f480c9ba87c0e32206b74) writing into
/cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:363)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
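
My reading of the message in this trace is that the writer expects keys to arrive in strictly ascending order during a memtable flush and refuses a key that does not sort after the last one written. The toy sketch below only illustrates that kind of check; it is not Cassandra's code, the class name is made up, and plain Python byte strings stand in for DecoratedKey (whose real ordering depends on the partitioner).

```python
class SortedAppendWriter:
    """Toy stand-in for a writer that requires strictly increasing keys."""

    def __init__(self):
        self.last_key = None
        self.rows = []

    def append(self, key, value):
        # Reject any key that does not sort strictly after the last one written,
        # mirroring the shape of the error message in the trace.
        if self.last_key is not None and self.last_key >= key:
            raise RuntimeError(
                f"Last written key {self.last_key!r} >= current key {key!r}"
            )
        self.rows.append((key, value))
        self.last_key = key


if __name__ == "__main__":
    writer = SortedAppendWriter()
    writer.append(b"\x10", "first")
    writer.append(b"\x17", "second")
    try:
        writer.append(b"\x10", "out of order")
    except RuntimeError as exc:
        print("rejected:", exc)
```

If that reading is right, the error points at keys being handed to the flush out of order for the index table named in the path, rather than at the repair itself.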

Here are some details on the environment and configs in the event that
something is relevant.

Environment: Kubernetes
Environment Config: Stateful set of 3 replicas
Storage: Persistent Volumes
Storage Class: SSD
Node OS: Container-Optimized OS
Container OS: Ubu
