Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-21 Thread Samuel Just
Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
non-blocking implementation) to that branch.  Care to take it for a whirl?
-Sam

On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <n...@fisk.me.uk> wrote:

> Building now
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Samuel Just
> *Sent:* 09 February 2017 19:22
> *To:* Nick Fisk <n...@fisk.me.uk>
> *Cc:* ceph-users@lists.ceph.com
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep
> <http://xo4t.mj.am/lnk/AEMAHoCuLsEAAFklQUAAADNJBWwAAACRXwBYnMSxtQOC2XS3Szqb2auk-ebWnwAAlBI/1/h2R0YQ3R0gPNSCn9TMuW8Q/aHR0cHM6Ly9naXRodWIuY29tL2F0aGFuYXRvcy9jZXBoL3RyZWUvd2lwLXNuYXAtdHJpbS1zbGVlcA>
> (based on master) passed a rados suite.  It adds a configurable limit to
> the number of pgs which can be trimming on any OSD (default: 2).  PGs
> trimming will be in snaptrim state, PGs waiting to trim will be in
> snaptrim_wait state.  I suspect this'll be adequate to throttle the amount
> of trimming.  If not, I can try to add an explicit limit to the rate at
> which the work items trickle into the queue.  Can someone test this branch?
>   Tester beware: this has not merged into master yet and should only be run
> on a disposable cluster.
>
> -Sam
>
>
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> Yeah it’s probably just the fact that they have more PG’s so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
>
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IO’s queued up on the disk, so
> that the scheduler can re-arrange them to action them in the most efficient
> manor as the disk head moves across the platters. You might possibly see up
> to a 20% hit on performance, in exchange for more consistent client
> latency.
>
>
>
> *From:* Steve Taylor [mailto:steve.tay...@storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* n...@fisk.me.uk; ceph-users@lists.ceph.com
>
>
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Thanks, Nick.
>
>
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that maps updates are generally not
> blocking long enough to show up as blocked requests.
>
>
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I’ll test some more when the PG counts per OSD are more balanced and see
> what I get. I’ll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
>
>
> --
>
>
> <http://xo4t.mj.am/lnk/AEMAHoCuLsEAAFklQUAAADNJBWwAAACRXwBYnMSxtQOC2XS3Szqb2auk-ebWnwAAlBI/2/2T4Xj-_wncGT6Y6LyBEKdw/aHR0cDovL3hvNHQubWouYW0vbG5rL0FEc0FBR1ZFeFk0QUFBQUFBQUFBQUV0ckRjc0FBRE5KQld3QUFBQUFBQUNSWHdCWW1qa3VqWE1Hc2Z2MFFJMklrZHpkTVBIYk93QUFsQkkvMS9HclNQRjU2RnY2VXVUc1JUejFUbnJRL2FIUjBjSE02THk5emRHOXlZV2RsWTNKaFpuUXVZMjl0>
>
> *Steve* *Taylor* | Senior Software Engineer | StorageCraft Technology
> Corporation
> <http://xo4t.mj.am/lnk/AEMAHoCuLsEAAFklQUAAADNJBWwAAACRXwBYnMSxtQOC2XS3Szqb2auk-ebWnwAAlBI/3/TFexBfD-LHnCcPturKjU5Q/aHR0cDovL3hvNHQubWouYW0vbG5rL0FEc0FBR1ZFeFk0QUFBQUFBQUFBQUV0ckRjc0FBRE5KQld3QUFBQUFBQUNSWHdCWW1qa3VqWE1Hc2Z2MFFJMklrZHpkTVBIYk93QUFsQkkvMi9IbGVSZWkzWVdEZGljbUN1RG9XeXRBL2FIUjBjSE02THk5emRHOXlZV2RsWTNKaFpuUXVZMjl0>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office: *801.871.2799 <(801)%20871-2799> |
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> --
>
> *From:* Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor <steve.tay...@storagecraft.com>;
> ceph-users@lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG 

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-09 Thread Samuel Just
Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
master) passed a rados suite.  It adds a configurable limit to the number
of pgs which can be trimming on any OSD (default: 2).  PGs trimming will be
in snaptrim state, PGs waiting to trim will be in snaptrim_wait state.  I
suspect this'll be adequate to throttle the amount of trimming.  If not, I
can try to add an explicit limit to the rate at which the work items
trickle into the queue.  Can someone test this branch?   Tester beware:
this has not merged into master yet and should only be run on a disposable
cluster.
-Sam

On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk  wrote:

> Yeah it’s probably just the fact that they have more PG’s so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
>
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IO’s queued up on the disk, so
> that the scheduler can re-arrange them to action them in the most efficient
> manor as the disk head moves across the platters. You might possibly see up
> to a 20% hit on performance, in exchange for more consistent client
> latency.
>
>
>
> *From:* Steve Taylor [mailto:steve.tay...@storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* n...@fisk.me.uk; ceph-users@lists.ceph.com
>
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Thanks, Nick.
>
>
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that maps updates are generally not
> blocking long enough to show up as blocked requests.
>
>
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I’ll test some more when the PG counts per OSD are more balanced and see
> what I get. I’ll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
>
>
> --
>
>
> 
>
> *Steve* *Taylor* | Senior Software Engineer | StorageCraft Technology
> Corporation
> 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office: *801.871.2799 <(801)%20871-2799> |
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> --
>
> *From:* Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor ;
> ceph-users@lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Hi Steve,
>
>
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, ie the scheduler and disk layer in Linux. Once
> the IO’s leave Ceph it’s a bit of a free for all and the client IO’s tend
> to get lost in large disk queues surrounded by all the snap trim IO’s.
>
>
>
> The workaround Sam is working on will limit the amount of snap trims that
> are allowed to run, which I believe will have a similar effect to the sleep
> parameters in pre-jewel clusters, but without pausing the whole IO thread.
>
>
>
> Ultimately the solution requires Ceph to be able to control the queuing of
> IO’s at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don’t know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don’t get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter, probably around a queue of 5-10 would be
> about right for spinning disks. It’s a trade off of peak throughput vs
> queue latency though.
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Steve Taylor
> *Sent:* 07 February 2017 17:01
> *To:* 

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-03 Thread Samuel Just
Ok, I'm still working on a branch for master that will introduce limiter on
how many pgs can be trimming per osd at once.  It should backport trivially
to kraken, but jewel will require more work once we've got it in master.
Would you be willing to test the master version to determine whether it's
adequate?
-Sam

On Fri, Feb 3, 2017 at 10:54 AM, David Turner <david.tur...@storagecraft.com
> wrote:

> We found where it is in 10.2.5.  It is implemented in the OSD.h file in
> Jewel, but it is implemented in OSD.cc in Master.  We assumed it would be
> in the same place.
>
> We delete over 100TB of snapshots spread across thousands of snapshots
> every day.  We haven't yet found any combination of settings that allow us
> to delete snapshots in Jewel without blocking requests in a test cluster
> with a fraction of that workload.  We went as far as setting
> osd_snap_trim_cost to 512MB with default osd_snap_trim_priority (before we
> noticed the priority setting) and set osd_snap_trim_cost to 4MB (the size
> of our objects) with default_osd_snap_trim_priority set to 1.  We stopped
> testing there as we thought we found that these weren't implemented in
> Jewel.  We will continue our testing and provide an update when we have it.
>
> Our current solution in Hammer involves a daemon monitoring the cluster
> load and setting the osd_snap_trim_sleep accordingly between 0 and 0.35
> which does a good job of preventing IO blocking and help us to clear out
> the snap_trim_q each day.  These settings not being injectable in Jewel
> would negate an option of using variable settings throughout the day.
>
> --
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
> <(385)%20224-2943>
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* Samuel Just [sj...@redhat.com]
> *Sent:* Friday, February 03, 2017 11:24 AM
> *To:* David Turner
> *Cc:* Nick Fisk; ceph-users
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
> They do seem to exist in Jewel.
> -Sam
>
> On Fri, Feb 3, 2017 at 10:12 AM, David Turner <
> david.tur...@storagecraft.com> wrote:
>
>> After searching the code, osd_snap_trim_cost and osd_snap_trim_priority
>> exist in Master but not in Jewel or Kraken.  If osd_snap_trim_sleep was
>> made useless in Jewel by moving snap trimming to the main op thread and no
>> new feature was added to Jewel to allow clusters to throttle snap
>> trimming... What recourse do people that use a lot of snapshots to use
>> Jewel?  Luckily this thread came around right before we were ready to push
>> to production and we tested snap trimming heavily in QA and found that we
>> can't even deal with half of our snap trimming on Jewel that we would need
>> to.  All of these settings are also not injectable into the osd daemon so
>> it would take a full restart of the all of the osds to change their
>> settings...
>>
>> Does anyone have any success stories for snap trimming on Jewel?
>>
>> --
>>
>> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
>> StorageCraft
>> Technology Corporation <https://storagecraft.com>
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
>> <(385)%20224-2943>
>>
>> ------
>>
>> If you are not the intended recipient of this message or received it
>> erroneously, please notify the sender and delete it, together with any
>> attachments, and be advised that any dissemination or copying of this
>> message is prohibited.
>>
>> --
>>
>> --
>> *From:* Samuel Just [sj...@redhat.com]
>> *Sent:* Thursday, January 26, 2017 1:14 PM
>> *To:* Nick Fisk
>> *Cc:* David Turner; ceph-users
>>
>> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
>> sleep?
>>
>> Just an update.  I think the real goal with the sleep configs in general
>> was to reduce the number of concurrent snap trims ha

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-03 Thread Samuel Just
They do seem to exist in Jewel.
-Sam

On Fri, Feb 3, 2017 at 10:12 AM, David Turner <david.tur...@storagecraft.com
> wrote:

> After searching the code, osd_snap_trim_cost and osd_snap_trim_priority
> exist in Master but not in Jewel or Kraken.  If osd_snap_trim_sleep was
> made useless in Jewel by moving snap trimming to the main op thread and no
> new feature was added to Jewel to allow clusters to throttle snap
> trimming... What recourse do people that use a lot of snapshots to use
> Jewel?  Luckily this thread came around right before we were ready to push
> to production and we tested snap trimming heavily in QA and found that we
> can't even deal with half of our snap trimming on Jewel that we would need
> to.  All of these settings are also not injectable into the osd daemon so
> it would take a full restart of the all of the osds to change their
> settings...
>
> Does anyone have any success stories for snap trimming on Jewel?
>
> --
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
> <(385)%20224-2943>
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* Samuel Just [sj...@redhat.com]
> *Sent:* Thursday, January 26, 2017 1:14 PM
> *To:* Nick Fisk
> *Cc:* David Turner; ceph-users
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
> Just an update.  I think the real goal with the sleep configs in general
> was to reduce the number of concurrent snap trims happening.  To that end,
> I've put together a branch which adds an AsyncReserver (as with backfill)
> for snap trims to each OSD.  Before actually starting to do trim work, the
> primary will wait in line to get one of the slots and will hold that slot
> until the repops are complete.  https://github.com/athanatos/
> ceph/tree/wip-snap-trim-sleep is the branch (based on master), but I've
> got a bit more work to do (and testing to do) before it's ready to be
> tested.
> -Sam
>
> On Fri, Jan 20, 2017 at 2:05 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> Hi Sam,
>>
>>
>>
>> I have a test cluster, albeit small. I’m happy to run tests + graph
>> results with a wip branch and work out reasonable settings…etc
>>
>>
>>
>> *From:* Samuel Just [mailto:sj...@redhat.com]
>> *Sent:* 19 January 2017 23:23
>> *To:* David Turner <david.tur...@storagecraft.com>
>>
>> *Cc:* Nick Fisk <n...@fisk.me.uk>; ceph-users <ceph-users@lists.ceph.com>
>> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
>> sleep?
>>
>>
>>
>> I could probably put together a wip branch if you have a test cluster you
>> could try it out on.
>>
>> -Sam
>>
>>
>>
>> On Thu, Jan 19, 2017 at 2:27 PM, David Turner <
>> david.tur...@storagecraft.com> wrote:
>>
>> To be clear, we are willing to change to a snap_trim_sleep of 0 and try
>> to manage it with the other available settings... but it is sounding like
>> that won't really work for us since our main op thread(s) will just be
>> saturated with snap trimming almost all day.  We currently only have ~6
>> hours/day where our snap trim q's are empty.
>> --
>>
>>
>> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/1/HrVd_B9XmvZKT7jWcF0ftA/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
>>
>> *David* *Turner* | Cloud Operations Engineer | StorageCraft Technology
>> Corporation
>> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/2/oS1JCcJ92DHqKD4InYaflQ/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> *Office: *801.871.2760 <(801)%20871-2760> | *Mobile: *385.224.2943
>> <(385)%20224-2943>
>> ------
>>
>> If you are not the intended recipient of this message or received it
>> erroneously, please notify the sender and delete it, together with any
>> attachments, and be advised that any dissemination or copying of this
>> message is prohibited.

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-26 Thread Samuel Just
Just an update.  I think the real goal with the sleep configs in general
was to reduce the number of concurrent snap trims happening.  To that end,
I've put together a branch which adds an AsyncReserver (as with backfill)
for snap trims to each OSD.  Before actually starting to do trim work, the
primary will wait in line to get one of the slots and will hold that slot
until the repops are complete.
https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep is the branch
(based on master), but I've got a bit more work to do (and testing to do)
before it's ready to be tested.
-Sam

On Fri, Jan 20, 2017 at 2:05 PM, Nick Fisk <n...@fisk.me.uk> wrote:

> Hi Sam,
>
>
>
> I have a test cluster, albeit small. I’m happy to run tests + graph
> results with a wip branch and work out reasonable settings…etc
>
>
>
> *From:* Samuel Just [mailto:sj...@redhat.com]
> *Sent:* 19 January 2017 23:23
> *To:* David Turner <david.tur...@storagecraft.com>
>
> *Cc:* Nick Fisk <n...@fisk.me.uk>; ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> I could probably put together a wip branch if you have a test cluster you
> could try it out on.
>
> -Sam
>
>
>
> On Thu, Jan 19, 2017 at 2:27 PM, David Turner <
> david.tur...@storagecraft.com> wrote:
>
> To be clear, we are willing to change to a snap_trim_sleep of 0 and try to
> manage it with the other available settings... but it is sounding like that
> won't really work for us since our main op thread(s) will just be saturated
> with snap trimming almost all day.  We currently only have ~6 hours/day
> where our snap trim q's are empty.
> --
>
>
> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/1/HrVd_B9XmvZKT7jWcF0ftA/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
>
> *David* *Turner* | Cloud Operations Engineer | StorageCraft Technology
> Corporation
> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/2/oS1JCcJ92DHqKD4InYaflQ/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office: *801.871.2760 <(801)%20871-2760> | *Mobile: *385.224.2943
> <(385)%20224-2943>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> --
> --
>
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of David
> Turner [david.tur...@storagecraft.com]
> *Sent:* Thursday, January 19, 2017 3:25 PM
> *To:* Samuel Just; Nick Fisk
>
>
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> We are a couple of weeks away from upgrading to Jewel in our production
> clusters (after months of testing in our QA environments), but this might
> prevent us from making the migration from Hammer.   We delete ~8,000
> snapshots/day between 3 clusters and our snap_trim_q gets up to about 60
> Million in each of those clusters.  We have to use an osd_snap_trim_sleep
> of 0.25 to prevent our clusters from falling on their faces during our big
> load and 0.1 the rest of the day to catch up on the snap trim q.
>
> Is our setup possible to use on Jewel?
> --
>
>
> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/3/2T-RUO4LiBzpOfKC3TJuaw/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
>
> *David* *Turner* | Cloud Operations Engineer | StorageCraft Technology
> Corporation
> <http://xo4t.mj.am/lnk/AEQAHbHmxC0AAFklQUAAADNJBWwAAACRXwBYgomZ7d_x14_XQr65vOmTmCx8lwAAlBI/4/YROlWfddSDN41u1mZn3-jA/aHR0cHM6Ly9zdG9yYWdlY3JhZnQuY29t>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office: *801.871.2760 <(801)%20871-2760> | *Mobile: *385.224.2943
> <(385)%20224-2943>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> --
>
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Samuel
> Just [sj...@redhat.com]
> Sent: Thursday, January 19, 2017 2:45 PM
> To: Nick Fisk
> Cc: ceph-users
> Subject: Re: [ceph-users] o

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-19 Thread Samuel Just
I could probably put together a wip branch if you have a test cluster you
could try it out on.
-Sam

On Thu, Jan 19, 2017 at 2:27 PM, David Turner <david.tur...@storagecraft.com
> wrote:

> To be clear, we are willing to change to a snap_trim_sleep of 0 and try to
> manage it with the other available settings... but it is sounding like that
> won't really work for us since our main op thread(s) will just be saturated
> with snap trimming almost all day.  We currently only have ~6 hours/day
> where our snap trim q's are empty.
>
> --
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
> <(385)%20224-2943>
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of David
> Turner [david.tur...@storagecraft.com]
> *Sent:* Thursday, January 19, 2017 3:25 PM
> *To:* Samuel Just; Nick Fisk
>
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
> We are a couple of weeks away from upgrading to Jewel in our production
> clusters (after months of testing in our QA environments), but this might
> prevent us from making the migration from Hammer.   We delete ~8,000
> snapshots/day between 3 clusters and our snap_trim_q gets up to about 60
> Million in each of those clusters.  We have to use an osd_snap_trim_sleep
> of 0.25 to prevent our clusters from falling on their faces during our big
> load and 0.1 the rest of the day to catch up on the snap trim q.
>
> Is our setup possible to use on Jewel?
>
> --
>
> <https://storagecraft.com> David Turner | Cloud Operations Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 <(801)%20871-2760> | Mobile: 385.224.2943
> <(385)%20224-2943>
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Samuel
> Just [sj...@redhat.com]
> Sent: Thursday, January 19, 2017 2:45 PM
> To: Nick Fisk
> Cc: ceph-users
> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Yeah, I think you're probably right.  The answer is probably to add an
> explicit rate-limiting element to the way the snaptrim events are
> scheduled.
> -Sam
>
> On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> > I will give those both a go and report back, but the more I thinking
> about this the less I'm convinced that it's going to help.
> >
> > I think the problem is a general IO imbalance, there is probably
> something like 100+ times more trimming IO than client IO and so even if
> client IO gets promoted to the front of the queue by Ceph, once it hits the
> Linux IO layer its fighting for itself. I guess this approach works with
> scrubbing as each read IO has to wait to be read before the next one is
> submitted, so the queue can be managed on the OSD. With trimming, writes
> can buffer up below what the OSD controls.
> >
> > I don't know if the snap trimming goes nuts because the journals are
> acking each request and the spinning disks can't keep up, or if it's
> something else. Does WBThrottle get involved with snap trimming?
> >
> > But from an underlying disk perspective, there is definitely more than 2
> snaps per OSD at a time going on, even if the OSD itself is not processing
> more than 2 at a time. I think there either needs to be another knob so
> that Ceph can throttle back snaps, not just de-prioritise them. Or, there
> needs a whole new kernel interface where an application can priority tag
> individual IO's for CFQ to handle, instead of the current limitation of
> priority per thread, I realise this is probably very very hard or
> impossible. But it would allow C

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-19 Thread Samuel Just
Yeah, I think you're probably right.  The answer is probably to add an
explicit rate-limiting element to the way the snaptrim events are
scheduled.
-Sam

On Thu, Jan 19, 2017 at 1:34 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> I will give those both a go and report back, but the more I thinking about 
> this the less I'm convinced that it's going to help.
>
> I think the problem is a general IO imbalance, there is probably something 
> like 100+ times more trimming IO than client IO and so even if client IO gets 
> promoted to the front of the queue by Ceph, once it hits the Linux IO layer 
> its fighting for itself. I guess this approach works with scrubbing as each 
> read IO has to wait to be read before the next one is submitted, so the queue 
> can be managed on the OSD. With trimming, writes can buffer up below what the 
> OSD controls.
>
> I don't know if the snap trimming goes nuts because the journals are acking 
> each request and the spinning disks can't keep up, or if it's something else. 
> Does WBThrottle get involved with snap trimming?
>
> But from an underlying disk perspective, there is definitely more than 2 
> snaps per OSD at a time going on, even if the OSD itself is not processing 
> more than 2 at a time. I think there either needs to be another knob so that 
> Ceph can throttle back snaps, not just de-prioritise them. Or, there needs a 
> whole new kernel interface where an application can priority tag individual 
> IO's for CFQ to handle, instead of the current limitation of priority per 
> thread, I realise this is probably very very hard or impossible. But it would 
> allow Ceph to control IO queue's right down to the disk.
>
>> -Original Message-
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: 19 January 2017 18:58
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: Dan van der Ster <d...@vanderster.com>; ceph-users 
>> <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>
>> Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the 
>> default value, equal to a 16MB IO) and
>> osd_pg_max_concurrent_snap_trims to 1 (from 2)?
>> -Sam
>>
>> On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> > Hi Sam,
>> >
>> > Thanks for the confirmation on both which thread the trimming happens in 
>> > and for confirming my suspicion that sleeping is now a
>> bad idea.
>> >
>> > The problem I see is that even with setting the priority for trimming down 
>> > low, it still seems to completely swamp the cluster. The
>> trims seem to get submitted in an async nature which seems to leave all my 
>> disks sitting at queue depths of 50+ for several minutes
>> until the snapshot is removed, often also causing several OSD's to get 
>> marked out and start flapping. I'm using WPQ but haven't
>> changed the cutoff variable yet as I know you are working on fixing a bug 
>> with that.
>> >
>> > Nick
>> >
>> >> -Original Message-
>> >> From: Samuel Just [mailto:sj...@redhat.com]
>> >> Sent: 19 January 2017 15:47
>> >> To: Dan van der Ster <d...@vanderster.com>
>> >> Cc: Nick Fisk <n...@fisk.me.uk>; ceph-users
>> >> <ceph-users@lists.ceph.com>
>> >> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>> >>
>> >> Snaptrimming is now in the main op threadpool along with scrub,
>> >> recovery, and client IO.  I don't think it's a good idea to use any of 
>> >> the _sleep configs anymore -- the intention is that by setting the
>> priority low, they won't actually be scheduled much.
>> >> -Sam
>> >>
>> >> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <d...@vanderster.com> 
>> >> wrote:
>> >> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> >> >> Hi Dan,
>> >> >>
>> >> >> I carried out some more testing after doubling the op threads, it
>> >> >> may have had a small benefit as potentially some threads are
>> >> >> available, but latency still sits more or less around the
>> >> >> configured snap sleep time. Even more threads might help, but I
>> >> >> suspect you are just
>> >> lowering the chance of IO's that are stuck behind the sleep, rather than 
>> >> actually solving the problem.
>> >> >>
>> >> >> I'm guessing when the snap trimmin

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-19 Thread Samuel Just
Have you also tried setting osd_snap_trim_cost to be 16777216 (16x the
default value, equal to a 16MB IO) and
osd_pg_max_concurrent_snap_trims to 1 (from 2)?
-Sam

On Thu, Jan 19, 2017 at 7:57 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> Hi Sam,
>
> Thanks for the confirmation on both which thread the trimming happens in and 
> for confirming my suspicion that sleeping is now a bad idea.
>
> The problem I see is that even with setting the priority for trimming down 
> low, it still seems to completely swamp the cluster. The trims seem to get 
> submitted in an async nature which seems to leave all my disks sitting at 
> queue depths of 50+ for several minutes until the snapshot is removed, often 
> also causing several OSD's to get marked out and start flapping. I'm using 
> WPQ but haven't changed the cutoff variable yet as I know you are working on 
> fixing a bug with that.
>
> Nick
>
>> -Original Message-
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: 19 January 2017 15:47
>> To: Dan van der Ster <d...@vanderster.com>
>> Cc: Nick Fisk <n...@fisk.me.uk>; ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>
>> Snaptrimming is now in the main op threadpool along with scrub, recovery, 
>> and client IO.  I don't think it's a good idea to use any of the
>> _sleep configs anymore -- the intention is that by setting the priority low, 
>> they won't actually be scheduled much.
>> -Sam
>>
>> On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster <d...@vanderster.com> 
>> wrote:
>> > On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> >> Hi Dan,
>> >>
>> >> I carried out some more testing after doubling the op threads, it may
>> >> have had a small benefit as potentially some threads are available,
>> >> but latency still sits more or less around the configured snap sleep 
>> >> time. Even more threads might help, but I suspect you are just
>> lowering the chance of IO's that are stuck behind the sleep, rather than 
>> actually solving the problem.
>> >>
>> >> I'm guessing when the snap trimming was in disk thread, you wouldn't
>> >> have noticed these sleeps, but now it's in the op thread it will just sit 
>> >> there holding up all IO and be a lot more noticable. It might be
>> that this option shouldn't be used with Jewel+?
>> >
>> > That's a good thought -- so we need confirmation which thread is doing
>> > the snap trimming. I honestly can't figure it out from the code --
>> > hopefully a dev could explain how it works.
>> >
>> > Otherwise, I don't have much practical experience with snap trimming
>> > in jewel yet -- our RBD cluster is still running 0.94.9.
>> >
>> > Cheers, Dan
>> >
>> >
>> >>
>> >>> -Original Message-
>> >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >>> Behalf Of Nick Fisk
>> >>> Sent: 13 January 2017 20:38
>> >>> To: 'Dan van der Ster' <d...@vanderster.com>
>> >>> Cc: 'ceph-users' <ceph-users@lists.ceph.com>
>> >>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during 
>> >>> sleep?
>> >>>
>> >>> We're on Jewel and your right, I'm pretty sure the snap stuff is also 
>> >>> now handled in the op thread.
>> >>>
>> >>> The dump historic ops socket command showed a 10s delay at the
>> >>> "Reached PG" stage, from Greg's response [1], it would suggest that
>> >>> the OSD itself isn't blocking but the PG it's currently sleeping
>> >>> whilst trimming. I think in the former case, it would have a
>> >> high time
>> >>> on the "Started" part of the op? Anyway I will carry out some more
>> >>> testing with higher osd op threads and see if that makes any difference. 
>> >>> Thanks for the suggestion.
>> >>>
>> >>> Nick
>> >>>
>> >>>
>> >>> [1]
>> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/00865
>> >>> 2.html
>> >>>
>> >>> > -Original Message-
>> >>> > From: Dan van der Ster [mailto:d...@vanderster.com]
>> >>> > Sent: 13 January 2017 10:28
>> >>> > To: Nick Fisk <n...@fisk.me.uk>
>>

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-01-19 Thread Samuel Just
Snaptrimming is now in the main op threadpool along with scrub,
recovery, and client IO.  I don't think it's a good idea to use any of
the _sleep configs anymore -- the intention is that by setting the
priority low, they won't actually be scheduled much.
-Sam

On Thu, Jan 19, 2017 at 5:40 AM, Dan van der Ster  wrote:
> On Thu, Jan 19, 2017 at 1:28 PM, Nick Fisk  wrote:
>> Hi Dan,
>>
>> I carried out some more testing after doubling the op threads, it may have 
>> had a small benefit as potentially some threads are
>> available, but latency still sits more or less around the configured snap 
>> sleep time. Even more threads might help, but I suspect
>> you are just lowering the chance of IO's that are stuck behind the sleep, 
>> rather than actually solving the problem.
>>
>> I'm guessing when the snap trimming was in disk thread, you wouldn't have 
>> noticed these sleeps, but now it's in the op thread it
>> will just sit there holding up all IO and be a lot more noticable. It might 
>> be that this option shouldn't be used with Jewel+?
>
> That's a good thought -- so we need confirmation which thread is doing
> the snap trimming. I honestly can't figure it out from the code --
> hopefully a dev could explain how it works.
>
> Otherwise, I don't have much practical experience with snap trimming
> in jewel yet -- our RBD cluster is still running 0.94.9.
>
> Cheers, Dan
>
>
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Nick Fisk
>>> Sent: 13 January 2017 20:38
>>> To: 'Dan van der Ster' 
>>> Cc: 'ceph-users' 
>>> Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>>
>>> We're on Jewel and your right, I'm pretty sure the snap stuff is also now 
>>> handled in the op thread.
>>>
>>> The dump historic ops socket command showed a 10s delay at the "Reached PG" 
>>> stage, from Greg's response [1], it would suggest
>>> that the OSD itself isn't blocking but the PG it's currently sleeping 
>>> whilst trimming. I think in the former case, it would have a
>> high time
>>> on the "Started" part of the op? Anyway I will carry out some more testing 
>>> with higher osd op threads and see if that makes any
>>> difference. Thanks for the suggestion.
>>>
>>> Nick
>>>
>>>
>>> [1] 
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html
>>>
>>> > -Original Message-
>>> > From: Dan van der Ster [mailto:d...@vanderster.com]
>>> > Sent: 13 January 2017 10:28
>>> > To: Nick Fisk 
>>> > Cc: ceph-users 
>>> > Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>>> >
>>> > Hammer or jewel? I've forgotten which thread pool is handling the snap
>>> > trim nowadays -- is it the op thread yet? If so, perhaps all the op 
>>> > threads are stuck sleeping? Just a wild guess. (Maybe
>> increasing #
>>> op threads would help?).
>>> >
>>> > -- Dan
>>> >
>>> >
>>> > On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk  wrote:
>>> > > Hi,
>>> > >
>>> > > I had been testing some higher values with the osd_snap_trim_sleep
>>> > > variable to try and reduce the impact of removing RBD snapshots on
>>> > > our cluster and I have come across what I believe to be a possible
>>> > > unintended consequence. The value of the sleep seems to keep the
>>> > lock on the PG open so that no other IO can use the PG whilst the snap 
>>> > removal operation is sleeping.
>>> > >
>>> > > I had set the variable to 10s to completely minimise the impact as I
>>> > > had some multi TB snapshots to remove and noticed that suddenly all
>>> > > IO to the cluster had a latency of roughly 10s as well, all the
>>> > dumped ops show waiting on PG for 10s as well.
>>> > >
>>> > > Is the osd_snap_trim_sleep variable only ever meant to be used up to
>>> > > say a max of 0.1s and this is a known side effect, or should the
>>> > > lock on the PG be removed so that normal IO can continue during the
>>> > sleeps?
>>> > >
>>> > > Nick
>>> > >
>>> > > ___
>>> > > ceph-users mailing list
>>> > > ceph-users@lists.ceph.com
>>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
That would work.
-Sam

On Thu, Jan 12, 2017 at 1:40 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just <sj...@redhat.com> wrote:
>> Oh, this is basically working as intended.  What happened is that the
>> mon died before the pending map was actually committed.  The OSD has a
>> timeout (5s) after which it stops trying to mark itself down and just
>> dies (so that OSDs don't hang when killed).  It took a bit longer than
>> 5s for the remaining 2 mons to form a new quorum, so they never got
>> the MOSDMarkMeDown message so we had to do it the slow way.  I would
>> prefer this behavior to changing the mon shutdown process or making
>> the OSDs wait longer, so I think that's it.  If you want to avoid
>> disruption with colocated mons and osds, stop the osds first
>
> We can probably make our systemd scripts do this automatically? Or at
> least, there's a Ceph super-task thingy and I bet we can order the
> shutdown so it waits to kill the monitor until all the OSDs processes
> have ended.
>
>> and then
>> reboot.
>
>
>
>> -Sam
>>
>> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke <ulem...@polarzone.de> wrote:
>>> Hi Sam,
>>>
>>> the webfrontend of an external ceph-dash was interrupted till the node
>>> was up again. The reboot took app. 5 min.
>>>
>>> But  the ceph -w output shows some IO much faster. I will look tomorrow
>>> at the output again and create an ticket.
>>>
>>>
>>> Thanks
>>>
>>>
>>> Udo
>>>
>>>
>>> On 12.01.2017 20:02, Samuel Just wrote:
>>>> How long did it take for the cluster to recover?
>>>> -Sam
>>>>
>>>> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum <gfar...@redhat.com> 
>>>> wrote:
>>>>> On Thu, Jan 12, 2017 at 2:03 AM,  <ulem...@polarzone.de> wrote:
>>>>>> Hi all,
>>>>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>>>>> ceph-cluster. All nodes are mons and have two OSDs.
>>>>>> During reboot of one node, ceph stucks longer than normaly and I look in 
>>>>>> the
>>>>>> "ceph -w" output to find the reason.
>>>>>>
>>>>>> This is not the reason, but I'm wonder why "osd marked itself down" will 
>>>>>> not
>>>>>> recognised by the mons:
>>>>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>>>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>>>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>>>>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>>>>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>>>>> quorum 0,2
>>>>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 
>>>>>> 0,2
>>>>>> 0,2
>>>>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>>>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>>>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
>>>>>> kB/s
>>>>>> wr, 15 op/s
>>>>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>>>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>>>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
>>>>>> B/s
>>>>>> wr, 12 op/s
>>>>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>>>>>> rd, 135 kB/s wr, 15 op/s
>>>>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
>>>>>> B/s
>>>>>> rd, 189 kB/s wr, 7 op/s
>>>>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>>>>> 

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
Oh, this is basically working as intended.  What happened is that the
mon died before the pending map was actually committed.  The OSD has a
timeout (5s) after which it stops trying to mark itself down and just
dies (so that OSDs don't hang when killed).  It took a bit longer than
5s for the remaining 2 mons to form a new quorum, so they never got
the MOSDMarkMeDown message so we had to do it the slow way.  I would
prefer this behavior to changing the mon shutdown process or making
the OSDs wait longer, so I think that's it.  If you want to avoid
disruption with colocated mons and osds, stop the osds first and then
reboot.
-Sam

On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke <ulem...@polarzone.de> wrote:
> Hi Sam,
>
> the webfrontend of an external ceph-dash was interrupted till the node
> was up again. The reboot took app. 5 min.
>
> But  the ceph -w output shows some IO much faster. I will look tomorrow
> at the output again and create an ticket.
>
>
> Thanks
>
>
> Udo
>
>
> On 12.01.2017 20:02, Samuel Just wrote:
>> How long did it take for the cluster to recover?
>> -Sam
>>
>> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum <gfar...@redhat.com> wrote:
>>> On Thu, Jan 12, 2017 at 2:03 AM,  <ulem...@polarzone.de> wrote:
>>>> Hi all,
>>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>>> ceph-cluster. All nodes are mons and have two OSDs.
>>>> During reboot of one node, ceph stucks longer than normaly and I look in 
>>>> the
>>>> "ceph -w" output to find the reason.
>>>>
>>>> This is not the reason, but I'm wonder why "osd marked itself down" will 
>>>> not
>>>> recognised by the mons:
>>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>>> quorum 0,2
>>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>>>> 0,2
>>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>>>> wr, 15 op/s
>>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
>>>> B/s
>>>> wr, 12 op/s
>>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>>>> rd, 135 kB/s wr, 15 op/s
>>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>>>> rd, 189 kB/s wr, 7 op/s
>>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed 
>>>> (2
>>>> reporters from different host after 21.222945 >= grace 20.388836)
>>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed 
>>>> (2
>>>> reporters from different host after 21.222970 >= grace 20.388836)
>>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>>>
>>>> Why trust the mon not the osd? In this case the osdmap will be right app. 
>>>> 26
>>>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>>>
>>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>>> That's not what anybody intended to have happen. It's possible the
>>> simultaneous loss of a monitor and the OSDs is triggering a case
>>> that's not behaving correctly. Can you create a ticket at
>>> tracker.ceph.com with your logs and what steps you took and symptoms
>>> observed?
>>> -Greg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
How long did it take for the cluster to recover?
-Sam

On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>> Hi all,
>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>> ceph-cluster. All nodes are mons and have two OSDs.
>> During reboot of one node, ceph stucks longer than normaly and I look in the
>> "ceph -w" output to find the reason.
>>
>> This is not the reason, but I'm wonder why "osd marked itself down" will not
>> recognised by the mons:
>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>> quorum 0,2
>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>> 0,2
>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>> wr, 15 op/s
>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
>> wr, 12 op/s
>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>> rd, 135 kB/s wr, 15 op/s
>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>> rd, 189 kB/s wr, 7 op/s
>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
>> reporters from different host after 21.222945 >= grace 20.388836)
>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
>> reporters from different host after 21.222970 >= grace 20.388836)
>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>
>> Why trust the mon not the osd? In this case the osdmap will be right app. 26
>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> That's not what anybody intended to have happen. It's possible the
> simultaneous loss of a monitor and the OSDs is triggering a case
> that's not behaving correctly. Can you create a ticket at
> tracker.ceph.com with your logs and what steps you took and symptoms
> observed?
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any librados C API users out there?

2017-01-11 Thread Samuel Just
Jason: librbd itself uses the librados C++ api though, right?
-Sam

On Wed, Jan 11, 2017 at 9:37 AM, Jason Dillaman  wrote:
> +1
>
> I'd be happy to tweak the internals of librbd to support pass-through
> of C buffers all the way to librados. librbd clients like QEMU use the
> C API and this currently results in several extra copies (in librbd
> and librados).
>
> On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek  
> wrote:
>> Hello,
>>
>> As the subject says - are here any users/consumers of librados C API? I'm
>> asking because we're researching if this PR:
>> https://github.com/ceph/ceph/pull/12216 will be actually beneficial for
>> larger group of users. This PR adds a bunch of new APIs that perform object
>> writes without intermediate data copy, which will reduce cpu and memory load
>> on clients. If you're using librados C API for object writes, feel free to
>> comment here or in the pull request.
>>
>>
>> --
>> Piotr Dałek
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
Mm, maybe the tag didn't get pushed.  Alfredo, is there supposed to be
a v11.1.1 tag?
-Sam

On Tue, Jan 10, 2017 at 9:57 AM, Stillwell, Bryan J
<bryan.stillw...@charter.com> wrote:
> That's strange, I installed that version using packages from here:
>
> http://download.ceph.com/debian-kraken/pool/main/c/ceph/
>
>
> Bryan
>
> On 1/10/17, 10:51 AM, "Samuel Just" <sj...@redhat.com> wrote:
>
>>Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
>>-Sam
>>
>>On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
>><bryan.stillw...@charter.com> wrote:
>>> This is from:
>>>
>>> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>>>
>>> On 1/10/17, 10:23 AM, "Samuel Just" <sj...@redhat.com> wrote:
>>>
>>>>What ceph sha1 is that?  Does it include
>>>>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>>>>spin?
>>>>-Sam
>>>>
>>>>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
>>>><bryan.stillw...@charter.com> wrote:
>>>>> On 1/10/17, 5:35 AM, "John Spray" <jsp...@redhat.com> wrote:
>>>>>
>>>>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>>>>><bryan.stillw...@charter.com> wrote:
>>>>>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>>>>>> single node, two OSD cluster, and after a while I noticed that the
>>>>>>>new
>>>>>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>>>>>
>>>>>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>>>>>> ceph-mgr
>>>>>>>
>>>>>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
>>>>>>>CPU
>>>>>>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>>>>>>Has
>>>>>>> anyone else seen this?
>>>>>>
>>>>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>>>>daemon to see if it's obviously spinning in a particular place?
>>>>>
>>>>> I've injected that option to the ceps-mgr process, and now I'm just
>>>>> waiting for it to go out of control again.
>>>>>
>>>>> However, I've noticed quite a few messages like this in the logs
>>>>>already:
>>>>>
>>>>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>>
>>>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN
>>>>>pgs=2
>>>>> cs=1 l=0).fault initiating reconnect
>>>>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>>
>>>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>>>l=0).handle_connect_msg
>>>>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>>>>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>>
>>>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>>>l=0).handle_connect_msg
>>>>> accept peer reset, then tried to connect to us, replacing
>>>>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>>
>>>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing
>>>>>to
>>>>> send and in the half  accept state just closed
>>>>>
>>>>>
>>>>> What's weird about that is that this is a single node cluster with
>>>>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>>>>> host.  So none of the communication should be leaving the node.
>>>>>
>>>>> Bryan
>>>>>
>>>>> E-MAIL CONFIDENTIALITY NOTICE:
>>>>> The contents of this e-mail message and any attachments are intended
>>>>>solely for the addressee(s) and may contain confidential and/or legally
>>>>>privileged information. If you are not the intended recipient o

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
-Sam

On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
<bryan.stillw...@charter.com> wrote:
> This is from:
>
> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>
> On 1/10/17, 10:23 AM, "Samuel Just" <sj...@redhat.com> wrote:
>
>>What ceph sha1 is that?  Does it include
>>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>>spin?
>>-Sam
>>
>>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
>><bryan.stillw...@charter.com> wrote:
>>> On 1/10/17, 5:35 AM, "John Spray" <jsp...@redhat.com> wrote:
>>>
>>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>>><bryan.stillw...@charter.com> wrote:
>>>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>>>> single node, two OSD cluster, and after a while I noticed that the new
>>>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>>>
>>>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>>>> ceph-mgr
>>>>>
>>>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>>>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>>>>Has
>>>>> anyone else seen this?
>>>>
>>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>>daemon to see if it's obviously spinning in a particular place?
>>>
>>> I've injected that option to the ceps-mgr process, and now I'm just
>>> waiting for it to go out of control again.
>>>
>>> However, I've noticed quite a few messages like this in the logs
>>>already:
>>>
>>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>>> cs=1 l=0).fault initiating reconnect
>>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>l=0).handle_connect_msg
>>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>l=0).handle_connect_msg
>>> accept peer reset, then tried to connect to us, replacing
>>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>>> send and in the half  accept state just closed
>>>
>>>
>>> What's weird about that is that this is a single node cluster with
>>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>>> host.  So none of the communication should be leaving the node.
>>>
>>> Bryan
>>>
>>> E-MAIL CONFIDENTIALITY NOTICE:
>>> The contents of this e-mail message and any attachments are intended
>>>solely for the addressee(s) and may contain confidential and/or legally
>>>privileged information. If you are not the intended recipient of this
>>>message or if this message has been addressed to you in error, please
>>>immediately alert the sender by reply e-mail and then delete this
>>>message and any attachments. If you are not the intended recipient, you
>>>are notified that any use, dissemination, distribution, copying, or
>>>storage of this message or any attachment is strictly prohibited.
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> E-MAIL CONFIDENTIALITY NOTICE:
> The contents of this e-mail message and any attachments are intended solely 
> for the addressee(s) and may contain confidential and/or legally privileged 
> information. If you are not the intended recipient of this message or if this 
> message has been addressed to you in error, please immediately alert the 
> sender by reply e-mail and then delete this message and any attachments. If 
> you are not the intended recipient, you are notified that any use, 
> dissemination, distribution, copying, or storage of this message or any 
> attachment is strictly prohibited.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
What ceph sha1 is that?  Does it include
6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
spin?
-Sam

On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
 wrote:
> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>> ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
> I've injected that option to the ceps-mgr process, and now I'm just
> waiting for it to go out of control again.
>
> However, I've noticed quite a few messages like this in the logs already:
>
> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
> cs=1 l=0).fault initiating reconnect
> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept peer reset, then tried to connect to us, replacing
> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
> send and in the half  accept state just closed
>
>
> What's weird about that is that this is a single node cluster with
> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
> host.  So none of the communication should be leaving the node.
>
> Bryan
>
> E-MAIL CONFIDENTIALITY NOTICE:
> The contents of this e-mail message and any attachments are intended solely 
> for the addressee(s) and may contain confidential and/or legally privileged 
> information. If you are not the intended recipient of this message or if this 
> message has been addressed to you in error, please immediately alert the 
> sender by reply e-mail and then delete this message and any attachments. If 
> you are not the intended recipient, you are notified that any use, 
> dissemination, distribution, copying, or storage of this message or any 
> attachment is strictly prohibited.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-10 Thread Samuel Just
Shinobu isn't correct, you have 9/9 osds up and running.  up does not
equal acting because crush is having trouble fulfilling the weights in
your crushmap and the acting set is being padded out with an extra osd
which happens to have the data to keep you up to the right number of
replicas.  Please refer back to Brad's post.
-Sam

On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller  wrote:
> Ok, i understand but how can I debug why they are not running as they should? 
> For me I thought everything is fine because ceph -s said they are up and 
> running.
>
> I would think of a problem with the crush map.
>
>> Am 10.01.2017 um 08:06 schrieb Shinobu Kinjo :
>>
>> e.g.,
>> OSD7 / 3 / 0 are in the same acting set. They should be up, if they
>> are properly running.
>>
>> # 9.7
>> 
>>>   "up": [
>>>   7,
>>>   3
>>>   ],
>>>   "acting": [
>>>   7,
>>>   3,
>>>   0
>>>   ],
>> 
>>
>> Here is an example:
>>
>>  "up": [
>>1,
>>0,
>>2
>>  ],
>>  "acting": [
>>1,
>>0,
>>2
>>   ],
>>
>> Regards,
>>
>>
>> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller  
>> wrote:

 That's not perfectly correct.

 OSD.0/1/2 seem to be down.
>>>
>>>
>>> Sorry but where do you see this? I think this indicates that they are up:   
>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
>>>
>>>
 Am 10.01.2017 um 07:50 schrieb Shinobu Kinjo :

 On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller  
 wrote:
> All osds are currently up:
>
>health HEALTH_WARN
>   4 pgs stuck unclean
>   recovery 4482/58798254 objects degraded (0.008%)
>   recovery 420522/58798254 objects misplaced (0.715%)
>   noscrub,nodeep-scrub flag(s) set
>monmap e9: 5 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>   election epoch 478, quorum 0,1,2,3,4
> ceph1,ceph2,ceph3,ceph4,ceph5
>osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>   flags noscrub,nodeep-scrub
> pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>   15070 GB used, 40801 GB / 55872 GB avail
>   4482/58798254 objects degraded (0.008%)
>   420522/58798254 objects misplaced (0.715%)
>316 active+clean
>  4 active+remapped
> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>
> This did not chance for two days or so.
>
>
> By the way, my ceph osd df now looks like this:
>
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
> 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69
> 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70
> 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69
> 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62
> 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61
> 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62
> 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63
> 7 9.75499  1.0  7450G  2497G  4952G 33.52 1.24
> 8 9.32999  1.0  7450G  2495G  4954G 33.49 1.24
> TOTAL 55872G 15071G 40801G 26.97
> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>
> As you can see, now osd2 also went down to 45% Use and „lost“ data. But I
> also think this is no problem and ceph just clears everything up after
> backfilling.
>
>
> Am 10.01.2017 um 07:29 schrieb Shinobu Kinjo :
>
> Looking at ``ceph -s`` you originally provided, all OSDs are up.
>
> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>
>
> But looking at ``pg query``, OSD.0 / 1 are not up. Are they something

 That's not perfectly correct.

 OSD.0/1/2 seem to be down.

> like related to ?:
>
> Ceph1, ceph2 and ceph3 are vms on one physical host
>
>
> Are those OSDs running on vm instances?
>
> # 9.7
> 
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 3114,
> "up": [
> 7,
> 3
> ],
> "acting": [
> 7,
> 3,
> 0
> ],
>
> 
>
> # 7.84
> 
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 3114,
> "up": [
> 4,
> 8
> ],
> "acting": [
> 4,
> 8,
> 1
> ],
>
> 
>
> # 8.1b
> 
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 3114,
> "up": [
> 4,
> 7
> ],
> "acting": [
> 4,
> 7,
> 2
> ],
>
> 
>
> # 7.7a
> 
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 3114,
> "up": [
> 7,
> 4
> ],
> "acting": [

Re: [ceph-users] pg stuck in peering while power failure

2017-01-10 Thread Samuel Just
{
"name": "Started\/Primary\/Peering",
"enter_time": "2017-01-10 13:43:34.933074",
"past_intervals": [
{
"first": 75858,
"last": 75860,
"maybe_went_rw": 1,
"up": [
345,
622,
685,
183,
792,
2147483647,
2147483647,
401,
516
],
"acting": [
345,
622,
685,
183,
792,
2147483647,
2147483647,
401,
516
],
"primary": 345,
"up_primary": 345
},

Between 75858 and 75860,

345,
622,
685,
183,
792,
2147483647,
2147483647,
401,
516

was the acting set.  The current acting set

345,
622,
685,
183,
2147483647,
2147483647,
153,
401,
516

needs *all 7* of the osds from epochs 75858 through 75860 to ensure
that it has any writes completed during that time.  You can make
transient situations like that less of a problem by setting min_size
to 8 (though it'll prevent writes with 2 failures until backfill
completes).  A possible enhancement for an EC pool would be to gather
the infos from those osds anyway and use that rule out writes (if they
actually happened, you'd still be stuck).
-Sam

On Tue, Jan 10, 2017 at 5:36 AM, Craig Chi  wrote:
> Hi List,
>
> I am testing the stability of my Ceph cluster with power failure.
>
> I brutally powered off 2 Ceph units with each 90 OSDs on it while the client
> I/O was continuing.
>
> Since then, some of the pgs of my cluster stucked in peering
>
>   pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
> 236 TB used, 5652 TB / 5889 TB avail
> 8563455/38919024 objects degraded (22.003%)
>13526 active+undersized+degraded
> 3769 active+clean
>  104 down+remapped+peering
>9 down+peering
>
> I queried the peering pg (all on EC pool with 7+2) and got blocked
> information (full query: http://pastebin.com/pRkaMG2h )
>
> "probing_osds": [
> "153(6)",
> "183(3)",
> "345(0)",
> "401(7)",
> "516(8)",
> "622(1)",
> "685(2)"
> ],
> "blocked": "peering is blocked due to down osds",
> "down_osds_we_would_probe": [
> 792
> ],
> "peering_blocked_by": [
> {
> "osd": 792,
> "current_lost_at": 0,
> "comment": "starting or marking this osd lost may let us
> proceed"
> }
> ]
>
>
> osd.792 is exactly on one of the units I powered off. And I think the I/O
> associated with this pg is paused too.
>
> I have checked the troubleshooting page on Ceph website (
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> ), it says that start the OSD or mark it lost can make the procedure
> continue.
>
> I am sure that my cluster was healthy before the power outage occurred. I am
> wondering if the power outage really happens in production environment, will
> it also freeze my client I/O if I don't do anything? Since I just lost 2
> redundancies (I have erasure code with 7+2), I think it should still serve
> normal functionality.
>
> Or if I am doing something wrong? Please give me some suggestions, thanks.
>
> Sincerely,
> Craig Chi
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kraken 11.x feedback

2016-12-09 Thread Samuel Just
Is there a particular reason you are sticking to the versions with
shorter support periods?
-Sam

On Fri, Dec 9, 2016 at 11:38 AM, Ben Hines  wrote:
> Anyone have any good / bad experiences with Kraken? I haven't seen much
> discussion of it. Particularly from the RGW front.
>
> I'm still on Infernalis for our cluster, considering going up to K.
>
> thanks,
>
> -Ben
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-07 Thread Samuel Just
Actually, Greg and Sage are working up other branches, nvm.
-Sam

On Wed, Dec 7, 2016 at 2:52 PM, Samuel Just <sj...@redhat.com> wrote:
> I just pushed a branch wip-14120-10.2.4 with a possible fix.
>
> https://github.com/ceph/ceph/pull/12349/ is a fix for a known bug
> which didn't quite make it into 10.2.4, it's possible that
> 165e5abdbf6311974d4001e43982b83d06f9e0cc which did made the bug much
> more likely to happen.  wip-14120-10.2.4 has that fix cherry-picked on
> top of 10.2.4.  Can you try it and let us know the result?
> -Sam
>
> On Wed, Dec 7, 2016 at 2:42 PM, Ruben Kerkhof <ru...@rubenkerkhof.com> wrote:
>> On Wed, Dec 7, 2016 at 11:33 PM, Ruben Kerkhof <ru...@rubenkerkhof.com> 
>> wrote:
>>>> And another interesting information (maybe). I have ceph-osd process with 
>>>> big cpu load (as Steve said no iowait and no excessive memory usage). If I 
>>>> restart the ceph-osd daemon cpu load becomes OK during exactly 15 minutes 
>>>> for me. After 15 minutes, I have the cpu load again. It's curious this 
>>>> number of 15 minutes, isn't it?
>>>
>>> Thanks, l'll check how long it takes for this to happen on my cluster.
>>
>> Indeed, 15 minutes, on the dot.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-07 Thread Samuel Just
I just pushed a branch wip-14120-10.2.4 with a possible fix.

https://github.com/ceph/ceph/pull/12349/ is a fix for a known bug
which didn't quite make it into 10.2.4, it's possible that
165e5abdbf6311974d4001e43982b83d06f9e0cc which did made the bug much
more likely to happen.  wip-14120-10.2.4 has that fix cherry-picked on
top of 10.2.4.  Can you try it and let us know the result?
-Sam

On Wed, Dec 7, 2016 at 2:42 PM, Ruben Kerkhof  wrote:
> On Wed, Dec 7, 2016 at 11:33 PM, Ruben Kerkhof  wrote:
>>> And another interesting information (maybe). I have ceph-osd process with 
>>> big cpu load (as Steve said no iowait and no excessive memory usage). If I 
>>> restart the ceph-osd daemon cpu load becomes OK during exactly 15 minutes 
>>> for me. After 15 minutes, I have the cpu load again. It's curious this 
>>> number of 15 minutes, isn't it?
>>
>> Thanks, l'll check how long it takes for this to happen on my cluster.
>
> Indeed, 15 minutes, on the dot.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How are replicas spread in default crush configuration?

2016-11-23 Thread Samuel Just
On Wed, Nov 23, 2016 at 4:11 PM, Chris Taylor  wrote:
> Kevin,
>
> After changing the pool size to 3, make sure the min_size is set to 1 to
> allow 2 of the 3 hosts to be offline.

If you do this, the flip side is that while in that configuration
losing that single
host will render your data unrecoverable (writes were only witnessed by that
osd).

>
> http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
>
> How many MONs do you have and are they on the same OSD hosts? If you have 3
> MONs running on the OSD hosts and two go offline, you will not have a quorum
> of MONs and I/O will be blocked.
>
> I would also check your CRUSH map. I believe you want to make sure your
> rules have "step chooseleaf firstn 0 type host" and not "... type osd" so
> that replicas are on different hosts. I have not had to make that change
> before so you will want to read up on it first. Don't take my word for it.
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
>
> Hope that helps.
>
>
>
> Chris
>
>
>
> On 2016-11-23 1:32 pm, Kevin Olbrich wrote:
>
> Hi,
>
> just to make sure, as I did not find a reference in the docs:
> Are replicas spread across hosts or "just" OSDs?
>
> I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently
> each OSD is a ZFS backed storage array.
> Now I installed a server which is planned to host 4x OSDs (and setting size
> to 3).
>
> I want to make sure we can resist two offline hosts (in terms of hardware).
> Is my assumption correct?
>
> Mit freundlichen Grüßen / best regards,
> Kevin Olbrich.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-23 Thread Samuel Just
Seems like that would be helpful.  I'm not really familiar with
ceph-disk though.
-Sam

On Wed, Nov 23, 2016 at 5:24 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> Hi Sam,
>
> Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype} 
> variable be a good idea? It good either strip it out or
> fail to start the OSD unless an override flag is specified somewhere.
>
> Looking at ceph-disk code, I would imagine around here would be the right 
> place to put the check
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642
>
> I don't mind trying to get this done if its felt to be worthwhile.
>
> Nick
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Samuel Just
>> Sent: 19 November 2016 00:31
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>>
>> Many reasons:
>>
>> 1) You will eventually get a DC wide power event anyway at which point 
>> probably most of the OSDs will have hopelessly corrupted
>> internal xfs structures (yes, I have seen this happen to a poor soul with a 
>> DC with redundant power).
>> 2) Even in the case of a single rack/node power failure, the biggest danger 
>> isn't that the OSDs don't start.  It's that they *do
> start*, but
>> forgot or arbitrarily corrupted a random subset of transactions they told 
>> other osds and clients that they committed.  The exact
> impact
>> would be random, but for sure, any guarantees Ceph normally provides would 
>> be out the window.  RBD devices could have random
>> byte ranges zapped back in time (not great if they're the offsets assigned 
>> to your database or fs journal...) for instance.
>> 3) Deliberately powercycling a node counts as a power failure if you don't 
>> stop services and sync etc first.
>>
>> In other words, don't mess with the definition of "committing a transaction" 
>> if you value your data.
>> -Sam "just say no" Just
>>
>> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> > Yes, because these things happen
>> >
>> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_inter
>> > ruption/
>> >
>> > We had customers who had kit in this DC.
>> >
>> > To use your analogy, it's like crossing the road at traffic lights but
>> > not checking cars have stopped. You might be OK 99%of the time, but
>> > sooner or later it will bite you in the arse and it won't be pretty.
>> >
>> > 
>> > From: "Brian ::" <b...@iptel.co>
>> > Sent: 18 Nov 2016 11:52 p.m.
>> > To: sj...@redhat.com
>> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
>> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
>> >
>> >> X-Assp-URIBLcache failed: '1e100.net'(black.uribl.com)
>> >> X-Assp-Spam-Level: *
>> >> X-Assp-Envelope-From: b...@iptel.co
>> >> X-Assp-Intended-For: n...@fisk.me.uk
>> >> X-Assp-ID: ASSP.fisk.me.uk (47951-11296)
>> >> X-Assp-Version: 1.9.1.4(1.0.00)
>> >>
>> >>
>> >> This is like your mother telling not to cross the road when you were
>> >> 4 years of age but not telling you it was because you could be
>> >> flattened by a car :)
>> >>
>> >> Can you expand on your answer? If you are in a DC with AB power,
>> >> redundant UPS, dual feed from the electric company, onsite
>> >> generators, dual PSU servers, is it still a bad idea?
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
>> >>> cannot stress this enough.
>> >>> -Sam
>> >>>
>> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi Nick and other Cephers,
>> >>>>
>> >>>> Thanks for your reply.
>> >>>>
>> >>>>> 2) Config Errors
>> >>>>> This can be an easy one to say you are safe from. But I would say
>> >>>>> most outages and data loss incidents I have seen on the mailing
>> >>>&

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Samuel Just
Many reasons:

1) You will eventually get a DC wide power event anyway at which point
probably most of the OSDs will have hopelessly corrupted internal xfs
structures (yes, I have seen this happen to a poor soul with a DC with
redundant power).
2) Even in the case of a single rack/node power failure, the biggest
danger isn't that the OSDs don't start.  It's that they *do start*,
but forgot or arbitrarily corrupted a random subset of transactions
they told other osds and clients that they committed.  The exact
impact would be random, but for sure, any guarantees Ceph normally
provides would be out the window.  RBD devices could have random byte
ranges zapped back in time (not great if they're the offsets assigned
to your database or fs journal...) for instance.
3) Deliberately powercycling a node counts as a power failure if you
don't stop services and sync etc first.

In other words, don't mess with the definition of "committing a
transaction" if you value your data.
-Sam "just say no" Just

On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> Yes, because these things happen
>
> http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>
> We had customers who had kit in this DC.
>
> To use your analogy, it's like crossing the road at traffic lights but not
> checking cars have stopped. You might be OK 99%of the time, but sooner or
> later it will bite you in the arse and it won't be pretty.
>
> 
> From: "Brian ::" <b...@iptel.co>
> Sent: 18 Nov 2016 11:52 p.m.
> To: sj...@redhat.com
> Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>
>> X-Assp-URIBLcache failed: '1e100.net'(black.uribl.com)
>> X-Assp-Spam-Level: *
>> X-Assp-Envelope-From: b...@iptel.co
>> X-Assp-Intended-For: n...@fisk.me.uk
>> X-Assp-ID: ASSP.fisk.me.uk (47951-11296)
>> X-Assp-Version: 1.9.1.4(1.0.00)
>>
>>
>> This is like your mother telling not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>>
>>
>>
>>
>> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
>>> cannot stress this enough.
>>> -Sam
>>>
>>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>>> wrote:
>>>>
>>>> Hi Nick and other Cephers,
>>>>
>>>> Thanks for your reply.
>>>>
>>>>> 2) Config Errors
>>>>> This can be an easy one to say you are safe from. But I would say most
>>>>> outages and data loss incidents I have seen on the mailing
>>>>> lists have been due to poor hardware choice or configuring options such
>>>>> as
>>>>> size=2, min_size=1 or enabling stuff like nobarriers.
>>>>
>>>>
>>>> I am wondering the pros and cons of the nobarrier option used by Ceph.
>>>>
>>>> It is well known that nobarrier is dangerous when power outage happens,
>>>> but
>>>> if we already have replicas in different racks or PDUs, will Ceph reduce
>>>> the
>>>> risk of data lost with this option?
>>>>
>>>> I have seen many performance tuning articles providing nobarrier option
>>>> in
>>>> xfs, but there are not many of then mention the trade-off of nobarrier.
>>>>
>>>> Is it really unacceptable to use nobarrier in production environment? I
>>>> will
>>>> be much grateful if you guys are willing to share any experiences about
>>>> nobarrier and xfs.
>>>>
>>>> Sincerely,
>>>> Craig Chi (Product Developer)
>>>> Synology Inc. Taipei, Taiwan. Ext. 361
>>>>
>>>> On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>
>>>>> -Original Message-
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of
>>>>> Pedro Benites
>>>>> Sent: 16 November 2016 17:51
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>>>
>>>>> Hi,
&

Re: [ceph-users] After OSD Flap - FAILED assert(oi.version == i->first)

2016-11-17 Thread Samuel Just
Puzzling, added a question to the ticket.
-Sam

On Thu, Nov 17, 2016 at 4:32 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> Hi Sam,
>
> I've updated the ticket with logs from the wip run.
>
> Nick
>
>> -Original Message-
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: 15 November 2016 18:30
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: Ceph Users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] After OSD Flap - FAILED assert(oi.version == 
>> i->first)
>>
>> http://tracker.ceph.com/issues/17916
>>
>> I just pushed a branch wip-17916-jewel based on v10.2.3 with some additional 
>> debugging.  Once it builds, would you be able to start
>> the afflicted osds with that version of ceph-osd and
>>
>> debug osd = 20
>> debug ms = 1
>> debug filestore = 20
>>
>> and get me the log?
>> -Sam
>>
>> On Tue, Nov 15, 2016 at 2:06 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>> > Hi,
>> >
>> > I have two OSD's which are failing with an assert which looks related
>> > to missing objects. This happened after a large RBD snapshot was
>> > deleted causing several OSD's to start flapping as they experienced
>> > high load. Cluster is fully recovered and I don't need any help from a 
>> > recovery perspective. I'm happy to Zap and recreate OSD's,
>> which I will probably do in a couple of days time. Or if anybody looks at 
>> the error and see's an easy way to get the OSD to start up, then
>> bonus!!!
>> >
>> > However, I thought I would post in case there is any interest in
>> > trying to diagnose why this happened. There was no power or networking 
>> > issues and no hard reboot's, so this is purely contained
>> within the Ceph OSD process.
>> >
>> > The objects that it claims are missing are from the RBD that had the
>> > snapshot deleted. I'm guessing that the last command before the OSD
>> > died at some point was to delete those two objects which did actually 
>> > happen, but for some reason the OSD had died before it got
>> confirmation??? And now it's trying to delete them, but they don't exist.
>> >
>> > I have the full debug 20 log, but pretty much all the lines above the
>> > below snippet just have it deleting thousands of objects without any 
>> > problems.
>> >
>> > Nick
>> >
>> >  -4> 2016-11-15 09:46:52.061643 7f728f9368c0 20 read_log 6 divergent_priors
>> > -3> 2016-11-15 09:46:52.061779 7f728f9368c0 10 read_log checking for 
>> > missing items over interval (0'0,1607344'260104]
>> > -2> 2016-11-15 09:46:52.069987 7f728f9368c0 15 read_log  missing
>> > 1553246'255377,1:96e51ad6:::rbd_data.6fd18238e1f29.002555c5:head
>> > -1> 2016-11-15 09:46:52.070007 7f728f9368c0 15 read_log  missing
>> > 1553190'255366,1:96e51ad6:::rbd_data.6fd18238e1f29.002555c5:6c
>> >  0> 2016-11-15 09:46:52.071471 7f728f9368c0 -1 osd/PGLog.cc: In
>> > function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t,
>> > ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&,
>> > PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, const
>> > DoutPrefixProvider*, std::set<std::__cxx11::basic_string >*)'
>> > thread 7f728f9368c0 time 2016-11-15 09:46:52.070023
>> > osd/PGLog.cc: 1047: FAILED assert(oi.version == i->first)
>> >
>> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x80) [0x5642d2734ea0]
>> >  2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t,
>> > pg_info_t const&, std::map<eversion_t, hobject_t,
>> > std::less, std::allocator<std::pair> > hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
>> > std::__cxx11::basic_ostringstream<char, std::char_traits,
>> > std::allocator >&, DoutPrefixProvider const*,
>> > std::set<std::__cxx11::basic_string<char, std::char_traits,
>> > std::allocator >, std::less<std::__cxx11::basic_string<char,
>> > std::char_traits, std::allocator > >,
>> > std::allocator<std::__cxx11::basic_string<char,
>> > std::char_traits, std::allocator > > >*)+0x719)
>> > [0x5642d22e2fd9]
>> >  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x2f6)
>> > [0x5642d21172d6]
>> >  4: (OSD::load_pgs()+0x87d) [0x5642d205345d]
>> >  5: (OSD::init()+0x2026) [0x5642d205e7a6]
>> >  6: (main()+0x2ea5) [0x5642d1fd08f5]
>> >  7: (__libc_start_main()+0xf0) [0x7f728c77c830]
>> >  8: (_start()+0x29) [0x5642d2011f89]
>> >  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
>> > to interpret this.
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After OSD Flap - FAILED assert(oi.version == i->first)

2016-11-15 Thread Samuel Just
http://tracker.ceph.com/issues/17916

I just pushed a branch wip-17916-jewel based on v10.2.3 with some
additional debugging.  Once it builds, would you be able to start the
afflicted osds with that version of ceph-osd and

debug osd = 20
debug ms = 1
debug filestore = 20

and get me the log?
-Sam

On Tue, Nov 15, 2016 at 2:06 AM, Nick Fisk  wrote:
> Hi,
>
> I have two OSD's which are failing with an assert which looks related to 
> missing objects. This happened after a large RBD snapshot
> was deleted causing several OSD's to start flapping as they experienced high 
> load. Cluster is fully recovered and I don't need any
> help from a recovery perspective. I'm happy to Zap and recreate OSD's, which 
> I will probably do in a couple of days time. Or if
> anybody looks at the error and see's an easy way to get the OSD to start up, 
> then bonus!!!
>
> However, I thought I would post in case there is any interest in trying to 
> diagnose why this happened. There was no power or
> networking issues and no hard reboot's, so this is purely contained within 
> the Ceph OSD process.
>
> The objects that it claims are missing are from the RBD that had the snapshot 
> deleted. I'm guessing that the last command before the
> OSD died at some point was to delete those two objects which did actually 
> happen, but for some reason the OSD had died before it got
> confirmation??? And now it's trying to delete them, but they don't exist.
>
> I have the full debug 20 log, but pretty much all the lines above the below 
> snippet just have it deleting thousands of objects
> without any problems.
>
> Nick
>
>  -4> 2016-11-15 09:46:52.061643 7f728f9368c0 20 read_log 6 divergent_priors
> -3> 2016-11-15 09:46:52.061779 7f728f9368c0 10 read_log checking for 
> missing items over interval (0'0,1607344'260104]
> -2> 2016-11-15 09:46:52.069987 7f728f9368c0 15 read_log  missing
> 1553246'255377,1:96e51ad6:::rbd_data.6fd18238e1f29.002555c5:head
> -1> 2016-11-15 09:46:52.070007 7f728f9368c0 15 read_log  missing
> 1553190'255366,1:96e51ad6:::rbd_data.6fd18238e1f29.002555c5:6c
>  0> 2016-11-15 09:46:52.071471 7f728f9368c0 -1 osd/PGLog.cc: In function 
> 'static void PGLog::read_log(ObjectStore*, coll_t,
> coll_t, ghobject_t, const pg_info_t&, std::map&, 
> PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&,
> const DoutPrefixProvider*, std::set*)' 
> thread 7f728f9368c0 time 2016-11-15 09:46:52.070023
> osd/PGLog.cc: 1047: FAILED assert(oi.version == i->first)
>
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5642d2734ea0]
>  2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t 
> const&, std::map std::less, std::allocator > >&, PGLog::IndexedLog&, pg_missing_t&,
> std::__cxx11::basic_ostringstream std::allocator >&, DoutPrefixProvider const*,
> std::set std::allocator >, std::less std::char_traits, std::allocator > >, 
> std::allocator std::allocator > > >*)+0x719) [0x5642d22e2fd9]
>  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x2f6) [0x5642d21172d6]
>  4: (OSD::load_pgs()+0x87d) [0x5642d205345d]
>  5: (OSD::init()+0x2026) [0x5642d205e7a6]
>  6: (main()+0x2ea5) [0x5642d1fd08f5]
>  7: (__libc_start_main()+0xf0) [0x7f728c77c830]
>  8: (_start()+0x29) [0x5642d2011f89]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cache tier on rgw index pool

2016-09-21 Thread Samuel Just
I seriously doubt that it's ever going to be a winning strategy to let
rgw index objects go to a cold tier.  Some practical problems:
1) We don't track omap size (the leveldb entries for an object)
because it would turn writes into rmw's -- so they always show up as 0
size.  Thus, the target_max_bytes param is going to be useless.
2) You can't store omap objects on an ec pool at all, so if the base
pool is an ec pool, nothing will ever be demoted.
3) We always promote whole objects.

As to point 2., I'm guessing that Greg meant that OSDs don't care
about each other's leveldb instances *directly* since leveldb itself
is behind two layers of interfaces (one osd might have bluestore using
rocksdb, while the other might have filestore with some other
key-value db entirely).  Of course, replication -- certainly including
the omap entries -- still happens, but at the object level rather than
at the key-value db level.
-Sam

On Wed, Sep 21, 2016 at 5:43 AM, Abhishek Varshney
 wrote:
> Hi,
>
> I am evaluating on setting up a cache tier for the rgw index pool and
> have a few questions regarding that. The rgw index pool is different
> as it completely stores the data in leveldb. The 'rados df' command on
> the existing index pool shows size in KB as 0 on a 1 PB cluster with
> 500 million objects running ceph 0.94.2.
>
> Seeking clarifications on the following points:
>
> 1. How are the cache tier parameters like target_max_bytes,
> cache_target_dirty_ratio and cache_target_full_ratio honoured given
> the size of index pool is shown as 0 and how does flush/eviction take
> place in this case? Is there any specific reason why the omap data is
> not reflected in the size, as Sage mentions it here [1]
>
> 2. I found a mail archive in ceph-devel where Greg mentions that
> "there's no cross-OSD LevelDB replication or communication" [2]. In
> that case,  how does ceph handle re-balancing of leveldb instance data
> in case of node failure?
>
> 3. Are there any surprises that can be expected on deploying a cache
> tier for rgw index pool ?
>
> [1] http://www.spinics.net/lists/ceph-devel/msg28635.html
> [2] http://www.spinics.net/lists/ceph-devel/msg24990.html
>
> Thanks
> Abhishek Varshney
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Same pg scrubbed over and over (Jewel)

2016-09-21 Thread Samuel Just
Ah, same question then.  If we can get logging on the primary for one
of those pgs, it should be fairly obvious.
-Sam

On Wed, Sep 21, 2016 at 4:08 AM, Pavan Rallabhandi
 wrote:
> We find this as well in our fresh built Jewel clusters, and seems to happen 
> only with a handful of PGs from couple of pools.
>
> Thanks!
>
> On 9/21/16, 3:14 PM, "ceph-users on behalf of Tobias Böhm" 
>  wrote:
>
> Hi,
>
> there is an open bug in the tracker: http://tracker.ceph.com/issues/16474
>
> It also suggests restarting OSDs as a workaround. We faced the same issue 
> after increasing the number of PGs in our cluster and restarting OSDs solved 
> it as well.
>
> Tobias
>
> > Am 21.09.2016 um 11:26 schrieb Dan van der Ster :
> >
> > There was a thread about this a few days ago:
> > 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012857.html
> > And the OP found a workaround.
> > Looks like a bug though... (by default PGs scrub at most once per day).
> >
> > -- dan
> >
> >
> >
> > On Tue, Sep 20, 2016 at 10:43 PM, Martin Bureau  
> wrote:
> >> Hello,
> >>
> >>
> >> I noticed that the same pg gets scrubbed repeatedly on our new Jewel
> >> cluster:
> >>
> >>
> >> Here's an excerpt from log:
> >>
> >>
> >> 2016-09-20 20:36:31.236123 osd.12 10.1.82.82:6820/14316 150514 : 
> cluster
> >> [INF] 25.3f scrub ok
> >> 2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : 
> cluster
> >> [INF] 25.3f scrub starts
> >> 2016-09-20 20:36:32.236876 osd.12 10.1.82.82:6820/14316 150516 : 
> cluster
> >> [INF] 25.3f scrub ok
> >> 2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : 
> cluster
> >> [INF] 25.3f deep-scrub starts
> >> 2016-09-20 20:36:33.242258 osd.12 10.1.82.82:6820/14316 150518 : 
> cluster
> >> [INF] 25.3f deep-scrub ok
> >> 2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : 
> cluster
> >> [INF] 25.3f scrub starts
> >> 2016-09-20 20:36:36.237221 osd.12 10.1.82.82:6820/14316 150520 : 
> cluster
> >> [INF] 25.3f scrub ok
> >> 2016-09-20 20:36:41.234490 osd.12 10.1.82.82:6820/14316 150521 : 
> cluster
> >> [INF] 25.3f deep-scrub starts
> >> 2016-09-20 20:36:41.243720 osd.12 10.1.82.82:6820/14316 150522 : 
> cluster
> >> [INF] 25.3f deep-scrub ok
> >> 2016-09-20 20:36:45.235128 osd.12 10.1.82.82:6820/14316 150523 : 
> cluster
> >> [INF] 25.3f deep-scrub starts
> >> 2016-09-20 20:36:45.352589 osd.12 10.1.82.82:6820/14316 150524 : 
> cluster
> >> [INF] 25.3f deep-scrub ok
> >> 2016-09-20 20:36:47.235310 osd.12 10.1.82.82:6820/14316 150525 : 
> cluster
> >> [INF] 25.3f scrub starts
> >> 2016-09-20 20:36:47.239348 osd.12 10.1.82.82:6820/14316 150526 : 
> cluster
> >> [INF] 25.3f scrub ok
> >> 2016-09-20 20:36:49.235538 osd.12 10.1.82.82:6820/14316 150527 : 
> cluster
> >> [INF] 25.3f deep-scrub starts
> >> 2016-09-20 20:36:49.243121 osd.12 10.1.82.82:6820/14316 150528 : 
> cluster
> >> [INF] 25.3f deep-scrub ok
> >> 2016-09-20 20:36:51.235956 osd.12 10.1.82.82:6820/14316 150529 : 
> cluster
> >> [INF] 25.3f deep-scrub starts
> >> 2016-09-20 20:36:51.244201 osd.12 10.1.82.82:6820/14316 150530 : 
> cluster
> >> [INF] 25.3f deep-scrub ok
> >> 2016-09-20 20:36:52.236076 osd.12 10.1.82.82:6820/14316 150531 : 
> cluster
> >> [INF] 25.3f scrub starts
> >> 2016-09-20 20:36:52.239376 osd.12 10.1.82.82:6820/14316 150532 : 
> cluster
> >> [INF] 25.3f scrub ok
> >> 2016-09-20 20:36:56.236740 osd.12 10.1.82.82:6820/14316 150533 : 
> cluster
> >> [INF] 25.3f scrub starts
> >>
> >>
> >> How can I troubleshoot / resolve this ?
> >>
> >>
> >> Regards,
> >>
> >> Martin
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crash of osd using cephfs jewel 10.2.2, and corruption

2016-09-21 Thread Samuel Just
Looks like the OSD didn't like an error return it got from the
underlying fs.  Can you reproduce with

debug filestore = 20
debug osd = 20
debug ms = 1

on the osd and post the whole log?
-Sam

On Wed, Sep 21, 2016 at 12:10 AM, Peter Maloney
 wrote:
> Hi,
>
> I created a one disk osd with data and separate journal on the same lvm
> volume group just for test, one mon, one mds on my desktop.
>
> I managed to crash the osd just by mounting cephfs and doing cp -a of
> the linux-stable git tree into it. It crashed after copying 2.1G which
> only covers some of the .git dir and none of the rest. And then when I
> killed ceph-mds and restarted the osd and mds, ceph -s said something
> about the pgs being stuck or unclean or something, and the computer
> froze. :/ After booting again, everything is fine, and the problem was
> reproducable the same way...just copying the files again.[but after
> writing this mail, I can't seem to cause it as easily again... copying
> again works, but sha1sum doesn't, even if I drop caches]
>
> Also reading seems to do the same.
>
> And then I tried adding a 2nd osd (also from vlm, with osd and journal
> on same volume group). And that seemed to stop the crashing, but not
> sure about corruption.I guess the corruption was on the cephfs but RAM
> had good copies or something, so rebooting, etc. is what made the
> corruption appear? (I tried to reproduce, but couldn't...didn't try
> killing daemons)
>
>> root@client:/mnt/test # ls -l
>> total 447
>> drwx-- 1 root root  4 2016-09-20 20:37 1/
>> drwx-- 1 root root  4 2016-09-20 20:37 2/
>> drwx-- 1 root root  4 2016-09-20 20:37 linux-stable/
>> -rw-r--r-- 1 root root 457480 2016-09-20 21:38 sums.txt
>> root@client:/mnt/test # (cd linux-stable/; sha1sum -c --quiet
>> ../sums.txt )
> (osd crashed before that finished ... and then impressively, starting
> the osd again made the command finish gracefully... and then tried rsync
> to finish copying and 6 or so restarts later it finished with just the 1
> rsync run)
>
> And then the checksums didn't match... (corruption)
>> root@client:/mnt/test # (cd linux-stable/; sha1sum -c --quiet
>> ../sums.txt )
>> ./.git/objects/e6/635671beff26a417c02d50adeefa2a6897a9dd: FAILED
>> ./.git/objects/e6/d58d90213a4a283d428988a398281663dd68e4: FAILED
>> ./.git/objects/81/281381965b21d3c23b2f877e214c4af65d6fa4: FAILED
>> ./.git/objects/4c/f549c4a9b23638ab49cc0f8b47c395b1fc8ede: FAILED
>> sha1sum: WARNING: 4 computed checksums did NOT match
>
>> root@client:/mnt/test # hexdump -C
>> linux-stable/.git/objects/e6/635671beff26a417c02d50adeefa2a6897a9dd
>>   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
>> ||
>> *
>> 0dd0  00 00 00  |...|
>> 0dd3
>> peter@peter:~/projects $ hexdump -C
>> linux-stable/.git/objects/e6/635671beff26a417c02d50adeefa2a6897a9dd | head
>>   78 01 bd 5b 7d 6f db c6  19 df bf d4 a7 38 20 40
>> |x..[}o...8 @|
>> 0010  27 0b 8e ec 14 2b 06 24  5d 90 34 75 5c 63 4e 6c
>> |'+.$].4u\cNl|
>> 0020  d8 f1 82 62 19 08 9a 3a  59 ac 29 52 25 29 bb 5e
>> |...b...:Y.)R%).^|
>> 0030  9a ef be df f3 dc 1d 79  c7 77 39 c1 84 c0 12 79
>> |...y.w9y|
>> 0040  77 cf fb 3b 99 eb 38 bd  16 cf 7e 38 fc e1 f0 2f
>> |w..;..8...~8.../|
>> 0050  07 b3 89 98 89 fc 21 5f  e6 f3 95 78 2a 16 72 19
>> |..!_...x*.r.|
>> 0060  25 51 11 a5 49 2e 96 69  26 8a 95 c4 bd bb 28 c4
>> |%Q..I..i&.(.|
>> 0070  57 16 dd c9 4c 2c a3 58  62 7f 21 d7 38 49 87 df
>> |W...L,.Xb.!.8I..|
>> 0080  a4 9b 87 2c ba 59 15 62  1a ee 89 ef 0f 0f 9f ed
>> |...,.Y.b|
>> 0090  e3 cf f7 e2 3c 28 b2 28  bc 15 ef d2 70 25 e3 d6
>> |<(.(p%..|
> and then copying that and testing checksums again has even more failures
>
>> root@client:/mnt/test # cp -a linux-stable 3
>> root@client:/mnt/test # (cd 3/; sha1sum -c --quiet ../sums.txt )
>> ./net/iucv/iucv.c: FAILED
>> ./net/kcm/kcmsock.c: FAILED
>> ./net/irda/ircomm/ircomm_event.c: FAILED
>> ./net/irda/ircomm/ircomm_tty_attach.c: FAILED
>> ./net/llc/Makefile: FAILED
>> ./net/llc/Kconfig: FAILED
>> ./net/llc/af_llc.c: FAILED
>> ./net/lapb/lapb_timer.c: FAILED
>> ./net/lapb/lapb_subr.c: FAILED
>> ./net/lapb/lapb_iface.c: FAILED
>> ./net/lapb/Makefile: FAILED
>> ./net/lapb/lapb_in.c: FAILED
>> ./net/lapb/lapb_out.c: FAILED
>> ./net/l2tp/l2tp_eth.c: FAILED
>> ./net/l2tp/Kconfig: FAILED
>> ./net/l2tp/l2tp_core.h: FAILED
>> ./.git/objects/e6/635671beff26a417c02d50adeefa2a6897a9dd: FAILED
>> ./.git/objects/e6/d58d90213a4a283d428988a398281663dd68e4: FAILED
>> ./.git/objects/81/281381965b21d3c23b2f877e214c4af65d6fa4: FAILED
>> ./.git/objects/4c/f549c4a9b23638ab49cc0f8b47c395b1fc8ede: FAILED
>> sha1sum: WARNING: 20 computed checksums did NOT match
>> root@client:/mnt/test # hexdump -C 3/net/iucv/iucv.c
>>   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
>> ||
>> *
>> d420  

Re: [ceph-users] Same pg scrubbed over and over (Jewel)

2016-09-21 Thread Samuel Just
Can you reproduce with logging on the primary for that pg?

debug osd = 20
debug filestore = 20
debug ms = 1

Since restarting the osd may be a workaround, can you inject the debug
values without restarting the daemon?
-Sam

On Wed, Sep 21, 2016 at 2:44 AM, Tobias Böhm  wrote:
> Hi,
>
> there is an open bug in the tracker: http://tracker.ceph.com/issues/16474
>
> It also suggests restarting OSDs as a workaround. We faced the same issue 
> after increasing the number of PGs in our cluster and restarting OSDs solved 
> it as well.
>
> Tobias
>
>> Am 21.09.2016 um 11:26 schrieb Dan van der Ster :
>>
>> There was a thread about this a few days ago:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012857.html
>> And the OP found a workaround.
>> Looks like a bug though... (by default PGs scrub at most once per day).
>>
>> -- dan
>>
>>
>>
>> On Tue, Sep 20, 2016 at 10:43 PM, Martin Bureau  wrote:
>>> Hello,
>>>
>>>
>>> I noticed that the same pg gets scrubbed repeatedly on our new Jewel
>>> cluster:
>>>
>>>
>>> Here's an excerpt from log:
>>>
>>>
>>> 2016-09-20 20:36:31.236123 osd.12 10.1.82.82:6820/14316 150514 : cluster
>>> [INF] 25.3f scrub ok
>>> 2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : cluster
>>> [INF] 25.3f scrub starts
>>> 2016-09-20 20:36:32.236876 osd.12 10.1.82.82:6820/14316 150516 : cluster
>>> [INF] 25.3f scrub ok
>>> 2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : cluster
>>> [INF] 25.3f deep-scrub starts
>>> 2016-09-20 20:36:33.242258 osd.12 10.1.82.82:6820/14316 150518 : cluster
>>> [INF] 25.3f deep-scrub ok
>>> 2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : cluster
>>> [INF] 25.3f scrub starts
>>> 2016-09-20 20:36:36.237221 osd.12 10.1.82.82:6820/14316 150520 : cluster
>>> [INF] 25.3f scrub ok
>>> 2016-09-20 20:36:41.234490 osd.12 10.1.82.82:6820/14316 150521 : cluster
>>> [INF] 25.3f deep-scrub starts
>>> 2016-09-20 20:36:41.243720 osd.12 10.1.82.82:6820/14316 150522 : cluster
>>> [INF] 25.3f deep-scrub ok
>>> 2016-09-20 20:36:45.235128 osd.12 10.1.82.82:6820/14316 150523 : cluster
>>> [INF] 25.3f deep-scrub starts
>>> 2016-09-20 20:36:45.352589 osd.12 10.1.82.82:6820/14316 150524 : cluster
>>> [INF] 25.3f deep-scrub ok
>>> 2016-09-20 20:36:47.235310 osd.12 10.1.82.82:6820/14316 150525 : cluster
>>> [INF] 25.3f scrub starts
>>> 2016-09-20 20:36:47.239348 osd.12 10.1.82.82:6820/14316 150526 : cluster
>>> [INF] 25.3f scrub ok
>>> 2016-09-20 20:36:49.235538 osd.12 10.1.82.82:6820/14316 150527 : cluster
>>> [INF] 25.3f deep-scrub starts
>>> 2016-09-20 20:36:49.243121 osd.12 10.1.82.82:6820/14316 150528 : cluster
>>> [INF] 25.3f deep-scrub ok
>>> 2016-09-20 20:36:51.235956 osd.12 10.1.82.82:6820/14316 150529 : cluster
>>> [INF] 25.3f deep-scrub starts
>>> 2016-09-20 20:36:51.244201 osd.12 10.1.82.82:6820/14316 150530 : cluster
>>> [INF] 25.3f deep-scrub ok
>>> 2016-09-20 20:36:52.236076 osd.12 10.1.82.82:6820/14316 150531 : cluster
>>> [INF] 25.3f scrub starts
>>> 2016-09-20 20:36:52.239376 osd.12 10.1.82.82:6820/14316 150532 : cluster
>>> [INF] 25.3f scrub ok
>>> 2016-09-20 20:36:56.236740 osd.12 10.1.82.82:6820/14316 150533 : cluster
>>> [INF] 25.3f scrub starts
>>>
>>>
>>> How can I troubleshoot / resolve this ?
>>>
>>>
>>> Regards,
>>>
>>> Martin
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD daemon randomly stops

2016-09-02 Thread Samuel Just
Probably an EIO.  You can reproduce with debug filestore = 20 to confirm.
-Sam

On Fri, Sep 2, 2016 at 10:18 AM, Reed Dier  wrote:
> OSD has randomly stopped for some reason. Lots of recovery processes
> currently running on the ceph cluster. OSD log with assert below:
>
> -14> 2016-09-02 11:32:38.672460 7fcf65514700  5 -- op tracker -- seq: 1147,
> time: 2016-09-02 11:32:38.672460, event: queued_for_pg, op:
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-13> 2016-09-02 11:32:38.672533 7fcf70d40700  5 -- op tracker -- seq:
> 1147, time: 2016-09-02 11:32:38.672533, event: reached_pg, op:
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-12> 2016-09-02 11:32:38.672548 7fcf70d40700  5 -- op tracker -- seq:
> 1147, time: 2016-09-02 11:32:38.672548, event: started, op:
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-11> 2016-09-02 11:32:38.672548 7fcf7cd58700  1 -- [].28:6800/27735 <==
> mon.0 [].249:6789/0 60  pg_stats_ack(0 pgs tid 45) v1  4+0+0 (0 0 0)
> 0x55a4443b1400 con 0x55a4434a4e80
>-10> 2016-09-02 11:32:38.672559 7fcf70d40700  1 -- [].28:6801/27735 -->
> [].31:6801/2070838 -- osd_sub_op(unknown.0.0:0 7.d1 MIN [scrub-unreserve] v
> 0'0 snapset=0=[]:[]) v12 -- ?+0 0x55a443aec100 con 0x55a443be0600
> -9> 2016-09-02 11:32:38.672571 7fcf70d40700  5 -- op tracker -- seq:
> 1147, time: 2016-09-02 11:32:38.672571, event: done, op:
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
> -8> 2016-09-02 11:32:38.681929 7fcf7b555700  1 -- [].28:6801/27735 <==
> osd.2 [].26:6801/9468 148  MBackfillReserve GRANT  pgid: 15.11,
> query_epoch: 4235 v3  30+0+0 (3067148394 0 0) 0x55a4441f65a0 con
> 0x55a4434ab200
> -7> 2016-09-02 11:32:38.682009 7fcf7b555700  5 -- op tracker -- seq:
> 1148, time: 2016-09-02 11:32:38.682008, event: done, op: MBackfillReserve
> GRANT  pgid: 15.11, query_epoch: 4235
> -6> 2016-09-02 11:32:38.682068 7fcf73545700  5 osd.4 pg_epoch: 4235
> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739
> ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233
> pi=4002-4232/47 (log bound mismatch
> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0
> active+undersized+degraded+remapped+wait_backfill] exit
> Started/Primary/Active/WaitRemoteBackfillReserved 221.748180 6 0.56
> -5> 2016-09-02 11:32:38.682109 7fcf73545700  5 osd.4 pg_epoch: 4235
> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739
> ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233
> pi=4002-4232/47 (log bound mismatch
> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0
> active+undersized+degraded+remapped+wait_backfill] enter
> Started/Primary/Active/Backfilling
> -4> 2016-09-02 11:32:38.682584 7fcf7b555700  1 -- [].28:6801/27735 <==
> osd.6 [].30:6801/44406 171  osd pg remove(epoch 4235; pg6.19; ) v2 
> 30+0+0 (522063165 0 0) 0x55a44392f680 con 0x55a443bae100
> -3> 2016-09-02 11:32:38.682600 7fcf7b555700  5 -- op tracker -- seq:
> 1149, time: 2016-09-02 11:32:38.682600, event: started, op: osd pg
> remove(epoch 4235; pg6.19; )
> -2> 2016-09-02 11:32:38.682616 7fcf7b555700  5 osd.4 4235
> queue_pg_for_deletion: 6.19
> -1> 2016-09-02 11:32:38.685425 7fcf7b555700  5 -- op tracker -- seq:
> 1149, time: 2016-09-02 11:32:38.685421, event: done, op: osd pg remove(epoch
> 4235; pg6.19; )
>  0> 2016-09-02 11:32:38.690487 7fcf6c537700 -1 osd/ReplicatedPG.cc: In
> function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*,
> ThreadPool::TPHandle&)' thread 7fcf6c537700 time 2016-09-02 11:32:38.688536
> osd/ReplicatedPG.cc: 11345: FAILED assert(r >= 0)
>
>  2016-09-02 11:32:38.711869 7fcf6c537700 -1 *** Caught signal (Aborted) **
>
>  in thread 7fcf6c537700 thread_name:tp_osd_recov
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (()+0x8ebb02) [0x55a402375b02]
>  2: (()+0x10330) [0x7fcfa2b51330]
>  3: (gsignal()+0x37) [0x7fcfa0bb3c37]
>  4: (abort()+0x148) [0x7fcfa0bb7028]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x265) [0x55a40246cf85]
>  6: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*,
> ThreadPool::TPHandle&)+0xad2) [0x55a401f4f482]
>  7: (ReplicatedPG::update_range(PG::BackfillInterval*,
> ThreadPool::TPHandle&)+0x614) [0x55a401f4fac4]
>  8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&,
> bool*)+0x337) [0x55a401f6fc87]
>  9: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&,
> int*)+0x8a0) [0x55a401fa1160]
>  10: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x355) [0x55a401e31555]
>  11: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0xd)
> [0x55a401e7a0dd]
>  12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a40245e18e]
>  13: (ThreadPool::WorkThread::entry()+0x10) [0x55a40245f070]
>  14: 

Re: [ceph-users] Strange copy errors in osd log

2016-09-01 Thread Samuel Just
If it's bluestore, this is pretty likely to be a bluestore bug.  If
you are interested in experimenting with bluestore, you probably want
to watch developements on the master branch, it's undergoing a bunch
of changes right now.
-Sam

On Thu, Sep 1, 2016 at 1:54 PM, Виталий Филиппов  wrote:
> Hi! I'm playing with a test setup of ceph jewel with bluestore and cephfs
> over erasure-coded pool with replicated pool as a cache tier. After writing
> some number of small files to cephfs I begin seeing the following error
> messages during the migration of data from cache to EC pool:
>
> 2016-09-01 10:19:27.364710 7f37c1a09700 -1 osd.0 pg_epoch: 329 pg[6.2cs0( v
> 329'388 (0'0,329'388] local-les=315 n=326 ec=279 les/c/f 315/315/0
> 314/314/314) [0,1,2] r=0 lpr=314 crt=329'387 lcod 329'387 mlcod 329'387
> active+clean] process_copy_chunk data digest 0x648fd38c != source 0x40203b61
> 2016-09-01 10:19:27.364742 7f37c1a09700 -1 log_channel(cluster) log [ERR] :
> 6.2cs0 copy from 8:372dc315:::200.002b:head to
> 6:372dc315:::200.002b:head data digest 0x648fd38c != source 0x40203b61
>
> These messages then repeat infinitely for the same set of objects with some
> interval. I'm not sure - does this mean some objects are corrupted in OSDs?
> (how to check?) Is it a bug at all?
>
> P.S: I've also reported this as an issue:
> http://tracker.ceph.com/issues/17194 (not sure if it was right to do :))
>
> --
> With best regards,
>   Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling pgs not making progress

2016-08-11 Thread Samuel Just
I just updated the bug with several questions.
-Sam

On Thu, Aug 11, 2016 at 6:56 AM, Brian Felton <bjfel...@gmail.com> wrote:
> Sam,
>
> I very much appreciate the assistance.  I have opened
> http://tracker.ceph.com/issues/16997 to track this (potential) issue.
>
> Brian
>
> On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Ok, can you
>> 1) Open a bug
>> 2) Identify all osds involved in the 5 problem pgs
>> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of
>> them
>> 4) mark the primary for each pg down (should cause peering and
>> backfill to restart)
>> 5) link all logs to the bug
>>
>> Thanks!
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just <sj...@redhat.com> wrote:
>> > Hmm, nvm, it's not an lfn object anyway.
>> > -Sam
>> >
>> > On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton <bjfel...@gmail.com>
>> > wrote:
>> >> If I search on osd.580, I find
>> >>
>> >> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon
>> >> picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct.
>> >> 1\sDSC04329.JPG__head_981926C1__21__5, which has a
>> >> non-zero
>> >> size and a hash (981926C1) that matches that of the same file found on
>> >> the
>> >> other OSDs in the pg.
>> >>
>> >> If I'm misunderstanding what you're asking about a dangling link,
>> >> please
>> >> point me in the right direction.
>> >>
>> >> Brian
>> >>
>> >> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> Did you also confirm that the backfill target does not have any of
>> >>> those dangling links?  I'd be looking for a dangling link for
>> >>>
>> >>>
>> >>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron
>> >>> picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>> >>> on osd.580.
>> >>> -Sam
>> >>>
>> >>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfel...@gmail.com>
>> >>> wrote:
>> >>> > Sam,
>> >>> >
>> >>> > I cranked up the logging on the backfill target (osd 580 on node 07)
>> >>> > and
>> >>> > the
>> >>> > acting primary for the pg (453 on node 08, for what it's worth).
>> >>> > The
>> >>> > logs
>> >>> > from the primary are very large, so pardon the tarballs.
>> >>> >
>> >>> > PG Primary Logs:
>> >>> >
>> >>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>> >>> > PG Backfill Target Logs:
>> >>> >
>> >>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>> >>> >
>> >>> > I'll be reviewing them with my team tomorrow morning to see if we
>> >>> > can
>> >>> > find
>> >>> > anything.  Thanks for your assistance.
>> >>> >
>> >>> > Brian
>> >>> >
>> >>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sj...@redhat.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> The next thing I'd want is for you to reproduce with
>> >>> >>
>> >>> >> debug osd = 20
>> >>> >> debug filestore = 20
>> >>> >> debug ms = 1
>> >>> >>
>> >>> >> and post the file somewhere.
>> >>> >> -Sam
>> >>> >>
>> >>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sj...@redhat.com>
>> >>> >> wrote:
>> >>> >> > If you don't have the orphaned file link, it's not the same bug.
>> >>> >> > -Sam
>> >>> >> >
>> >>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton
>> >>> >> > <bjfel...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >> Sam,
>> >>> >> >>
>> &g

Re: [ceph-users] Backfilling pgs not making progress

2016-08-10 Thread Samuel Just
Ok, can you
1) Open a bug
2) Identify all osds involved in the 5 problem pgs
3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
4) mark the primary for each pg down (should cause peering and
backfill to restart)
5) link all logs to the bug

Thanks!
-Sam

On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just <sj...@redhat.com> wrote:
> Hmm, nvm, it's not an lfn object anyway.
> -Sam
>
> On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton <bjfel...@gmail.com> wrote:
>> If I search on osd.580, I find
>> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon
>> picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct.
>> 1\sDSC04329.JPG__head_981926C1__21__5, which has a non-zero
>> size and a hash (981926C1) that matches that of the same file found on the
>> other OSDs in the pg.
>>
>> If I'm misunderstanding what you're asking about a dangling link, please
>> point me in the right direction.
>>
>> Brian
>>
>> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Did you also confirm that the backfill target does not have any of
>>> those dangling links?  I'd be looking for a dangling link for
>>>
>>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron
>>> picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>>> on osd.580.
>>> -Sam
>>>
>>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfel...@gmail.com> wrote:
>>> > Sam,
>>> >
>>> > I cranked up the logging on the backfill target (osd 580 on node 07) and
>>> > the
>>> > acting primary for the pg (453 on node 08, for what it's worth).  The
>>> > logs
>>> > from the primary are very large, so pardon the tarballs.
>>> >
>>> > PG Primary Logs:
>>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>>> > PG Backfill Target Logs:
>>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>>> >
>>> > I'll be reviewing them with my team tomorrow morning to see if we can
>>> > find
>>> > anything.  Thanks for your assistance.
>>> >
>>> > Brian
>>> >
>>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> The next thing I'd want is for you to reproduce with
>>> >>
>>> >> debug osd = 20
>>> >> debug filestore = 20
>>> >> debug ms = 1
>>> >>
>>> >> and post the file somewhere.
>>> >> -Sam
>>> >>
>>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >> > If you don't have the orphaned file link, it's not the same bug.
>>> >> > -Sam
>>> >> >
>>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfel...@gmail.com>
>>> >> > wrote:
>>> >> >> Sam,
>>> >> >>
>>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap
>>> >> >> with
>>> >> >> my
>>> >> >> cluster's situation.  For one, I am unable to start either a repair
>>> >> >> or
>>> >> >> a
>>> >> >> deep scrub on any of the affected pgs.  I've instructed all six of
>>> >> >> the
>>> >> >> pgs
>>> >> >> to scrub, deep-scrub, and repair, and the cluster has been gleefully
>>> >> >> ignoring these requests (it has been several hours since I first
>>> >> >> tried,
>>> >> >> and
>>> >> >> the logs indicate none of the pgs ever scrubbed).  Second, none of
>>> >> >> the
>>> >> >> my
>>> >> >> OSDs is crashing.  Third, none of my pgs or objects has ever been
>>> >> >> marked
>>> >> >> inconsistent (or unfound, for that matter) -- I'm only seeing the
>>> >> >> standard
>>> >> >> mix of degraded/misplaced objects that are common during a recovery.
>>> >> >> What
>>> >> >> I'm not seeing is any further progress on the number of misplaced
>>> >&

Re: [ceph-users] [Ceph-community] Noobie question about OSD fail

2016-07-27 Thread Samuel Just
osd min down reports = 2

Set that to 1?
-Sam

On Wed, Jul 27, 2016 at 10:24 AM, Patrick McGarry  wrote:
> Moving this to ceph-user.
>
>
> On Wed, Jul 27, 2016 at 8:36 AM, Kostya Velychkovsky
>  wrote:
>> Hello. I have test CEPH cluster with 5 nodes:  3 MON and 2 OSD
>>
>> This is my ceph.conf
>>
>> [global]
>> fsid = 714da611-2c40-4930-b5b9-d57e70d5cf7e
>> mon_initial_members = node1
>> mon_host = node1,node3,node4
>>
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> osd_pool_default_size = 2
>> public_network = X.X.X.X/24
>>
>> [mon]
>> osd report timeout = 15
>> osd min down reports = 2
>>
>> [osd]
>> mon report interval max = 30
>> mon heartbeat interval = 15
>>
>>
>> So, while I run some fail tests and hard reset one OSD node, I have long
>> timeout while ceph mark this OSD down, ~15 minutes
>>
>> and ceph -s display that cluster OK.
>> ---
>> cluster 714da611-2c40-4930-b5b9-d57e70d5cf7e
>>  health HEALTH_OK
>>  monmap e5: 3 mons at 
>> election epoch 272, quorum 0,1,2 node1,node3,node4
>>  osdmap e90: 2 osds: 2 up, 2 in
>> ---
>> Only after ~15 minutes mon nodes Mark this OSD down, and change state of
>> cluster
>> 
>>  osdmap e86: 2 osds: 1 up, 2 in; 64 remapped pgs
>> flags sortbitwise
>>   pgmap v3927: 64 pgs, 1 pools, 10961 MB data, 2752 objects
>> 22039 MB used, 168 GB / 189 GB avail
>> 2752/5504 objects degraded (50.000%)
>>   64 active+undersized+degraded
>> ---
>>
>> I tried to ajust 'osd report timeout'  but have the same result.
>>
>> Can you pls help me tune my cluster to decrease this reaction time ?
>>
>> --
>> Best Regards
>>
>> Kostiantyn Velychkovsky
>>
>> ___
>> Ceph-community mailing list
>> ceph-commun...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>>
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Listing objects in a specified placement group / OSD

2016-07-27 Thread Samuel Just
Well, it's kind of deliberately obfuscated because PGs aren't a
librados-level abstraction.  Why do you want to list the objects in a
PG?
-Sam

On Wed, Jul 27, 2016 at 8:10 AM, David Blundell
 wrote:
> Hi,
>
>
>
> I wasn’t sure if this is a ceph-users or ceph-devel question as it’s about
> the API (users) but the answer may involve me writing a RADOS method
> (devel).
>
>
>
> At the moment in Ceph Jewel I can find which objects are held in an OSD or
> placement group by looking on the filesystem under
> /var/lib/ceph/osd/ceph-*/current
>
>
>
> This requires access to the OSD host and may well break when using Bluestore
> if there is no filesystem to look through.  I would like to be able to list
> objects in a specified PG/OSD from outside of the OSD host using Ceph
> commands.
>
>
>
> I can list all PGs hosted on OSD 1 using “ceph pg ls-by-osd osd.1” and could
> loop through this output if there was a way to list the objects in a PG.
>
>
>
> I have checked the API and librados docs (I would be happy to hack something
> together using librados) and can’t see any obvious way to list the objects
> in a PG.
>
>
>
> I have seen a post on this mailing list from Ilya last September saying:
>
> “Internally there is a way to list objects within a specific PG (actually
> more than one way IIRC), but I don't think anything like that is exposed in
> a CLI (it might be exposed in librados though).”
>
>
>
> but could not find any follow up posts with details.
>
>
>
> Does anyone have any more details on these internal methods and how to call
> them?
>
>
>
> Cheers,
>
>
>
> David
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to get Active set of OSD Map in serial order of osd index

2016-07-27 Thread Samuel Just
Think of the osd numbers as names.  The plugin interface doesn't even
tell you which shard maps to which osd.  Why would it make a
difference?
-Sam

On Wed, Jul 27, 2016 at 12:45 AM, Syed Hussain <syed...@gmail.com> wrote:
> Fundamentally, I wanted to know what chunks are allocated in which OSDs.
> This way I can preserve the array structure required for my
> Erasure Code. If all the chunks are placed in randomly ordered OSDs (like in
> Jerasure or ISA) then I loss that array structure required in the
> Encoding/Decoding algorithm of my Plugin.
> I'm trying to develop an Erasure Code plugin for RDP (or RAID-DP) kind of
> code.
>
> Thanks,
> Syed
>
> On Wed, Jul 27, 2016 at 4:12 AM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Why do you want them in serial increasing order?
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 2:43 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> How would such a code work if there were more than 24 osds?
>>> -Sam
>>>
>>> On Tue, Jul 26, 2016 at 2:37 PM, Syed Hussain <syed...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working to develop an Erasure Code plugin (variation of ISA) that
>>>> have typical requirement that the active set of the Erasure Coded pool in
>>>> serial order.
>>>> For example,
>>>>
>>>> 
>>>> >ceph osd erasure-code-profile set reed_k16m8_isa k=16 m=8 plugin=isa
>>>> > technique=reed_sol_van ruleset-failure-domain=osd
>>>> >ceph osd pool create reed_k16m8_isa_pool 128 128 erasure reed_k16m8_isa
>>>> >echo "ABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHI" | rados
>>>> > --pool reed_k16m8_isa_pool put myobj16_8 -
>>>> >ceph osd map reed_k16m8_isa_pool myobj16_8
>>>> osdmap e86 pool 'reed_k16m8_isa_pool' (1) object 'myobj16_8' -> pg
>>>> 1.cf6ec86f (1.6f) -> up
>>>> ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4) 
>>>> acting
>>>> ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>>>>
>>>> 
>>>>
>>>> That means the chunks 0, 1, 2, ...23 of the erasure coding are saved int
>>>> osd 4, 23, 22, 10, ...2 respectively as per the order given in the active
>>>> set.
>>>>
>>>> Now my question is how I'll be able to get the PG map for object
>>>> myobj16_8 having active set as: [0, 1, 2, ...23] so that the i-th chunk of
>>>> the Erasure Coded object saves into
>>>> i-th osd.
>>>>
>>>> Is there any option available in "ceph osd pool create" to do it?
>>>> Or there may be other way available to accomplish this case.
>>>>
>>>> Appreciate your suggestions..
>>>>
>>>> Thanks,
>>>> Syed Hussain
>>>> NetWorld
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to get Active set of OSD Map in serial order of osd index

2016-07-26 Thread Samuel Just
Why do you want them in serial increasing order?
-Sam

On Tue, Jul 26, 2016 at 2:43 PM, Samuel Just <sj...@redhat.com> wrote:

> How would such a code work if there were more than 24 osds?
> -Sam
>
> On Tue, Jul 26, 2016 at 2:37 PM, Syed Hussain <syed...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm working to develop an Erasure Code plugin (variation of ISA) that
>> have typical requirement that the active set of the Erasure Coded pool in
>> serial order.
>> For example,
>>
>> 
>> >ceph osd erasure-code-profile set reed_k16m8_isa k=16 m=8 plugin=isa
>> technique=reed_sol_van ruleset-failure-domain=osd
>> >ceph osd pool create reed_k16m8_isa_pool 128 128 erasure reed_k16m8_isa
>> >echo "ABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHI" | rados
>> --pool reed_k16m8_isa_pool put myobj16_8 -
>> >ceph osd map reed_k16m8_isa_pool myobj16_8
>> osdmap e86 pool 'reed_k16m8_isa_pool' (1) object 'myobj16_8' -> pg
>> 1.cf6ec86f (1.6f) -> up
>> ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>> acting ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>>
>> 
>>
>> That means the chunks 0, 1, 2, ...23 of the erasure coding are saved int
>> osd 4, 23, 22, 10, ...2 respectively as per the order given in the active
>> set.
>>
>> Now my question is how I'll be able to get the PG map for object
>> myobj16_8 having active set as: [0, 1, 2, ...23] so that the i-th chunk of
>> the Erasure Coded object saves into
>> i-th osd.
>>
>> Is there any option available in "ceph osd pool create" to do it?
>> Or there may be other way available to accomplish this case.
>>
>> Appreciate your suggestions..
>>
>> Thanks,
>> Syed Hussain
>> NetWorld
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to get Active set of OSD Map in serial order of osd index

2016-07-26 Thread Samuel Just
How would such a code work if there were more than 24 osds?
-Sam

On Tue, Jul 26, 2016 at 2:37 PM, Syed Hussain  wrote:

> Hi,
>
> I'm working to develop an Erasure Code plugin (variation of ISA) that have
> typical requirement that the active set of the Erasure Coded pool in serial
> order.
> For example,
>
> 
> >ceph osd erasure-code-profile set reed_k16m8_isa k=16 m=8 plugin=isa
> technique=reed_sol_van ruleset-failure-domain=osd
> >ceph osd pool create reed_k16m8_isa_pool 128 128 erasure reed_k16m8_isa
> >echo "ABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHI" | rados
> --pool reed_k16m8_isa_pool put myobj16_8 -
> >ceph osd map reed_k16m8_isa_pool myobj16_8
> osdmap e86 pool 'reed_k16m8_isa_pool' (1) object 'myobj16_8' -> pg
> 1.cf6ec86f (1.6f) -> up
> ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
> acting ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>
> 
>
> That means the chunks 0, 1, 2, ...23 of the erasure coding are saved int
> osd 4, 23, 22, 10, ...2 respectively as per the order given in the active
> set.
>
> Now my question is how I'll be able to get the PG map for object myobj16_8
> having active set as: [0, 1, 2, ...23] so that the i-th chunk of the
> Erasure Coded object saves into
> i-th osd.
>
> Is there any option available in "ceph osd pool create" to do it?
> Or there may be other way available to accomplish this case.
>
> Appreciate your suggestions..
>
> Thanks,
> Syed Hussain
> NetWorld
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling pgs not making progress

2016-07-26 Thread Samuel Just
Hmm, nvm, it's not an lfn object anyway.
-Sam

On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton <bjfel...@gmail.com> wrote:
> If I search on osd.580, I find
> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon
> picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct.
> 1\sDSC04329.JPG__head_981926C1__21__5, which has a non-zero
> size and a hash (981926C1) that matches that of the same file found on the
> other OSDs in the pg.
>
> If I'm misunderstanding what you're asking about a dangling link, please
> point me in the right direction.
>
> Brian
>
> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Did you also confirm that the backfill target does not have any of
>> those dangling links?  I'd be looking for a dangling link for
>>
>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron
>> picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>> on osd.580.
>> -Sam
>>
>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfel...@gmail.com> wrote:
>> > Sam,
>> >
>> > I cranked up the logging on the backfill target (osd 580 on node 07) and
>> > the
>> > acting primary for the pg (453 on node 08, for what it's worth).  The
>> > logs
>> > from the primary are very large, so pardon the tarballs.
>> >
>> > PG Primary Logs:
>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>> > PG Backfill Target Logs:
>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>> >
>> > I'll be reviewing them with my team tomorrow morning to see if we can
>> > find
>> > anything.  Thanks for your assistance.
>> >
>> > Brian
>> >
>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> The next thing I'd want is for you to reproduce with
>> >>
>> >> debug osd = 20
>> >> debug filestore = 20
>> >> debug ms = 1
>> >>
>> >> and post the file somewhere.
>> >> -Sam
>> >>
>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sj...@redhat.com> wrote:
>> >> > If you don't have the orphaned file link, it's not the same bug.
>> >> > -Sam
>> >> >
>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfel...@gmail.com>
>> >> > wrote:
>> >> >> Sam,
>> >> >>
>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap
>> >> >> with
>> >> >> my
>> >> >> cluster's situation.  For one, I am unable to start either a repair
>> >> >> or
>> >> >> a
>> >> >> deep scrub on any of the affected pgs.  I've instructed all six of
>> >> >> the
>> >> >> pgs
>> >> >> to scrub, deep-scrub, and repair, and the cluster has been gleefully
>> >> >> ignoring these requests (it has been several hours since I first
>> >> >> tried,
>> >> >> and
>> >> >> the logs indicate none of the pgs ever scrubbed).  Second, none of
>> >> >> the
>> >> >> my
>> >> >> OSDs is crashing.  Third, none of my pgs or objects has ever been
>> >> >> marked
>> >> >> inconsistent (or unfound, for that matter) -- I'm only seeing the
>> >> >> standard
>> >> >> mix of degraded/misplaced objects that are common during a recovery.
>> >> >> What
>> >> >> I'm not seeing is any further progress on the number of misplaced
>> >> >> objects --
>> >> >> the number has remained effectively unchanged for the past several
>> >> >> days.
>> >> >>
>> >> >> To be sure, though, I tracked down the file that the backfill
>> >> >> operation
>> >> >> seems to be hung on, and I can find it in both the backfill target
>> >> >> osd
>> >> >> (580)
>> >> >> and a few other osds in the pg.  In all cases, I was able to find
>> >> >> the
>> >> >> file
>> >> >> with an identical hash value on all nodes, and I didn't find any
>> 

Re: [ceph-users] Backfilling pgs not making progress

2016-07-26 Thread Samuel Just
Did you also confirm that the backfill target does not have any of
those dangling links?  I'd be looking for a dangling link for
981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron
picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
on osd.580.
-Sam

On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton <bjfel...@gmail.com> wrote:
> Sam,
>
> I cranked up the logging on the backfill target (osd 580 on node 07) and the
> acting primary for the pg (453 on node 08, for what it's worth).  The logs
> from the primary are very large, so pardon the tarballs.
>
> PG Primary Logs:
> https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
> PG Backfill Target Logs:
> https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>
> I'll be reviewing them with my team tomorrow morning to see if we can find
> anything.  Thanks for your assistance.
>
> Brian
>
> On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> The next thing I'd want is for you to reproduce with
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>>
>> and post the file somewhere.
>> -Sam
>>
>> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sj...@redhat.com> wrote:
>> > If you don't have the orphaned file link, it's not the same bug.
>> > -Sam
>> >
>> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfel...@gmail.com>
>> > wrote:
>> >> Sam,
>> >>
>> >> I'm reviewing that thread now, but I'm not seeing a lot of overlap with
>> >> my
>> >> cluster's situation.  For one, I am unable to start either a repair or
>> >> a
>> >> deep scrub on any of the affected pgs.  I've instructed all six of the
>> >> pgs
>> >> to scrub, deep-scrub, and repair, and the cluster has been gleefully
>> >> ignoring these requests (it has been several hours since I first tried,
>> >> and
>> >> the logs indicate none of the pgs ever scrubbed).  Second, none of the
>> >> my
>> >> OSDs is crashing.  Third, none of my pgs or objects has ever been
>> >> marked
>> >> inconsistent (or unfound, for that matter) -- I'm only seeing the
>> >> standard
>> >> mix of degraded/misplaced objects that are common during a recovery.
>> >> What
>> >> I'm not seeing is any further progress on the number of misplaced
>> >> objects --
>> >> the number has remained effectively unchanged for the past several
>> >> days.
>> >>
>> >> To be sure, though, I tracked down the file that the backfill operation
>> >> seems to be hung on, and I can find it in both the backfill target osd
>> >> (580)
>> >> and a few other osds in the pg.  In all cases, I was able to find the
>> >> file
>> >> with an identical hash value on all nodes, and I didn't find any
>> >> duplicates
>> >> or potential orphans.  Also, none of the objects involves have long
>> >> names,
>> >> so they're not using the special ceph long filename handling.
>> >>
>> >> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>> >>
>> >> If I'm misunderstanding the issue linked above and the corresponding
>> >> thread,
>> >> please let me know.
>> >>
>> >> Brian
>> >>
>> >>
>> >> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> You may have hit http://tracker.ceph.com/issues/14766.  There was a
>> >>> thread on the list a while back about diagnosing and fixing it.
>> >>> -Sam
>> >>>
>> >>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfel...@gmail.com>
>> >>> wrote:
>> >>> > Greetings,
>> >>> >
>> >>> > Problem: After removing (out + crush remove + auth del + osd rm)
>> >>> > three
>> >>> > osds
>> >>> > on a single host, I have six pgs that, after 10 days of recovery,
>> >>> > are
>> >>> > stuck
>> >>> > in a state of active+undersized+degraded+remapped+backfilling.
>> >>> >
>> >>> > Cluster details:
>> >>> >  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives
>> >>> &g

Re: [ceph-users] Backfilling pgs not making progress

2016-07-25 Thread Samuel Just
If you don't have the orphaned file link, it's not the same bug.
-Sam

On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfel...@gmail.com> wrote:
> Sam,
>
> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my
> cluster's situation.  For one, I am unable to start either a repair or a
> deep scrub on any of the affected pgs.  I've instructed all six of the pgs
> to scrub, deep-scrub, and repair, and the cluster has been gleefully
> ignoring these requests (it has been several hours since I first tried, and
> the logs indicate none of the pgs ever scrubbed).  Second, none of the my
> OSDs is crashing.  Third, none of my pgs or objects has ever been marked
> inconsistent (or unfound, for that matter) -- I'm only seeing the standard
> mix of degraded/misplaced objects that are common during a recovery.  What
> I'm not seeing is any further progress on the number of misplaced objects --
> the number has remained effectively unchanged for the past several days.
>
> To be sure, though, I tracked down the file that the backfill operation
> seems to be hung on, and I can find it in both the backfill target osd (580)
> and a few other osds in the pg.  In all cases, I was able to find the file
> with an identical hash value on all nodes, and I didn't find any duplicates
> or potential orphans.  Also, none of the objects involves have long names,
> so they're not using the special ceph long filename handling.
>
> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>
> If I'm misunderstanding the issue linked above and the corresponding thread,
> please let me know.
>
> Brian
>
>
> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> You may have hit http://tracker.ceph.com/issues/14766.  There was a
>> thread on the list a while back about diagnosing and fixing it.
>> -Sam
>>
>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfel...@gmail.com> wrote:
>> > Greetings,
>> >
>> > Problem: After removing (out + crush remove + auth del + osd rm) three
>> > osds
>> > on a single host, I have six pgs that, after 10 days of recovery, are
>> > stuck
>> > in a state of active+undersized+degraded+remapped+backfilling.
>> >
>> > Cluster details:
>> >  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per
>> > host,
>> > collocated journals) -- one host now has 69 drives
>> >  - Hammer 0.94.6
>> >  - object storage use only
>> >  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>> >  - failure domain of host
>> >  - cluster is currently storing 178TB over 260 MObjects (5-6%
>> > utilization
>> > per OSD)
>> >  - all 6 stuck pgs belong to .rgw.buckets
>> >
>> > The relevant section of our crushmap:
>> >
>> > rule .rgw.buckets {
>> > ruleset 1
>> > type erasure
>> > min_size 7
>> > max_size 9
>> > step set_chooseleaf_tries 5
>> > step set_choose_tries 250
>> > step take default
>> > step chooseleaf indep 0 type host
>> > step emit
>> > }
>> >
>> > This isn't the first time we've lost a disk (not even the first time
>> > we've
>> > lost multiple disks on a host in a single event), so we're used to the
>> > extended recovery times and understand this is going to be A Thing until
>> > we
>> > can introduce SSD journals.  This is, however, the first time we've had
>> > pgs
>> > not return to an active+clean state after a couple days.  As far as I
>> > can
>> > tell, our cluster is no longer making progress on the backfill
>> > operations,
>> > and I'm looking for advice on how to get things moving again.
>> >
>> > Here's a dump of the stuck pgs:
>> >
>> > ceph pg dump_stuck
>> > ok
>> > pg_stat state   up  up_primary  acting  acting_primary
>> > 33.151d active+undersized+degraded+remapped+backfilling
>> > [424,546,273,167,471,631,155,38,47] 424
>> > [424,546,273,167,471,631,155,38,2147483647] 424
>> > 33.6c1  active+undersized+degraded+remapped+backfilling
>> > [453,86,565,266,338,580,297,577,404]453
>> > [453,86,565,266,338,2147483647,297,577,404] 453
>> > 33.17b7 active+undersized+degraded+remapped+backfilling
>> > [399,432,437,541,547,219,229,104,47]399
>> > [399,432,437,541,547,219,229,104,2147483647]399
>> > 33.150d active+undersized+degraded+r

Re: [ceph-users] Backfilling pgs not making progress

2016-07-25 Thread Samuel Just
The next thing I'd want is for you to reproduce with

debug osd = 20
debug filestore = 20
debug ms = 1

and post the file somewhere.
-Sam

On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just <sj...@redhat.com> wrote:
> If you don't have the orphaned file link, it's not the same bug.
> -Sam
>
> On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <bjfel...@gmail.com> wrote:
>> Sam,
>>
>> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my
>> cluster's situation.  For one, I am unable to start either a repair or a
>> deep scrub on any of the affected pgs.  I've instructed all six of the pgs
>> to scrub, deep-scrub, and repair, and the cluster has been gleefully
>> ignoring these requests (it has been several hours since I first tried, and
>> the logs indicate none of the pgs ever scrubbed).  Second, none of the my
>> OSDs is crashing.  Third, none of my pgs or objects has ever been marked
>> inconsistent (or unfound, for that matter) -- I'm only seeing the standard
>> mix of degraded/misplaced objects that are common during a recovery.  What
>> I'm not seeing is any further progress on the number of misplaced objects --
>> the number has remained effectively unchanged for the past several days.
>>
>> To be sure, though, I tracked down the file that the backfill operation
>> seems to be hung on, and I can find it in both the backfill target osd (580)
>> and a few other osds in the pg.  In all cases, I was able to find the file
>> with an identical hash value on all nodes, and I didn't find any duplicates
>> or potential orphans.  Also, none of the objects involves have long names,
>> so they're not using the special ceph long filename handling.
>>
>> Also, we are not using XFS on our OSDs; we are using ZFS instead.
>>
>> If I'm misunderstanding the issue linked above and the corresponding thread,
>> please let me know.
>>
>> Brian
>>
>>
>> On Mon, Jul 25, 2016 at 1:32 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> You may have hit http://tracker.ceph.com/issues/14766.  There was a
>>> thread on the list a while back about diagnosing and fixing it.
>>> -Sam
>>>
>>> On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton <bjfel...@gmail.com> wrote:
>>> > Greetings,
>>> >
>>> > Problem: After removing (out + crush remove + auth del + osd rm) three
>>> > osds
>>> > on a single host, I have six pgs that, after 10 days of recovery, are
>>> > stuck
>>> > in a state of active+undersized+degraded+remapped+backfilling.
>>> >
>>> > Cluster details:
>>> >  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per
>>> > host,
>>> > collocated journals) -- one host now has 69 drives
>>> >  - Hammer 0.94.6
>>> >  - object storage use only
>>> >  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>>> >  - failure domain of host
>>> >  - cluster is currently storing 178TB over 260 MObjects (5-6%
>>> > utilization
>>> > per OSD)
>>> >  - all 6 stuck pgs belong to .rgw.buckets
>>> >
>>> > The relevant section of our crushmap:
>>> >
>>> > rule .rgw.buckets {
>>> > ruleset 1
>>> > type erasure
>>> > min_size 7
>>> > max_size 9
>>> > step set_chooseleaf_tries 5
>>> > step set_choose_tries 250
>>> > step take default
>>> > step chooseleaf indep 0 type host
>>> > step emit
>>> > }
>>> >
>>> > This isn't the first time we've lost a disk (not even the first time
>>> > we've
>>> > lost multiple disks on a host in a single event), so we're used to the
>>> > extended recovery times and understand this is going to be A Thing until
>>> > we
>>> > can introduce SSD journals.  This is, however, the first time we've had
>>> > pgs
>>> > not return to an active+clean state after a couple days.  As far as I
>>> > can
>>> > tell, our cluster is no longer making progress on the backfill
>>> > operations,
>>> > and I'm looking for advice on how to get things moving again.
>>> >
>>> > Here's a dump of the stuck pgs:
>>> >
>>> > ceph pg dump_stuck
>>> > ok
>>> > pg_stat state   up  up_primary  acting  acting_primary
>&

Re: [ceph-users] Backfilling pgs not making progress

2016-07-25 Thread Samuel Just
You may have hit http://tracker.ceph.com/issues/14766.  There was a
thread on the list a while back about diagnosing and fixing it.
-Sam

On Mon, Jul 25, 2016 at 10:45 AM, Brian Felton  wrote:
> Greetings,
>
> Problem: After removing (out + crush remove + auth del + osd rm) three osds
> on a single host, I have six pgs that, after 10 days of recovery, are stuck
> in a state of active+undersized+degraded+remapped+backfilling.
>
> Cluster details:
>  - 9 hosts (32 cores, 256 GB RAM, Ubuntu 14.04, 72 6TB SAS2 drives per host,
> collocated journals) -- one host now has 69 drives
>  - Hammer 0.94.6
>  - object storage use only
>  - erasure coded (k=7, m=2) .rgw.buckets pool (8192 pgs)
>  - failure domain of host
>  - cluster is currently storing 178TB over 260 MObjects (5-6% utilization
> per OSD)
>  - all 6 stuck pgs belong to .rgw.buckets
>
> The relevant section of our crushmap:
>
> rule .rgw.buckets {
> ruleset 1
> type erasure
> min_size 7
> max_size 9
> step set_chooseleaf_tries 5
> step set_choose_tries 250
> step take default
> step chooseleaf indep 0 type host
> step emit
> }
>
> This isn't the first time we've lost a disk (not even the first time we've
> lost multiple disks on a host in a single event), so we're used to the
> extended recovery times and understand this is going to be A Thing until we
> can introduce SSD journals.  This is, however, the first time we've had pgs
> not return to an active+clean state after a couple days.  As far as I can
> tell, our cluster is no longer making progress on the backfill operations,
> and I'm looking for advice on how to get things moving again.
>
> Here's a dump of the stuck pgs:
>
> ceph pg dump_stuck
> ok
> pg_stat state   up  up_primary  acting  acting_primary
> 33.151d active+undersized+degraded+remapped+backfilling
> [424,546,273,167,471,631,155,38,47] 424
> [424,546,273,167,471,631,155,38,2147483647] 424
> 33.6c1  active+undersized+degraded+remapped+backfilling
> [453,86,565,266,338,580,297,577,404]453
> [453,86,565,266,338,2147483647,297,577,404] 453
> 33.17b7 active+undersized+degraded+remapped+backfilling
> [399,432,437,541,547,219,229,104,47]399
> [399,432,437,541,547,219,229,104,2147483647]399
> 33.150d active+undersized+degraded+remapped+backfilling
> [555,452,511,550,643,431,141,329,486]   555
> [555,2147483647,511,550,643,431,141,329,486]555
> 33.13a8 active+undersized+degraded+remapped+backfilling
> [507,317,276,617,565,28,471,200,382]507
> [507,2147483647,276,617,565,28,471,200,382] 507
> 33.4c1  active+undersized+degraded+remapped+backfilling
> [413,440,464,129,641,416,295,266,431]   413
> [413,440,2147483647,129,641,416,295,266,431]413
>
> Based on a review of previous postings about this issue, I initially
> suspected that crush couldn't map the pg to an OSD (based on MAX_INT in the
> acting list), so I increased set_choose_tries from 50 to 200, and then again
> to 250 just to see if it would do anything.  These changes had no effect
> that I could discern.
>
> I next reviewed the output of ceph pg  query, and I see something
> similar to the following for each of my stuck pgs:
>
> {
> "state": "active+undersized+degraded+remapped+backfilling",
> "snap_trimq": "[]",
> "epoch": 25211,
> "up": [
> 453,
> 86,
> 565,
> 266,
> 338,
> 580,
> 297,
> 577,
> 404
> ],
> "acting": [
> 453,
> 86,
> 565,
> 266,
> 338,
> 2147483647,
> 297,
> 577,
> 404
> ],
> "backfill_targets": [
> "580(5)"
> ],
> "actingbackfill": [
> "86(1)",
> "266(3)",
> "297(6)",
> "338(4)",
> "404(8)",
> "453(0)",
> "565(2)",
> "577(7)",
> "580(5)"
> ]
>
> In this case, 580 is a valid OSD on the node that lost the 3 OSDs (node 7).
> For the other five pgs, the situation is the same -- the backfill target is
> a valid OSD on node 7.
>
> If I dig further into the 'query' output, I encounter the following:
>
> "recovery_state": [
> {
> "name": "Started\/Primary\/Active",
> "enter_time": "2016-07-24 18:52:51.653375",
> "might_have_unfound": [],
> "recovery_progress": {
> "backfill_targets": [
> "580(5)"
> ],
> "waiting_on_backfill": [
> "580(5)"
> ],
> "last_backfill_started":
> "981926c1\/default.421929.15_MY_OBJECT",
> "backfill_info": {
> "begin":
> "391926c1\/default.9468.416_0080a34a\/head\/\/33",
> "end":
> 

Re: [ceph-users] ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (13) Permission denied

2016-07-06 Thread Samuel Just
Try strace.
-Sam

On Wed, Jul 6, 2016 at 7:53 AM, RJ Nowling  wrote:
> Hi all,
>
> I'm trying to use the ceph-ansible playbooks to deploy onto baremetal.  I'm
> currently testing with the approach that uses an existing filesystem rather
> than giving raw access to the disks.
>
> The OSDs are failing to start:
>
> 2016-07-06 10:48:50.249976 7fa410aef800  0 set uid:gid to 167:167
> (ceph:ceph)
> 2016-07-06 10:48:50.250021 7fa410aef800  0 ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 19104
> 2016-07-06 10:48:50.250221 7fa410aef800 -1  ** ERROR: unable to open OSD
> superblock on /var/lib/ceph/osd/ceph-1: (13) Permission denied
>
> The directories seem to be owned by the ceph user and group.  That said, I'm
> putting my data under /home/ceph-data (owned by ceph:ceph) which is on a
> different filesystem with /var/lib/ceph/osd/ceph-1 being a symlink (owned by
> ceph:ceph) .
>
> Any suggestions as to where the permission denied might be coming from?
>
> Thanks!
> RJ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow request, waiting for rw locks / subops from osd doing deep scrub of pg in rgw.buckets.index

2016-06-21 Thread Samuel Just
.rgw.bucket.index.pool is the pool with rgw's index objects, right?
The actual on-disk directory for one of those pgs would contain only
empty files -- the actual index data is stored in the osd's leveldb
instance.  I suspect your index objects are very large (because the
buckets contain many objects) and are taking a long time to scrub.
iirc, there is a way to make rgw split up those index objects into
smaller ones.
-Sam

On Tue, Jun 21, 2016 at 11:58 AM, Trygve Vea
 wrote:
> Hi,
>
> I believe I've stumbled on a bug in Ceph, and I'm currently trying to figure 
> out if this is a new bug, some behaviour caused by our cluster being in the 
> midst of a hammer(0.94.6)->jewel(10.2.2) upgrade, or other factors.
>
> The state of the cluster at the time of the incident:
>
> - All monitor nodes are running 10.2.2.
> - One OSD-server (4 osds) is up with 10.2.2 and with all pg's in active+clean.
> - One OSD-server (4 osds) is up with 10.2.2 and undergoing backfills 
> (however: nobackfill was set, as we try to keep backfills running during 
> night time).
>
> We have 4 OSD-servers with 4 osds each with 0.94.6.
> We have 3 OSD-servers with 2 osds each with 0.94.6.
>
>
> We experienced something that heavily affected our RGW-users.  Some requests 
> interfacing with 0.94.6 nodes were slow.
>
> During a 10 minute window, our RGW-nodes ran out of available workers and 
> ceased to respond.
>
> Some nodes logged some lines like these (only 0.94.6 nodes):
>
> 2016-06-21 09:51:08.053886 7f54610d8700  0 log_channel(cluster) log [WRN] : 2 
> slow requests, 1 included below; oldest blocked for > 74.368036 secs
> 2016-06-21 09:51:08.053951 7f54610d8700  0 log_channel(cluster) log [WRN] : 
> slow request 30.056333 seconds old, received at 2016-06-21 09:50:37.997327: 
> osd_op(client.9433496.0:1089298249 somergwuser.buckets [call 
> user.set_buckets_info] 12.da8df901 ondisk+write+known_if_redirected e9906) 
> currently waiting for rw locks
>
>
> Some nodes logged some lines like these (there were some, but not 100% 
> overlap between osds that logged these and the beforementioned lines - only 
> 0.94.6 nodes):
>
> 2016-06-21 09:51:48.677474 7f8cb6628700  0 log_channel(cluster) log [WRN] : 2 
> slow requests, 1 included below; oldest blocked for > 42.033650 secs
> 2016-06-21 09:51:48.677565 7f8cb6628700  0 log_channel(cluster) log [WRN] : 
> slow request 30.371173 seconds old, received at 2016-06-21 09:51:18.305770: 
> osd_op(client.9525441.0:764274789 gc.1164 [call lock.lock] 7.7b4f1779 
> ondisk+write+known_if_redirected e9906) currently waiting for subops from 
> 40,50
>
> All of the osds that logged these lines, were waiting for subops from osd.50
>
>
> Investigating what's going on this osd during that window:
>
> 2016-06-21 09:48:22.064630 7f1cbb41d700  0 log_channel(cluster) log [INF] : 
> 5.b5 deep-scrub starts
> 2016-06-21 09:59:56.640012 7f1c90163700  0 -- 10.21.9.22:6800/2003521 >> 
> 10.20.9.21:6805/7755 pipe(0x1e47a000 sd=298 :39448 s=2 pgs=23 cs=1 l=0 
> c=0x1033ba20).fault with nothing to send, going to standby
> 2016-06-21 09:59:56.997763 7f1c700f8700  0 -- 10.21.9.22:6808/3521 >> 
> 10.21.9.12:0/1028533 pipe(0x1f30f000 sd=87 :6808 s=0 pgs=0 cs=0 l=1 
> c=0x743c840).accept replacing existing (lossy) channel (new one lossy=1)
> 2016-06-21 10:00:39.938700 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> 33 slow requests, 33 included below; oldest blocked for > 727.862759 secs
> 2016-06-21 10:00:39.938708 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> slow request 670.918857 seconds old, received at 2016-06-21 09:49:29.019653: 
> osd_op(client.9403437.0:1209613500 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 
> 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points 
> reached
> 2016-06-21 10:00:39.938800 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> slow request 689.815851 seconds old, received at 2016-06-21 09:49:10.122660: 
> osd_op(client.9403437.0:1209611533 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 
> 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points 
> reached
> 2016-06-21 10:00:39.938807 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> slow request 670.895353 seconds old, received at 2016-06-21 09:49:29.043158: 
> osd_op(client.9403437.0:1209613505 prod.arkham [call 
> version.read,getxattrs,stat] 2.4da23de6 ack+read+known_if_redirected e9906) 
> currently no flag points reached
> 2016-06-21 10:00:39.938810 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> slow request 688.612303 seconds old, received at 2016-06-21 09:49:11.326207: 
> osd_op(client.20712623.0:137251515 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 
> 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points 
> reached
> 2016-06-21 10:00:39.938813 7f1cd9828700  0 log_channel(cluster) log [WRN] : 
> slow request 658.605163 seconds old, received at 2016-06-21 09:49:41.48: 
> osd_op(client.20712623.0:137254412 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 
> 

Re: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

2016-06-15 Thread Samuel Just
I think you hit the os process fd limit.  You need to adjust it.
-Sam

On Wed, Jun 15, 2016 at 2:07 PM, Mansour Shafaei Moghaddam
 wrote:
> It fails at "FileStore.cc: 2761". Here is a more complete log:
>
> -9> 2016-06-15 10:55:13.205014 7fa2dcd85700 -1 dump_open_fds unable to
> open /proc/self/fd
> -8> 2016-06-15 10:55:13.205085 7fa2cb402700  2
> filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 >
> 104857600
> -7> 2016-06-15 10:55:13.205094 7fa2cd406700  2
> filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328389 >
> 104857600
> -6> 2016-06-15 10:55:13.205111 7fa2cac01700  2
> filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328317 >
> 104857600
> -5> 2016-06-15 10:55:13.205118 7fa2ca400700  2
> filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 >
> 104857600
> -4> 2016-06-15 10:55:13.205121 7fa2cdc07700  2
> filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 >
> 104857600
> -3> 2016-06-15 10:55:13.205153 7fa2de588700  5 -- op tracker -- seq:
> 1476, time: 2016-06-15 10:55:13.205153, event: journaled_completion_queued,
> op: osd_op(client.4109.0:1457 rb.0.100a.6b8b4567.6b6c
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1884160~4096]
> 0.cbe1d8a4 ack+ondisk+write e9)
> -2> 2016-06-15 10:55:13.205183 7fa2de588700  5 -- op tracker -- seq:
> 1483, time: 2016-06-15 10:55:13.205183, event:
> write_thread_in_journal_buffer, op: osd_op(client.4109.0:1464
> rb.0.100a.6b8b4567.524d [set-alloc-hint object_size 4194304
> write_size 4194304,write 3051520~4096] 0.6778c255 ack+ondisk+write e9)
> -1> 2016-06-15 10:55:13.205400 7fa2de588700  5 -- op tracker -- seq:
> 1483, time: 2016-06-15 10:55:13.205400, event: journaled_completion_queued,
> op: osd_op(client.4109.0:1464 rb.0.100a.6b8b4567.524d
> [set-alloc-hint object_size 4194304 write_size 4194304,write 3051520~4096]
> 0.6778c255 ack+ondisk+write e9)
>  0> 2016-06-15 10:55:13.206559 7fa2dcd85700 -1 os/FileStore.cc: In
> function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&,
> uint64_t, int, ThreadPool::TPHandle*)' thread 7fa2dcd85700 time 2016-06-15
> 10:55:13.205018
> os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")
>
>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x78) [0xacd718]
>  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long,
> int, ThreadPool::TPHandle*)+0xa24) [0x8b8114]
>  3: (FileStore::_do_transactions(std::list std::allocator >&, unsigned long,
> ThreadPool::TPHandle*)+0x64) [0x8bcf34]
>  4: (FileStore::_do_op(FileStore::OpSequencer*,
> ThreadPool::TPHandle&)+0x17e) [0x8bd0ce]
>  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xabe326]
>  6: (ThreadPool::WorkThread::entry()+0x10) [0xabf3d0]
>  7: (()+0x7dc5) [0x7fa2e88f3dc5]
>  8: (clone()+0x6d) [0x7fa2e73d528d]
>
>
> On Wed, Jun 15, 2016 at 2:05 PM, Somnath Roy 
> wrote:
>>
>> There should be a line in the log specifying which assert is failing ,
>> post that along with say 10 lines from top of that..
>>
>>
>>
>> Thanks & Regards
>>
>> Somnath
>>
>>
>>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Mansour Shafaei Moghaddam
>> Sent: Wednesday, June 15, 2016 1:57 PM
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Fio randwrite does not work on Centos 7.2 VM
>>
>>
>>
>> Hi All,
>>
>>
>>
>> Has anyone faced a similar issue? I do not have a problem with random
>> read, sequential read, and sequential writes though. Everytime I try running
>> fio for random writes, one osd in the cluster crashes. Here is the what I
>> see at the tail of the log:
>>
>>
>>
>>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>
>>  1: ceph-osd() [0x9d6334]
>>
>>  2: (()+0xf100) [0x7fa2e88fb100]
>>
>>  3: (gsignal()+0x37) [0x7fa2e73145f7]
>>
>>  4: (abort()+0x148) [0x7fa2e7315ce8]
>>
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fa2e7c189d5]
>>
>>  6: (()+0x5e946) [0x7fa2e7c16946]
>>
>>  7: (()+0x5e973) [0x7fa2e7c16973]
>>
>>  8: (()+0x5eb93) [0x7fa2e7c16b93]
>>
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x24a) [0xacd8ea]
>>
>>  10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long,
>> int, ThreadPool::TPHandle*)+0xa24) [0x8b8114]
>>
>>  11: (FileStore::_do_transactions(std::list> std::allocator >&, unsigned long,
>> ThreadPool::TPHandle*)+0x64) [0x8bcf34]
>>
>>  12: (FileStore::_do_op(FileStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0x17e) [0x8bd0ce]
>>
>>  13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xabe326]
>>
>>  14: (ThreadPool::WorkThread::entry()+0x10) [0xabf3d0]
>>
>>  15: (()+0x7dc5) [0x7fa2e88f3dc5]
>>
>>  16: 

Re: [ceph-users] strange cache tier behaviour with cephfs

2016-06-13 Thread Samuel Just
I'd have to look more closely, but these days promotion is
probabilistic and throttled.  During each read of those objects, it
will tend to promote a few more of them depending on how many
promotions are in progress and how hot it thinks a particular object
is.  The lack of a speed up is a bummer, but I guess you aren't
limited by the disk throughput here for some reason.  Writes can also
be passed directly to the backing tier depending on similar factors.

It's usually helpful to include the version you are running.
-Sam

On Mon, Jun 13, 2016 at 3:37 PM, Oliver Dzombic  wrote:
> Hi,
>
> i am for sure not really experienced yet with ceph or with cache tier,
> but to me it seems to behave strange.
>
> Setup:
>
> pool 3 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 1
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 190 flags
> hashpspool,incomplete_clones tier_of 4 cache_mode writeback target_bytes
> 8000 hit_set bloom{false_positive_probability: 0.05,
> target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0
> stripe_width 0
>
> pool 4 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 2
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 169 lfor 144
> flags hashpspool crash_replay_interval 45 tiers 3 read_tier 3 write_tier
> 3 stripe_width 0
>
> pool 5 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 1
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 191 flags
> hashpspool stripe_width 0
>
> hit_set_count: 1
> hit_set_period: 120
> target_max_bytes: 8000
> min_read_recency_for_promote: 0
> min_write_recency_for_promote: 0
> target_max_objects: 0
> cache_target_dirty_ratio: 0.5
> cache_target_dirty_high_ratio: 0.8
> cache_target_full_ratio: 0.9
> cache_min_flush_age: 1800
> cache_min_evict_age: 3600
>
> rule ssd-cache-rule {
> ruleset 1
> type replicated
> min_size 2
> max_size 10
> step take ssd-cache
> step chooseleaf firstn 0 type host
> step emit
> }
>
>
> rule cold-storage-rule {
> ruleset 2
> type replicated
> min_size 2
> max_size 10
> step take cold-storage
> step chooseleaf firstn 0 type host
> step emit
> }
>
>
>
> [root@cephmon1 ceph-cluster-gen2]# rados -p ssd_cache ls
> [root@cephmon1 ceph-cluster-gen2]#
> -> empty
>
> Now, on a cephfs mounted client i have files.
>
> Read operation:
>
> dd if=testfile of=/dev/zero
>
> 1494286336 bytes (1.5 GB) copied, 11.047 s, 135 MB/s
>
>
> [root@cephosd1 ~]# rados -p ssd_cache ls
> 11e.0010
> 11e.0004
> 11e.0001
> 11e.000c
> 11e.0008
> 11e.0003
> 11e.
> 11e.0002
>
> Running this multiple times after one another, does not change the
> content. Its always the same objects.
>
> -
>
> Ok, so according to the documents, writeback mode, it moved from cold
> storeage to hot storage ( cephfs_data to ssd_cache in my case ).
>
>
> Now i repeat it:
>
> dd if=testfile of=/dev/zero
>
> 1494286336 bytes (1.5 GB) copied, 11.311 s, 132 MB/s
>
>
> [root@cephosd1 ~]# rados -p ssd_cache ls
> 11e.0010
> 11e.0004
> 11e.0001
> 11e.000c
> 11e.000d
> 11e.0005
> 11e.0008
> 11e.0015
> 11e.0011
> 11e.0006
> 11e.0003
> 11e.0009
> 11e.
> 11e.000a
> 11e.001b
> 11e.0002
>
>
> So why are there now the old objects ( 8 ) plus another 8 objects ?
>
> Repeating this, will extend the numbers of objects endless without
> speeding up the dd. in the ssd_cache.
>
> So every new dd read, of exact the same file ( to me that means, same
> PGs/objects ) the (same) data is copied from cold pool to cache pool.
>
> And from there pushed to the client ( without any speed gain ).
>
> And thats not supposed to happen ( according to the documentation with
> writeback cache mode ).
>
> Similar happens when i am writing.
>
> If i write, it will store the data on cold pool and cache pool equally.
>
> For my understanding, with my configuration, at least 1800 seconds (
> cache_min_flush_age ) should pass by before the agent starts to flush
> from the cache pool to the cold pool.
>
> But it does not.
>
> So, is there something specific with cephfs, or is my config just too
> much crappy and i have no idea what i am doing here ?
>
> Anything is highly welcome !
>
> Thank you !
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> 

Re: [ceph-users] Upgrade Errors.

2016-06-06 Thread Samuel Just
Blargh, sounds like we need a better error message there, thanks!
-Sam

On Mon, Jun 6, 2016 at 12:16 PM, Tu Holmes <tu.hol...@gmail.com> wrote:
> It was a permission issue. While I followed the process and haven't changed
> any data. The "current map" files on each OSD were still listed as owner
> root as they were created while the older ceph processes were still running.
>
> Changing that after the fact was still a necessity and I will make sure that
> those are also properly changed.
>
> On Mon, Jun 6, 2016 at 12:12 PM Samuel Just <sj...@redhat.com> wrote:
>>
>> Oh, what was the problem (for posterity)?
>> -Sam
>>
>> On Mon, Jun 6, 2016 at 12:11 PM, Tu Holmes <tu.hol...@gmail.com> wrote:
>> > It totally did and I see what the problem is.
>> >
>> > Thanks for your input. I truly appreciate it.
>> >
>> >
>> > On Mon, Jun 6, 2016 at 12:01 PM Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> If you reproduce with
>> >>
>> >> debug osd = 20
>> >> debug filestore = 20
>> >> debug ms = 1
>> >>
>> >> that might make it clearer what is going on.
>> >> -Sam
>> >>
>> >> On Mon, Jun 6, 2016 at 11:53 AM, Tu Holmes <tu.hol...@gmail.com> wrote:
>> >> > Hey cephers. I have been following the upgrade documents and I have
>> >> > done
>> >> > everything regarding upgrading the client to the latest version of
>> >> > Hammer,
>> >> > then to Jewel.
>> >> >
>> >> > I made sure that the owner of log partitions and all other items is
>> >> > the
>> >> > ceph
>> >> > user and I've gone through the process as was described in the
>> >> > documents,
>> >> > but I am getting this on my nodes as I upgrade.
>> >> >
>> >> >
>> >> > --- begin dump of recent events ---
>> >> >
>> >> >-72> 2016-06-06 11:49:37.720315 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command perfcounters_dump hook 0x7f09dbb58050
>> >> >
>> >> >-71> 2016-06-06 11:49:37.720328 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command 1 hook 0x7f09dbb58050
>> >> >
>> >> >-70> 2016-06-06 11:49:37.720330 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command perf dump hook 0x7f09dbb58050
>> >> >
>> >> >-69> 2016-06-06 11:49:37.720332 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command perfcounters_schema hook 0x7f09dbb58050
>> >> >
>> >> >-68> 2016-06-06 11:49:37.720333 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command 2 hook 0x7f09dbb58050
>> >> >
>> >> >-67> 2016-06-06 11:49:37.720334 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command perf schema hook 0x7f09dbb58050
>> >> >
>> >> >-66> 2016-06-06 11:49:37.720335 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command perf reset hook 0x7f09dbb58050
>> >> >
>> >> >-65> 2016-06-06 11:49:37.720337 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command config show hook 0x7f09dbb58050
>> >> >
>> >> >-64> 2016-06-06 11:49:37.720338 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command config set hook 0x7f09dbb58050
>> >> >
>> >> >-63> 2016-06-06 11:49:37.720339 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command config get hook 0x7f09dbb58050
>> >> >
>> >> >-62> 2016-06-06 11:49:37.720340 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command config diff hook 0x7f09dbb58050
>> >> >
>> >> >-61> 2016-06-06 11:49:37.720342 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command log flush hook 0x7f09dbb58050
>> >> >
>> >> >-60> 2016-06-06 11:49:37.720343 7f09d0152800  5
>> >> > asok(0x7f09dbb78280)
>> >> > register_command log dump hook 0x7f09dbb58050
>> >> >
>>

Re: [ceph-users] Upgrade Errors.

2016-06-06 Thread Samuel Just
Oh, what was the problem (for posterity)?
-Sam

On Mon, Jun 6, 2016 at 12:11 PM, Tu Holmes <tu.hol...@gmail.com> wrote:
> It totally did and I see what the problem is.
>
> Thanks for your input. I truly appreciate it.
>
>
> On Mon, Jun 6, 2016 at 12:01 PM Samuel Just <sj...@redhat.com> wrote:
>>
>> If you reproduce with
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>>
>> that might make it clearer what is going on.
>> -Sam
>>
>> On Mon, Jun 6, 2016 at 11:53 AM, Tu Holmes <tu.hol...@gmail.com> wrote:
>> > Hey cephers. I have been following the upgrade documents and I have done
>> > everything regarding upgrading the client to the latest version of
>> > Hammer,
>> > then to Jewel.
>> >
>> > I made sure that the owner of log partitions and all other items is the
>> > ceph
>> > user and I've gone through the process as was described in the
>> > documents,
>> > but I am getting this on my nodes as I upgrade.
>> >
>> >
>> > --- begin dump of recent events ---
>> >
>> >-72> 2016-06-06 11:49:37.720315 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command perfcounters_dump hook 0x7f09dbb58050
>> >
>> >-71> 2016-06-06 11:49:37.720328 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command 1 hook 0x7f09dbb58050
>> >
>> >-70> 2016-06-06 11:49:37.720330 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command perf dump hook 0x7f09dbb58050
>> >
>> >-69> 2016-06-06 11:49:37.720332 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command perfcounters_schema hook 0x7f09dbb58050
>> >
>> >-68> 2016-06-06 11:49:37.720333 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command 2 hook 0x7f09dbb58050
>> >
>> >-67> 2016-06-06 11:49:37.720334 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command perf schema hook 0x7f09dbb58050
>> >
>> >-66> 2016-06-06 11:49:37.720335 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command perf reset hook 0x7f09dbb58050
>> >
>> >-65> 2016-06-06 11:49:37.720337 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command config show hook 0x7f09dbb58050
>> >
>> >-64> 2016-06-06 11:49:37.720338 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command config set hook 0x7f09dbb58050
>> >
>> >-63> 2016-06-06 11:49:37.720339 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command config get hook 0x7f09dbb58050
>> >
>> >-62> 2016-06-06 11:49:37.720340 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command config diff hook 0x7f09dbb58050
>> >
>> >-61> 2016-06-06 11:49:37.720342 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command log flush hook 0x7f09dbb58050
>> >
>> >-60> 2016-06-06 11:49:37.720343 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command log dump hook 0x7f09dbb58050
>> >
>> >-59> 2016-06-06 11:49:37.720344 7f09d0152800  5 asok(0x7f09dbb78280)
>> > register_command log reopen hook 0x7f09dbb58050
>> >
>> >-58> 2016-06-06 11:49:37.723459 7f09d0152800  0 set uid:gid to
>> > 1000:1000
>> > (ceph:ceph)
>> >
>> >-57> 2016-06-06 11:49:37.723476 7f09d0152800  0 ceph version 10.2.1
>> > (3a66dd4f30852819c1bdaa8ec23c795d4ad77269), process ceph-osd, pid 9943
>> >
>> >-56> 2016-06-06 11:49:37.727080 7f09d0152800  1 -- 10.253.50.213:0/0
>> > learned my addr 10.253.50.213:0/0
>> >
>> >-55> 2016-06-06 11:49:37.727092 7f09d0152800  1
>> > accepter.accepter.bind
>> > my_inst.addr is 10.253.50.213:6806/9943 need_addr=0
>> >
>> >-54> 2016-06-06 11:49:37.727104 7f09d0152800  1 -- 172.16.1.3:0/0
>> > learned
>> > my addr 172.16.1.3:0/0
>> >
>> >-53> 2016-06-06 11:49:37.727109 7f09d0152800  1
>> > accepter.accepter.bind
>> > my_inst.addr is 172.16.1.3:6806/9943 need_addr=0
>> >
>> >-52> 2016-06-06 11:49:37.727119 7f09d0152800  1 -- 172.16.1.3:0/0
>> > learned
>> > my addr 172.16.1.3:0/0
>> >
>> >-51> 2016-06-06 11:49:37.727129 7f09d0152800  1
>> > accepter.accepter.bind
>> > my_inst.addr is 172.16.1.3:6807/9943 need_addr=0
>> >
>> >-50> 2016-06-06 11:49:37.727139 7f09d0152800  1 -- 10.253.50.213:0/0
>> > learned my addr 10.2

Re: [ceph-users] Upgrade Errors.

2016-06-06 Thread Samuel Just
If you reproduce with

debug osd = 20
debug filestore = 20
debug ms = 1

that might make it clearer what is going on.
-Sam

On Mon, Jun 6, 2016 at 11:53 AM, Tu Holmes  wrote:
> Hey cephers. I have been following the upgrade documents and I have done
> everything regarding upgrading the client to the latest version of Hammer,
> then to Jewel.
>
> I made sure that the owner of log partitions and all other items is the ceph
> user and I've gone through the process as was described in the documents,
> but I am getting this on my nodes as I upgrade.
>
>
> --- begin dump of recent events ---
>
>-72> 2016-06-06 11:49:37.720315 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command perfcounters_dump hook 0x7f09dbb58050
>
>-71> 2016-06-06 11:49:37.720328 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command 1 hook 0x7f09dbb58050
>
>-70> 2016-06-06 11:49:37.720330 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command perf dump hook 0x7f09dbb58050
>
>-69> 2016-06-06 11:49:37.720332 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command perfcounters_schema hook 0x7f09dbb58050
>
>-68> 2016-06-06 11:49:37.720333 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command 2 hook 0x7f09dbb58050
>
>-67> 2016-06-06 11:49:37.720334 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command perf schema hook 0x7f09dbb58050
>
>-66> 2016-06-06 11:49:37.720335 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command perf reset hook 0x7f09dbb58050
>
>-65> 2016-06-06 11:49:37.720337 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command config show hook 0x7f09dbb58050
>
>-64> 2016-06-06 11:49:37.720338 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command config set hook 0x7f09dbb58050
>
>-63> 2016-06-06 11:49:37.720339 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command config get hook 0x7f09dbb58050
>
>-62> 2016-06-06 11:49:37.720340 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command config diff hook 0x7f09dbb58050
>
>-61> 2016-06-06 11:49:37.720342 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command log flush hook 0x7f09dbb58050
>
>-60> 2016-06-06 11:49:37.720343 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command log dump hook 0x7f09dbb58050
>
>-59> 2016-06-06 11:49:37.720344 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command log reopen hook 0x7f09dbb58050
>
>-58> 2016-06-06 11:49:37.723459 7f09d0152800  0 set uid:gid to 1000:1000
> (ceph:ceph)
>
>-57> 2016-06-06 11:49:37.723476 7f09d0152800  0 ceph version 10.2.1
> (3a66dd4f30852819c1bdaa8ec23c795d4ad77269), process ceph-osd, pid 9943
>
>-56> 2016-06-06 11:49:37.727080 7f09d0152800  1 -- 10.253.50.213:0/0
> learned my addr 10.253.50.213:0/0
>
>-55> 2016-06-06 11:49:37.727092 7f09d0152800  1 accepter.accepter.bind
> my_inst.addr is 10.253.50.213:6806/9943 need_addr=0
>
>-54> 2016-06-06 11:49:37.727104 7f09d0152800  1 -- 172.16.1.3:0/0 learned
> my addr 172.16.1.3:0/0
>
>-53> 2016-06-06 11:49:37.727109 7f09d0152800  1 accepter.accepter.bind
> my_inst.addr is 172.16.1.3:6806/9943 need_addr=0
>
>-52> 2016-06-06 11:49:37.727119 7f09d0152800  1 -- 172.16.1.3:0/0 learned
> my addr 172.16.1.3:0/0
>
>-51> 2016-06-06 11:49:37.727129 7f09d0152800  1 accepter.accepter.bind
> my_inst.addr is 172.16.1.3:6807/9943 need_addr=0
>
>-50> 2016-06-06 11:49:37.727139 7f09d0152800  1 -- 10.253.50.213:0/0
> learned my addr 10.253.50.213:0/0
>
>-49> 2016-06-06 11:49:37.727143 7f09d0152800  1 accepter.accepter.bind
> my_inst.addr is 10.253.50.213:6807/9943 need_addr=0
>
>-48> 2016-06-06 11:49:37.727148 7f09d0152800  0 pidfile_write: ignore
> empty --pid-file
>
>-47> 2016-06-06 11:49:37.728364 7f09d0152800  5 asok(0x7f09dbb78280) init
> /var/run/ceph/ceph-osd.8.asok
>
>-46> 2016-06-06 11:49:37.728417 7f09d0152800  5 asok(0x7f09dbb78280)
> bind_and_listen /var/run/ceph/ceph-osd.8.asok
>
>-45> 2016-06-06 11:49:37.728472 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command 0 hook 0x7f09dbb54110
>
>-44> 2016-06-06 11:49:37.728488 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command version hook 0x7f09dbb54110
>
>-43> 2016-06-06 11:49:37.728493 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command git_version hook 0x7f09dbb54110
>
>-42> 2016-06-06 11:49:37.728498 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command help hook 0x7f09dbb58230
>
>-41> 2016-06-06 11:49:37.728502 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command get_command_descriptions hook 0x7f09dbb58220
>
>-40> 2016-06-06 11:49:37.728544 7f09d0152800 10 monclient(hunting):
> build_initial_monmap
>
>-39> 2016-06-06 11:49:37.734765 7f09c9df5700  5 asok(0x7f09dbb78280)
> entry start
>
>-38> 2016-06-06 11:49:37.736541 7f09d0152800  5 adding auth protocol:
> cephx
>
>-37> 2016-06-06 11:49:37.736552 7f09d0152800  5 adding auth protocol:
> cephx
>
>-36> 2016-06-06 11:49:37.736672 7f09d0152800  5 asok(0x7f09dbb78280)
> register_command 

[ceph-users] jewel upgrade and sortbitwise

2016-06-02 Thread Samuel Just
Due to http://tracker.ceph.com/issues/16113, it would be best to avoid
setting the sortbitwise flag on jewel clusters upgraded from previous
versions until we get a point release out with a fix.

The symptom is that setting the sortbitwise flag on a jewel cluster
upgraded from a previous version can result in some pgs reporting
spurious unfound objects.  Unsetting sortbitwise should cause the PGs
to go back to normal.  Clusters created at jewel don't need to worry
about this.
-Sam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Samuel Just
I suspect the problem is that ReplicatedBackend::build_push_op assumes
that osd_recovery_max_chunk (defaults to 8MB) of omap entries is about
the same amount of work to get as 8MB of normal object data.  The fix
would be to add another config osd_recovery_max_omap_entries_per_chunk
with a sane default (50k?) and update ReplicatedBackend::build_push_op
to use it.

http://tracker.ceph.com/issues/16128

You might be able to work around it by setting the recovery chunk to
be a lot smaller than 8MB.
-Sam

On Thu, Jun 2, 2016 at 10:22 AM, Gregory Farnum  wrote:
> On Thu, Jun 2, 2016 at 9:49 AM, Adam Tygart  wrote:
>> Okay,
>>
>> Exporting, removing and importing the pgs seems to be working
>> (slowly). The question now becomes, why does and export/import work?
>> That would make me think there is a bug in there somewhere in the pg
>> loading code. Or does it have to do with re-creating the leveldb
>> databases? The same number of objects are still in each pg, along with
>> the same number of omap keys... Something doesn't seem quite right.
>
> Your OSDs are timing out because a monitor thinks the thread doing
> recovery has died.
> It thinks it dies because the recovery thread hasn't checked in for
> 300 (!) seconds.
> This is apparently because it's taking that long to read out the omap stuff.
>
> ceph-objectstore-tool, on the other hand, doesn't care how long the
> reads take. Plus, yes, dumping all the keys into leveldb at once means
> they're contiguous and much faster to access in the future (until it
> gets fragmented again).
>
>>
>> If it is too many files in a single directory, what would be the upper
>> limit to target? I'd like to know when I should be yelling and kicking
>> and screaming at my users to fix their code.
>
> That really depends on your hardware, config, and load. SSDs can
> probably do a lot more than HDDs; the fewer competing operations
> happening simultaneously, the more ops your omap read can make happen
> within a timeout period.
>
> This all said, I'm a little surprised that you're timing out *that
> badly* on only 1 million omap entries. I thought it took more than
> that, but if the OSD is busy doing other things I guess not — you're
> talking about reading 1 million entries, plus everything else the OSD
> is doing, taking > ~3 ops (300s*100ops/s). 30 entries per seek
> would be really low, but if it's only getting a fragment of the actual
> disk ops due to client reads (and any other recovery going on, if you
> have multiple backfills?), it becomes more plausible...
>
> Unless Sam thinks this might be indicative of some other problem in the 
> system?
> -Greg
>
>>
>> On Wed, Jun 1, 2016 at 6:07 PM, Brandon Morris, PMP
>>  wrote:
>>> I concur with Greg.
>>>
>>> The only way that I was able to get back to Health_OK was to export/import.
>>> * Please note, any time you use the ceph_objectstore_tool you risk data
>>> loss if not done carefully.   Never remove a PG until you have a known good
>>> export *
>>>
>>> Here are the steps I used:
>>>
>>> 1. set NOOUT, NO BACKFILL
>>> 2. Stop the OSD's that have the erroring PG
>>> 3. Flush the journal and export the primary version of the PG.  This took 1
>>> minute on a well-behaved PG and 4 hours on the misbehaving PG
>>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
>>> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
>>> --file /root/32.10c.b.export
>>>
>>> 4. Import the PG into a New / Temporary OSD that is also offline,
>>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
>>> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
>>> --file /root/32.10c.b.export
>>>
>>> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case
>>> it looks like)
>>> 6. Start cluster OSD's
>>> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3
>>> OSD's it is supposed to be on.
>>>
>>> This is similar to the recovery process described in this post from
>>> 04/09/2015:
>>> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>>> Hopefully it works in your case too and you can the cluster back to a state
>>> that you can make the CephFS directories smaller.
>>>
>>> - Brandon
>>>
>>> On Wed, Jun 1, 2016 at 4:22 PM, Gregory Farnum  wrote:

 On Wed, Jun 1, 2016 at 2:47 PM, Adam Tygart  wrote:
 > I tried to compact the leveldb on osd 16 and the osd is still hitting
 > the suicide timeout. I know I've got some users with more than 1
 > million files in single directories.
 >
 > Now that I'm in this situation, can I get some pointers on how can I
 > use either of your options?

 In a literal sense, you either make the CephFS directories smaller by
 moving files out of them. Or you enable directory fragmentation with

 

Re: [ceph-users] OSD Restart results in "unfound objects"

2016-06-01 Thread Samuel Just
http://tracker.ceph.com/issues/16113

I think I found the bug.  Thanks for the report!  Turning off
sortbitwise should be an ok workaround for the moment.
-Sam

On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
<diego.cas...@getupcloud.com> wrote:
> Yes, it was created as Hammer.
> I haven't faced any issues on the upgrade (despite the well know systemd),
> and after that the cluster didn't show any suspicious behavior.
>
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 18:57 GMT-03:00 Samuel Just <sj...@redhat.com>:
>>
>> Was this cluster upgraded to jewel?  If so, at what version did it start?
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
>> <diego.cas...@getupcloud.com> wrote:
>> > Hello Samuel, i'm bit afraid of restarting my osd's again, i'll wait
>> > until
>> > the weekend to push the config.
>> > BTW, i just unset sortbitwise flag.
>> >
>> >
>> > ---
>> > Diego Castro / The CloudFather
>> > GetupCloud.com - Eliminamos a Gravidade
>> >
>> > 2016-06-01 13:39 GMT-03:00 Samuel Just <sj...@redhat.com>:
>> >>
>> >> Can either of you reproduce with logs?  That would make it a lot
>> >> easier to track down if it's a bug.  I'd want
>> >>
>> >> debug osd = 20
>> >> debug ms = 1
>> >> debug filestore = 20
>> >>
>> >> On all of the osds for a particular pg from when it is clean until it
>> >> develops an unfound object.
>> >> -Sam
>> >>
>> >> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>> >> <diego.cas...@getupcloud.com> wrote:
>> >> > Hello Uwe, i also have sortbitwise flag enable and i have the exactly
>> >> > behavior of yours.
>> >> > Perhaps this is also the root of my issues, does anybody knows if is
>> >> > safe to
>> >> > disable it?
>> >> >
>> >> >
>> >> > ---
>> >> > Diego Castro / The CloudFather
>> >> > GetupCloud.com - Eliminamos a Gravidade
>> >> >
>> >> > 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <u...@mesecke.net>:
>> >> >>
>> >> >>
>> >> >> > Am 01.06.2016 um 10:25 schrieb Diego Castro
>> >> >> > <diego.cas...@getupcloud.com>:
>> >> >> >
>> >> >> > Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon.
>> >> >> > Today my cluster suddenly went unhealth with lots of stuck pg's
>> >> >> > due
>> >> >> > unfound objects, no disks failures nor node crashes, it just went
>> >> >> > bad.
>> >> >> >
>> >> >> > I managed to put the cluster on health state again by marking lost
>> >> >> > objects to delete "ceph pg  mark_unfound_lost delete".
>> >> >> > Regarding the fact that i have no idea why the cluster gone bad, i
>> >> >> > realized restarting the osd' daemons to unlock stuck clients put
>> >> >> > the
>> >> >> > cluster
>> >> >> > on unhealth and pg gone stuck again due unfound objects.
>> >> >> >
>> >> >> > Does anyone have this issue?
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I also ran into that problem after upgrading to jewel. In my case I
>> >> >> was
>> >> >> able to somewhat correlate this behavior with setting the
>> >> >> sortbitwise
>> >> >> flag
>> >> >> after the upgrade. When the flag is set, after some time these
>> >> >> unfound
>> >> >> objects are popping up. Restarting osds just makes it worse and/or
>> >> >> makes
>> >> >> these problems appear faster. When looking at the missing objects I
>> >> >> can
>> >> >> see
>> >> >> that sometimes even region or zone configuration objects for radosgw
>> >> >> are
>> >> >> missing which I know are there because the radosgw was using these
>> >> >> just
>> >> >> before.
>> >> >>
>> >> >> After unsetting the sortbitwise flag, the PGs go back to normal, all
>> >> >> previously unfound objects are found and the cluster becomes healthy
>> >> >> again.
>> >> >>
>> >> >> Of course I’m not sure whether this is the real root of the problem
>> >> >> or
>> >> >> just a coincidence but I can reproduce this behavior every time.
>> >> >>
>> >> >> So for now the cluster is running without this flag. :-/
>> >> >>
>> >> >> Regards,
>> >> >> Uwe
>> >> >>
>> >> >> >
>> >> >> > ---
>> >> >> > Diego Castro / The CloudFather
>> >> >> > GetupCloud.com - Eliminamos a Gravidade
>> >> >> > ___
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@lists.ceph.com
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restart results in "unfound objects"

2016-06-01 Thread Samuel Just
Was this cluster upgraded to jewel?  If so, at what version did it start?
-Sam

On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
<diego.cas...@getupcloud.com> wrote:
> Hello Samuel, i'm bit afraid of restarting my osd's again, i'll wait until
> the weekend to push the config.
> BTW, i just unset sortbitwise flag.
>
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 13:39 GMT-03:00 Samuel Just <sj...@redhat.com>:
>>
>> Can either of you reproduce with logs?  That would make it a lot
>> easier to track down if it's a bug.  I'd want
>>
>> debug osd = 20
>> debug ms = 1
>> debug filestore = 20
>>
>> On all of the osds for a particular pg from when it is clean until it
>> develops an unfound object.
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>> <diego.cas...@getupcloud.com> wrote:
>> > Hello Uwe, i also have sortbitwise flag enable and i have the exactly
>> > behavior of yours.
>> > Perhaps this is also the root of my issues, does anybody knows if is
>> > safe to
>> > disable it?
>> >
>> >
>> > ---
>> > Diego Castro / The CloudFather
>> > GetupCloud.com - Eliminamos a Gravidade
>> >
>> > 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <u...@mesecke.net>:
>> >>
>> >>
>> >> > Am 01.06.2016 um 10:25 schrieb Diego Castro
>> >> > <diego.cas...@getupcloud.com>:
>> >> >
>> >> > Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon.
>> >> > Today my cluster suddenly went unhealth with lots of stuck pg's  due
>> >> > unfound objects, no disks failures nor node crashes, it just went
>> >> > bad.
>> >> >
>> >> > I managed to put the cluster on health state again by marking lost
>> >> > objects to delete "ceph pg  mark_unfound_lost delete".
>> >> > Regarding the fact that i have no idea why the cluster gone bad, i
>> >> > realized restarting the osd' daemons to unlock stuck clients put the
>> >> > cluster
>> >> > on unhealth and pg gone stuck again due unfound objects.
>> >> >
>> >> > Does anyone have this issue?
>> >>
>> >> Hi,
>> >>
>> >> I also ran into that problem after upgrading to jewel. In my case I was
>> >> able to somewhat correlate this behavior with setting the sortbitwise
>> >> flag
>> >> after the upgrade. When the flag is set, after some time these unfound
>> >> objects are popping up. Restarting osds just makes it worse and/or
>> >> makes
>> >> these problems appear faster. When looking at the missing objects I can
>> >> see
>> >> that sometimes even region or zone configuration objects for radosgw
>> >> are
>> >> missing which I know are there because the radosgw was using these just
>> >> before.
>> >>
>> >> After unsetting the sortbitwise flag, the PGs go back to normal, all
>> >> previously unfound objects are found and the cluster becomes healthy
>> >> again.
>> >>
>> >> Of course I’m not sure whether this is the real root of the problem or
>> >> just a coincidence but I can reproduce this behavior every time.
>> >>
>> >> So for now the cluster is running without this flag. :-/
>> >>
>> >> Regards,
>> >> Uwe
>> >>
>> >> >
>> >> > ---
>> >> > Diego Castro / The CloudFather
>> >> > GetupCloud.com - Eliminamos a Gravidade
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restart results in "unfound objects"

2016-06-01 Thread Samuel Just
Can either of you reproduce with logs?  That would make it a lot
easier to track down if it's a bug.  I'd want

debug osd = 20
debug ms = 1
debug filestore = 20

On all of the osds for a particular pg from when it is clean until it
develops an unfound object.
-Sam

On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
 wrote:
> Hello Uwe, i also have sortbitwise flag enable and i have the exactly
> behavior of yours.
> Perhaps this is also the root of my issues, does anybody knows if is safe to
> disable it?
>
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke :
>>
>>
>> > Am 01.06.2016 um 10:25 schrieb Diego Castro
>> > :
>> >
>> > Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon.
>> > Today my cluster suddenly went unhealth with lots of stuck pg's  due
>> > unfound objects, no disks failures nor node crashes, it just went bad.
>> >
>> > I managed to put the cluster on health state again by marking lost
>> > objects to delete "ceph pg  mark_unfound_lost delete".
>> > Regarding the fact that i have no idea why the cluster gone bad, i
>> > realized restarting the osd' daemons to unlock stuck clients put the 
>> > cluster
>> > on unhealth and pg gone stuck again due unfound objects.
>> >
>> > Does anyone have this issue?
>>
>> Hi,
>>
>> I also ran into that problem after upgrading to jewel. In my case I was
>> able to somewhat correlate this behavior with setting the sortbitwise flag
>> after the upgrade. When the flag is set, after some time these unfound
>> objects are popping up. Restarting osds just makes it worse and/or makes
>> these problems appear faster. When looking at the missing objects I can see
>> that sometimes even region or zone configuration objects for radosgw are
>> missing which I know are there because the radosgw was using these just
>> before.
>>
>> After unsetting the sortbitwise flag, the PGs go back to normal, all
>> previously unfound objects are found and the cluster becomes healthy again.
>>
>> Of course I’m not sure whether this is the real root of the problem or
>> just a coincidence but I can reproduce this behavior every time.
>>
>> So for now the cluster is running without this flag. :-/
>>
>> Regards,
>> Uwe
>>
>> >
>> > ---
>> > Diego Castro / The CloudFather
>> > GetupCloud.com - Eliminamos a Gravidade
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unfound objects - why and how to recover ? (bonus : jewel logs)

2016-05-27 Thread Samuel Just
Well, it's not supposed to do that if the backing storage is working
properly.  If the filesystem/disk controller/disk combination is not
respecting barriers (or otherwise can lose committed data in a power
failure) in your configuration, a power failure could cause a node to
go backwards in time -- that would explain it.  Without logs, I can't
say any more.  If you can reproduce, we'll want

debug osd = 20
debug filestore = 20
debug ms = 1

on all of the osds involved in an affected PG.
-Sam

On Fri, May 27, 2016 at 7:04 AM, SCHAER Frederic  wrote:
> Hi,
>
>
>
> --
>
> First, let me start with the bonus…
>
> I migrated from hammer => jewel and followed the migration instructions… but
> migrations instructions are missing this :
>
> #chown  -R ceph:ceph /var/log/ceph
>
> I just discoved this was the reason I found no log nowhere about my current
> issue :/
>
> --
>
>
>
> This is maybe the 3rd time this happens to me … This time I’d like to try to
> understand what happens.
>
>
>
> So. ceph-10.2.0-0.el7.x86_64+Cent0S 7.2 here.
>
> Ceph health was happy, but any rbd operation was hanging – hence : ceph was
> hung, and so were the test VMs running on it.
>
>
>
> I placed my VM in an EC pool on top of which I overlayed an RBD pool with
> SSDs.
>
> The EC pool is defined as being a 3+1 pool, with 5 hosts hosting the OSDs
> (and the failure domain is set to hosts)
>
>
>
> “Ceph –w” wasn’t displaying new status lines as usual, but ceph health
> (detail) wasn’t saying anything would be wrong.
>
> After looking at one node, I found that ceph logs were empty on one node, so
> I decided to restart the OSDs on that one using : systemctl restart
> ceph-osd@*
>
>
>
> After I did that, ceph –w got to life again , but telling me there was a
> dead MON – which I restarted too.
>
> I watched some kind of recovery happening, and after a few seconds/minutes,
> I now see :
>
>
>
> [root@ceph0 ~]# ceph health detail
>
> HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs
> stuck unclean; recovery 57/373846 objects degraded (0.015%); recovery
> 57/110920 unfound (0.051%)
>
> pg 691.65 is stuck unclean for 310704.556119, current state
> active+recovery_wait+degraded, last acting [44,99,69,9]
>
> pg 691.1e5 is stuck unclean for 493631.370697, current state
> active+recovering+degraded, last acting [77,43,20,99]
>
> pg 691.12a is stuck unclean for 14521.475478, current state
> active+recovering+degraded, last acting [42,56,7,106]
>
> pg 691.165 is stuck unclean for 14521.474525, current state
> active+recovering+degraded, last acting [21,71,24,117]
>
> pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound
>
> pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound
>
> pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound
>
> pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound
>
> recovery 57/373846 objects degraded (0.015%)
>
> recovery 57/110920 unfound (0.051%)
>
>
>
> Damn.
>
> Last time this happened, I was forced to declare lost the PGs in order to
> recover a “healthy” ceph, because ceph does not want to revert PGs in EC
> pools. But one of the VMs started hanging randomly on disk IOs…
>
> This same VM is now down, and I can’t remove its disk from rbd, it’s hanging
> at 99% - I could work that around by renaming the file and re-installing the
> VM on a new disk, but anyway, I’d like to understand+fix+make sure this does
> not happen again.
>
> We sometimes suffer power cuts here : if restarting daemons kills ceph data,
> I cannot think of what would happen in case of power cut…
>
>
>
> Back to the unfound objects. I have no OSD down that would be in the cluster
> (only 1 down, and I put it myself down – OSD.46 - , but set its weight to 0
> last week)
>
> I can query the PGs, but I don’t understand what I see in there.
>
> For instance :
>
>
>
> #ceph pg 691.65 query
>
> (…)
>
> "num_objects_missing": 0,
>
> "num_objects_degraded": 39,
>
> "num_objects_misplaced": 0,
>
> "num_objects_unfound": 39,
>
> "num_objects_dirty": 138,
>
>
>
> And then for 2 peers I see :
>
> "state": "active+undersized+degraded", ## undersized ???
>
> (…)
>
> "num_objects_missing": 0,
>
> "num_objects_degraded": 138,
>
> "num_objects_misplaced": 138,
>
> "num_objects_unfound": 0,
>
> "num_objects_dirty": 138,
>
> "blocked_by": [],
>
> "up_primary": 44,
>
> "acting_primary": 44
>
>
>
>
>
> If I look at the “missing” objects, I can see something on some OSDs :
>
> # ceph pg 691.165 list_missing
>
> (…)
>
> {
>
> "oid": {
>
> "oid": "rbd_data.8de32431bd7b7.0ea7",
>
> "key": "",
>
> "snapid": -2,
>
> 

Re: [ceph-users] help removing an rbd image?

2016-05-26 Thread Samuel Just
IIRC, the rbd_directory isn't supposed to be removed, so that's fine.

In summary, it works correctly with filestore, but not with bluestore?
 In that case, the next step is to file a bug.  Ideally, you'd
reproduce with only 3 osds with debugging (specified below) on all
osds from cluster startup through reproduction and attach all 3 osd
logs to the bug.  I'm not sure whether it's terribly urgent to fix
since bluestore undergoing some significant changes right now.  Sage?

debugging:
debug bluestore = 20
debug osd = 20
debug ms = 1
-Sam

On Thu, May 26, 2016 at 10:27 AM, Kevan Rehm <kr...@cray.com> wrote:
> Samuel,
>
> Back again.   I converted my cluster to use 24 filestore OSDs, and ran the
> following test three times:
>
> rbd -p ssd_replica create --size 100G image1
> rbd --pool ssd_replica bench-write --io-size 2M --io-threads 16 --io-total
> 100G --io-pattern seq image1
> rbd -p ssd_replica rm image1
>
>
> and in each case the rbd image was removed successfully, rados showed no
> leftover objects other than 'rbd_directory' which I could remove with
> rados.  (The pool is empty other than for this one image.)  I then
> converted the cluster back to all-bluestore OSDs.  For my first run I only
> created a 10G-sized image1 object, and that was removed successfully.  I
> then repeated the above with 100G, and the problem reappeared, I again
> have objects I cannot remove.  (bluestore warnings removed for brevity).
>
> [root@alpha1-p200 ~]# rbd -p ssd_replica ls
> image1
> [root@alpha1-p200 ~]# rbd -p ssd_replica rm image1
> Removing image: 100% complete...done.
> [root@alpha1-p200 ~]# rbd -p ssd_replica ls
> image1
> [root@alpha1-p200 ~]# rados -p ssd_replica ls | wc -l
> 8582
>
> This is slightly different than last time in that the rbd image 'image1'
> still appears in the pool, I don't get the "No such file or directory"
> errors anymore for the rbd image.  But I do get those errors for the
> leftover objects that make up the image when I try to remove them:
>
> [root@alpha1-p200 ~]# rados -p ssd_replica rm
> rbd_data.112d238e1f29.2d3d
> error removing ssd_replica>rbd_data.112d238e1f29.2d3d: (2) No
> such file or directory
>
> I'm not sure where to go from here.   Suggestions?
>
>
> Kevan
>
>
>
> On 5/24/16, 4:11 PM, "Kevan Rehm" <kr...@cray.com> wrote:
>
>>Okay, will do.   If the problem goes away with filestore, I'll switch back
>>to bluestore again and re-duplicate the problem.  In that case, are there
>>particular things you would like me to collect?   Or clues I should look
>>for in logs?
>>
>>Thanks, Kevan
>>
>>On 5/24/16, 4:06 PM, "Samuel Just" <sj...@redhat.com> wrote:
>>
>>>My money is on bluestore.  If you can try to reproduce on filestore,
>>>that would rapidly narrow it down.
>>>-Sam
>>>
>>>On Tue, May 24, 2016 at 1:53 PM, Kevan Rehm <kr...@cray.com> wrote:
>>>> Nope, not using tiering.
>>>>
>>>> Also, this is my second attempt, this is repeatable for me, I'm trying
>>>>to
>>>> duplicate a previous occurrence of this same problem to collect useful
>>>> debug data.  In the previous case, I was eventually able to get rid of
>>>>the
>>>> objects (but have forgotten how), but that was followed by 22 of the 24
>>>> OSDs crashing hard.  Took me quite a while to re-deploy and get it
>>>>working
>>>> again.  I want to get a small, repeatable example for the Ceph guys to
>>>> look at, assuming it's a bug.
>>>>
>>>> Don't know if it's related to the bluestore OSDs or not, still getting
>>>>my
>>>> feet wet with Ceph.
>>>>
>>>> Kevan
>>>>
>>>> On 5/24/16, 3:47 PM, "Jason Dillaman" <jdill...@redhat.com> wrote:
>>>>
>>>>>Any chance you are using cache tiering?  It's odd that you can see the
>>>>>objects through "rados ls" but cannot delete them with "rados rm".
>>>>>
>>>>>On Tue, May 24, 2016 at 4:34 PM, Kevan Rehm <kr...@cray.com> wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> I have a small Ceph 10.2.1 test cluster using a 3-replicate pool
>>>>>>based
>>>>>>on 24
>>>>>> SSDs configured with bluestore.  I created and wrote an rbd image
>>>>>>called
>>>>>> "image1", then deleted the image again.
>>>>>>
>>>>>> rbd -p ssd_replica create --size

Re: [ceph-users] ceph hang on pg list_unfound

2016-05-19 Thread Samuel Just
Restart osd.1 with debugging enabled

debug osd = 20
debug filestore = 20
debug ms = 1

Then, run list_unfound once the pg is back in active+recovering.  If
it still hangs, post osd.1's log to the list along with the output of
ceph osd dump and ceph pg dump.
-Sam

On Wed, May 18, 2016 at 6:20 PM, Don Waterloo  wrote:
> I am running 10.2.0-0ubuntu0.16.04.1.
> I've run into a problem w/ cephfs metadata pool. Specifically I have a pg w/
> an 'unfound' object.
>
> But i can't figure out which since when i run:
> ceph pg 12.94 list_unfound
>
> it hangs (as does ceph pg 12.94 query). I know its in the cephfs metadata
> pool since I run:
> ceph pg ls-by-pool cephfs_metadata |egrep "pg_stat|12\\.94"
>
> and it shows it there:
> pg_stat objects mip degrmispunf bytes   log disklog
> state   state_stamp v   reportedup  up_primary
> acting  acting_primary  last_scrub  scrub_stamp last_deep_scrub
> deep_scrub_stamp
> 12.94   231 1   1   0   1   90  30923092
> active+recovering+degraded  2016-05-18 23:49:15.718772  8957'386130
> 9472:367098 [1,4]   1   [1,4]   1   8935'385144 2016-05-18
> 10:46:46.123526 8337'379527 2016-05-14 22:37:05.974367
>
> OK, so what is hanging, and how can i get it to unhang so i can run a
> 'mark_unfound_lost' on it?
>
> pg 12.94 is on osd.0
>
> ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.48996 root default
> -2 0.8 host nubo-1
>  0 0.8 osd.0 up  1.0  1.0
> -3 0.8 host nubo-2
>  1 0.8 osd.1 up  1.0  1.0
> -4 0.8 host nubo-3
>  2 0.8 osd.2 up  1.0  1.0
> -5 0.92999 host nubo-19
>  3 0.92999 osd.3 up  1.0  1.0
> -6 0.92999 host nubo-20
>  4 0.92999 osd.4 up  1.0  1.0
> -7 0.92999 host nubo-21
>  5 0.92999 osd.5 up  1.0  1.0
>
> I cranked the logging on osd.0. I see a lot of messages, but nothing
> interesting.
>
> I've double checked all nodes can ping each other. I've run 'xfs_repair' on
> the underlying xfs storage to check for issues (there were none).
>
> Can anyone suggest how to uncrack this hang so i can try and repair this
> system?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete after power failure.

2016-05-17 Thread Samuel Just
Try restarting the primary osd for that pg with
osd_find_best_info_ignore_history_les set to true (don't leave it set
long term).
-Sam

On Tue, May 17, 2016 at 7:50 AM, Hein-Pieter van Braam  wrote:
> Hello,
>
> Today we had a power failure in a rack housing our OSD servers. We had
> 7 of our 30 total OSD nodes down. Of the affect PG 2 out of the 3 OSDs
> went down.
>
> After everything was back and mostly healthy I found one placement
> group marked as incomplete. I can't figure out why.
>
> I'm running ceph 0.94.6 on CentOS7. The following steps have been tried
> in this order:
>
> 1) Reduce the min_size from 2 to 1 (as suggested by ceph health detail)
> 2) Set the 2 OSDs that were down to 'out' (one by one) and waited for
> the cluster to recover. (this did not work, I set them back in)
> 3) use ceph-objectstore-tool to export the pg from the 2 osds that went
> down, then removed it, restarted the osds.
> 4) When this did not work, import the data exported from the unaffected
> OSD into the two remaining osds.
> 5) Import the data from the unaffected OSD into all osds that are noted
> in "probing_osds"
>
> None of these had any effect on the stuck incomplete PG. I have
> attached the output of "ceph pg 54.3e9 query", "ceph health detail", as
> well as "ceph -s"
>
> The pool in question is largely read-only (it is an openstack rbd image
> pool) so I can leave it like this for the time being. Help would be
> very much appreciated!
>
> Thank you,
>
> - Hein-Pieter van Braam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coded Pools Cause Irretrievable Objects and Possible Corruption

2016-05-05 Thread Samuel Just
Can you reproduce with

debug ms = 1
debug objecter = 20

on the radosgw side?
-Sam

On Thu, May 5, 2016 at 8:28 AM, Brian Felton  wrote:
> Greetings,
>
> We are running a number of Ceph clusters in production to provide object
> storage services.  We have stumbled upon an issue where objects of certain
> sizes are irretrievable.  The symptoms are very similar to the fix
> referenced here:
> https://www.redhat.com/archives/rhsa-announce/2015-November/msg00060.html.
> We can put objects into the cluster via s3/radosgw, but we cannot retrieve
> them (cluster closes the connection without delivering all bytes).
> Unfortunately, this fix does not apply to us, as we are and have always been
> running Hammer.  We've stumbled on a brand-new edge case.
>
> We have produced this issue on the 0.94.3, 0.94.4, and 0.94.6 releases of
> Hammer.
>
> We have produced this issues using three different storage hardware
> configurations -- 5 instances of clusters running 648 6TB OSDs across nine
> physical nodes, 1 cluster running 30 10GB OSDs across ten VM nodes, and 1
> cluster running 288 6TB OSDs across four physical nodes.
>
> We have determined that this issue only occurs when using erasure coding
> (we've only tested plugin=jerasure technique=reed_sol_van
> ruleset-failure-domain=host).
>
> Objects of exactly 4.5MiB (4718592 bytes) can be placed into the cluster but
> not retrieved.  At every interval of `rgw object stripe size` thereafter (in
> our case, 4 MiB), the objects are similarly irretrievable.  We have tested
> this from 4.5 to 24.5 MiB, then have spot-checked for much larger values to
> prove the pattern holds.  There is a small range of bytes less than this
> boundary that are irretrievable.  After much testing, we have found this
> boundary to be strongly correlated with the k value in our erasure coded
> pool.  We have observed that the m value in the erasure coding has no effect
> on the window size.  We have tested erasure coded values of k from 2 to 9,
> and we've observed the following ranges:
>
> k = 2, m = 1 -> No error
> k = 3, m = 1 -> 32 bytes (i.e. errors when objects are inclusively between
> 4718561 - 4718592 bytes)
> k = 3, m = 2 -> 32 bytes
> k = 4, m = 2 -> No error
> k = 4, m = 1 -> No error
> k = 5, m = 4 -> 128 bytes
> k = 6, m = 3 -> 512 bytes
> k = 6, m = 2 -> 512 bytes
> k = 7, m = 1 -> 800 bytes
> k = 7, m = 2 -> 800 bytes
> k = 8, m = 1 -> No error
> k = 9, m = 1 -> 800 bytes
>
> The "bytes" represent a 'dead zone' object size range wherein objects can be
> put into the cluster but not retrieved.  The range of bytes is 4.5MiB -
> (4.5MiB - buffer - 1) bytes. Up until k = 9, the error occurs for values of
> k that are not powers of two, at which point the "dead zone" window is
> (k-2)^2 * 32 bytes.  My team has not been able to determine why we plateau
> at 800 bytes (we expected a range of 1568 bytes here).
>
> This issue cannot be reproduced using rados to place objects directly into
> EC pools.  The issue has only been observed with using RadosGW's S3
> interface.
>
> The issue can be reproduced with any S3 client (s3cmd, s3curl, CyberDuck,
> CloudBerry Backup, and many others have been tested).
>
> At this point, we are evaluating the Ceph codebase in an attempt to patch
> the issue.  As this is an issue affecting data retrievability (and possibly
> integrity), we wanted to bring this to the attention of the community as
> soon as we could reproduce the issue.  We are hoping both that others out
> there can independently verify and possibly that some with a more intimate
> understanding of the codebase could investigate and propose a fix.  We have
> observed this issue in our production clusters, so it is a very high
> priority for my team.
>
> Furthermore, we believe the objects to be corrupted at the point they are
> placed into the cluster.  We have tested copying the .rgw.buckets pool to a
> non-erasure coded pool, then swapping names, and we have found that objects
> copied from the EC pool to the non-EC pool to be irretrievable once RGW is
> pointed to the non-EC pool.  If we overwrite the object in the non-EC pool
> with the original, it becomes retrievable again.  This has not been tested
> as exhaustively, though, but we felt it important enough to mention.
>
> I'm sure I've omitted some details here that would aid in an investigation,
> so please let me know what other information I can provide.  My team will be
> filing an issue shortly.
>
> Many thanks,
>
> Brian Felton
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Crashes

2016-04-29 Thread Samuel Just
You could strace the process to see precisely what ceph-osd is doing
to provoke the EIO.
-Sam

On Fri, Apr 29, 2016 at 9:03 AM, Somnath Roy <somnath@sandisk.com> wrote:
> Check system log and search for the corresponding drive. It should have the 
> information what is failing..
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Garg, Pankaj
> Sent: Friday, April 29, 2016 8:59 AM
> To: Samuel Just
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> I can see that. I guess what would that be symptomatic of? How is it doing 
> that on 6 different systems and on multiple OSDs?
>
> -Original Message-
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: Friday, April 29, 2016 8:57 AM
> To: Garg, Pankaj
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD Crashes
>
> Your fs is throwing an EIO on open.
> -Sam
>
> On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj 
> <pankaj.g...@caviumnetworks.com> wrote:
>> Hi,
>>
>> I had a fully functional Ceph cluster with 3 x86 Nodes and 3 ARM64
>> nodes, each with 12 HDD Drives and 2SSD Drives. All these were
>> initially running Hammer, and then were successfully updated to Infernalis 
>> (9.2.0).
>>
>> I recently deleted all my OSDs and swapped my drives with new ones on
>> the
>> x86 Systems, and the ARM servers were swapped with different ones
>> (keeping drives same).
>>
>> I again provisioned the OSDs, keeping the same cluster and Ceph
>> versions as before. But now, every time I try to run RADOS bench, my
>> OSDs start crashing (on both ARM and x86 servers).
>>
>> I’m not sure why this is happening on all 6 systems. On the x86, it’s
>> the same Ceph bits as before, and the only thing different is the new drives.
>>
>> It’s the same stack (pasted below) on all the OSDs too.
>>
>> Can anyone provide any clues?
>>
>>
>>
>> Thanks
>>
>> Pankaj
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>   -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 --
>> 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236
>> 
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26) v1  981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400
>> con 0x5634c5168420
>>
>>-13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq:
>> 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op:
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26)
>>
>>-12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq:
>> 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op:
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26)
>>
>>-11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq:
>> 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op:
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26)
>>
>>-10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq:
>> 29404, time: 0.00, event: dispatched, op:
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26)
>>
>> -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq:
>> 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op:
>> osd_repop(client.2794263.0:37721 284.6d4
>> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
>> 12284'26)
>>
>> -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284
>> tick_without_osd_lock
>>
>> -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 --
>> 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306
>>  osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 
>> 47+0+0
>> (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>>
>> -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 --
>> 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 --
>> osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0
>> 0x5634c7ba8000 con 0x5634c58dd760
>>
>> -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 --
>

Re: [ceph-users] OSD Crashes

2016-04-29 Thread Samuel Just
Your fs is throwing an EIO on open.
-Sam

On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj
 wrote:
> Hi,
>
> I had a fully functional Ceph cluster with 3 x86 Nodes and 3 ARM64 nodes,
> each with 12 HDD Drives and 2SSD Drives. All these were initially running
> Hammer, and then were successfully updated to Infernalis (9.2.0).
>
> I recently deleted all my OSDs and swapped my drives with new ones on the
> x86 Systems, and the ARM servers were swapped with different ones (keeping
> drives same).
>
> I again provisioned the OSDs, keeping the same cluster and Ceph versions as
> before. But now, every time I try to run RADOS bench, my OSDs start crashing
> (on both ARM and x86 servers).
>
> I’m not sure why this is happening on all 6 systems. On the x86, it’s the
> same Ceph bits as before, and the only thing different is the new drives.
>
> It’s the same stack (pasted below) on all the OSDs too.
>
> Can anyone provide any clues?
>
>
>
> Thanks
>
> Pankaj
>
>
>
>
>
>
>
>
>
>
>
>   -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 --
> 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236 
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1
>  981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420
>
>-13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>
>-12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>
>-11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>
>-10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 0.00, event: dispatched, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>
> -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>
> -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284
> tick_without_osd_lock
>
> -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 --
> 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 
> osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2  47+0+0
> (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>
> -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 --
> 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 --
> osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0
> 0x5634c7ba8000 con 0x5634c58dd760
>
> -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 --
> 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 
> osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2  47+0+0
> (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20
>
> -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 --
> 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply
> e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con
> 0x5634c58dcf20
>
> -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102) write couldn't open
> 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117)
> Structure needs cleaning
>
> -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102)  error (117) Structure needs cleaning
> not handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from
> 0)
>
> -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
>
>  0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In
> function 'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*,
> Index*)' thread 7f1f11791700 time 2016-04-28 08:09:46.173250
>
> os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)
>
>
>
> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
>
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x5634c02ec7eb]
>
> 2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool,
> std::shared_ptr*, Index*)+0x1191) [0x5634bffb2d01]
>
> 3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]
>
> 4: 

Re: [ceph-users] help troubleshooting some osd communication problems

2016-04-28 Thread Samuel Just
I'd guess that to make any progress we'll need debug ms = 20 on both
sides of the connection when a message is lost.
-Sam

On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell  wrote:
> there was a problem on one of the clusters i manage a couple weeks ago where
> pairs of OSDs would wait indefinitely on subops from the other OSD in the
> pair. we used a liberal dose of "ceph osd down ##" on the osds and
> eventually things just sorted them out a couple days later.
>
> it seems to have come back today and co-workers and i are stuck on trying to
> figure out why this is happening. here are the details that i know.
> currently 2 OSDs, 41 and 148, keep waiting on subops from each other
> resulting in lines such as the following in ceph.log.
>
> 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> 13:28:56.233001: osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0d08 [set-alloc-hint object_size 4194304
> write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 5,140,148
>
> 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
> [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> 13:28:57.070471: osd_op(client.24127500.0:2960618
> rbd_data.38178d8adeb4d.10f8 [set-alloc-hint object_size 8388608
> write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 41,115
>
> from digging in the logs, it appears like some messages are being lost
> between the OSDs. this is what osd.41 sees:
> -
> 2016-04-28 13:28:56.233702 7f3b171e0700  1 -- 10.217.72.22:6802/3769 <==
> client.11172360 10.217.72.41:0/6031968 6 
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> e159001) v5  236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> 0x1a78d9c0
> 2016-04-28 13:28:56.233983 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6db200 con 0x21add440
> 2016-04-28 13:28:56.234017 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6dd000 con 0x21ada000
> 2016-04-28 13:28:56.234046 7f3b49020700  1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
> 17.73 3df49873/rbd_data.36bfe359c4998.0d08/head//17 v
> 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
> 2016-04-28 13:28:56.243555 7f3b35976700  1 -- 10.217.89.22:6825/313003769
> <== osd.140 10.217.89.11:6810/4543 23 
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
>  83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
> 2016-04-28 13:28:56.257816 7f3b27d9b700  1 -- 10.217.89.22:6825/313003769
> <== osd.5 10.217.89.18:6806/1010441 35 
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
>  83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
>
>
> this, however is what osd.148 sees:
> -
> [ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
> 2016-04-28 13:29:33.470156 7f195fcfc700  1 -- 10.217.72.11:6820/6487 <==
> client.11172360 10.217.72.41:0/6031968 460 
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected
> e159002) v5  236+0+143360 (129493315 0 3953649960) 0x1edf2300 con
> 0x24dc0d60
>
> also, due to the ceph osd down commands, there is recovery that needs to
> happen for a PG shared between these OSDs that is never making any progress.
> its probably due to whatever is cause the repops to fail.
>
> i did some tcpdump on both sides limiting things to the ip addresses and
> ports being used by these two OSDs and see packets flowing between the two
> osds. i attempted to have wireshark decode the actual ceph traffic but it
> was only able to get bits and pieces of the ceph protocol bits but at least
> for the moment i'm blaming that on the ceph dissector for wireshark. there
> aren't any dropped or error packets on any of the network interfaces
> involved.
>
> does anyone have any ideas of where to look next or other tips for this?
> we've put debug_ms and debug_osd 

Re: [ceph-users] CEPH All OSD got segmentation fault after CRUSH edit

2016-04-26 Thread Samuel Just
I think?  Probably worth reproducing on a vstart cluster to validate
the fix.  Didn't we introduce something in the mon to validate new
crushmaps?  Hammer maybe?
-Sam

On Tue, Apr 26, 2016 at 8:09 AM, Wido den Hollander <w...@42on.com> wrote:
>
>> Op 26 april 2016 om 16:58 schreef Samuel Just <sj...@redhat.com>:
>>
>>
>> Can you attach the OSDMap (ceph osd getmap -o )?
>> -Sam
>>
>
> Henrik contacted me to look at this and this is what I found:
>
> 0x00b18b81 in crush_choose_firstn (map=map@entry=0x1f00200, 
> bucket=0x0, weight=weight@entry=0x1f2b880, weight_max=weight_max@entry=30, 
> x=x@entry=1731224833, numrep=2, type=1, out=0x7fffdc036508, outpos=0, 
> out_size=2, tries=51, recurse_tries=1, local_retries=0,
> local_fallback_retries=0, recurse_to_leaf=1, vary_r=0, 
> out2=0x7fffdc036510, parent_r=0) at crush/mapper.c:345
> 345 crush/mapper.c: No such file or directory.
>
> A bit more output from GDB:
>
> #0  0x00b18b81 in crush_choose_firstn (map=map@entry=0x1f00200, 
> bucket=0x0, weight=weight@entry=0x1f2b880, weight_max=weight_max@entry=30, 
> x=x@entry=1731224833, numrep=2, type=1, out=0x7fffdc036508, outpos=0, 
> out_size=2, tries=51, recurse_tries=1, local_retries=0,
> local_fallback_retries=0, recurse_to_leaf=1, vary_r=0, 
> out2=0x7fffdc036510, parent_r=0) at crush/mapper.c:345
> #1  0x00b194cb in crush_do_rule (map=0x1f00200, ruleno= out>, x=1731224833, result=0x7fffdc036520, result_max=, 
> weight=0x1f2b880, weight_max=30, scratch=) at 
> crush/mapper.c:794
> #2  0x00a61680 in do_rule (weight=std::vector of length 30, capacity 
> 30 = {...}, maxout=2, out=std::vector of length 0, capacity 0, x=1731224833, 
> rule=0, this=0x1f72340) at ./crush/CrushWrapper.h:939
> #3  OSDMap::_pg_to_osds (this=this@entry=0x1f46800, pool=..., pg=..., 
> osds=osds@entry=0x7fffdc036600, primary=primary@entry=0x7fffdc0365ec, 
> ppps=0x7fffdc0365f4) at osd/OSDMap.cc:1417
>
> It seems that CRUSH can't find entries in the CRUSHMap. In this case the 
> 'root default' was removed while the default ruleset still refers to it.
>
> The cluster is running 0.80.11
>
> I extracted the CRUSHMaps from the OSDMaps on osd.0:
>
> $ for i in {1392..1450}; do find -name "osdmap*${i}*" -exec osdmaptool 
> --export-crush /tmp/crush.${i} {} \;; crushtool -d /tmp/crush.${i} -o 
> /tmp/crush.${i}.txt; done
>
> Here I see that in map 1433 the root 'default' doesn't exist, but the crush 
> ruleset refers to 'bucket0'. This crushmap is attached.
>
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take bucket0
> step chooseleaf firstn 0 type host
> step emit
> }
>
> The root bucket0 doesn't exist.
>
> bucket0 seems like something which was created by Ceph/CRUSH and not by the 
> user.
>
> I'm thinking about injecting a fixed CRUSHMap into this OSDMap where bucket0 
> does exist. Does that seem like a sane thing to do?
>
> Wido
>
>
>> On Tue, Apr 26, 2016 at 2:07 AM, Henrik Svensson <henrik.svens...@sectra.com
>> > wrote:
>>
>> > Hi!
>> >
>> > We got a three node CEPH cluster with 10 OSD each.
>> >
>> > We bought 3 new machines with additional 30 disks that should reside in
>> > another location.
>> > Before adding these machines we modified the default CRUSH table.
>> >
>> > After modifying the (default) crush table with these commands the cluster
>> > went down:
>> >
>> > 
>> > ceph osd crush add-bucket dc1 datacenter
>> > ceph osd crush add-bucket dc2 datacenter
>> > ceph osd crush add-bucket availo datacenter
>> > ceph osd crush move dc1 root=default
>> > ceph osd crush move lkpsx0120 root=default datacenter=dc1
>> > ceph osd crush move lkpsx0130 root=default datacenter=dc1
>> > ceph osd crush move lkpsx0140 root=default datacenter=dc1
>> > ceph osd crush move dc2 root=default
>> > ceph osd crush move availo root=default
>> > ceph osd crush add-bucket sectra root
>> > ceph osd crush move dc1 root=sectra
>> > ceph osd crush move dc2 root=sectra
>> > ceph osd crush move dc3 root=sectra
>> > ceph osd crush move availo root=sectra
>> > ceph osd crush remove default
>> > 
>> >
>> > I tried to revert the CRUSH map but no luck:
>> >
>> > 
>> > ceph osd crush add-bucket default root
>> > ceph osd crush move lkpsx0120 root=default
>> > ceph osd crush move lkpsx0130 root=defa

Re: [ceph-users] CEPH All OSD got segmentation fault after CRUSH edit

2016-04-26 Thread Samuel Just
Can you attach the OSDMap (ceph osd getmap -o )?
-Sam

On Tue, Apr 26, 2016 at 2:07 AM, Henrik Svensson  wrote:

> Hi!
>
> We got a three node CEPH cluster with 10 OSD each.
>
> We bought 3 new machines with additional 30 disks that should reside in
> another location.
> Before adding these machines we modified the default CRUSH table.
>
> After modifying the (default) crush table with these commands the cluster
> went down:
>
> 
> ceph osd crush add-bucket dc1 datacenter
> ceph osd crush add-bucket dc2 datacenter
> ceph osd crush add-bucket availo datacenter
> ceph osd crush move dc1 root=default
> ceph osd crush move lkpsx0120 root=default datacenter=dc1
> ceph osd crush move lkpsx0130 root=default datacenter=dc1
> ceph osd crush move lkpsx0140 root=default datacenter=dc1
> ceph osd crush move dc2 root=default
> ceph osd crush move availo root=default
> ceph osd crush add-bucket sectra root
> ceph osd crush move dc1 root=sectra
> ceph osd crush move dc2 root=sectra
> ceph osd crush move dc3 root=sectra
> ceph osd crush move availo root=sectra
> ceph osd crush remove default
> 
>
> I tried to revert the CRUSH map but no luck:
>
> 
> ceph osd crush add-bucket default root
> ceph osd crush move lkpsx0120 root=default
> ceph osd crush move lkpsx0130 root=default
> ceph osd crush move lkpsx0140 root=default
> ceph osd crush remove sectra
> 
>
> After trying to restart the cluster (and even the machines) no OSD started
> up again.
> But ceph osd tree gave this output, stating certain OSD:s are up (but the
> processes are not running):
>
> 
> # id weight type name up/down reweight
> -1 163.8 root default
> -2 54.6 host lkpsx0120
> 0 5.46 osd.0 down 0
> 1 5.46 osd.1 down 0
> 2 5.46 osd.2 down 0
> 3 5.46 osd.3 down 0
> 4 5.46 osd.4 down 0
> 5 5.46 osd.5 down 0
> 6 5.46 osd.6 down 0
> 7 5.46 osd.7 down 0
> 8 5.46 osd.8 down 0
> 9 5.46 osd.9 down 0
> -3 54.6 host lkpsx0130
> 10 5.46 osd.10 down 0
> 11 5.46 osd.11 down 0
> 12 5.46 osd.12 down 0
> 13 5.46 osd.13 down 0
> 14 5.46 osd.14 down 0
> 15 5.46 osd.15 down 0
> 16 5.46 osd.16 down 0
> 17 5.46 osd.17 down 0
> 18 5.46 osd.18 up 1
> 19 5.46 osd.19 up 1
> -4 54.6 host lkpsx0140
> 20 5.46 osd.20 up 1
> 21 5.46 osd.21 down 0
> 22 5.46 osd.22 down 0
> 23 5.46 osd.23 down 0
> 24 5.46 osd.24 down 0
> 25 5.46 osd.25 up 1
> 26 5.46 osd.26 up 1
> 27 5.46 osd.27 up 1
> 28 5.46 osd.28 up 1
> 29 5.46 osd.29 up 1
> 
>
> The monitor starts/restarts OK (only one monitor exists).
> But when starting one OSD with ceph -w nothing shows.
>
> Here is the ceph mon_status:
>
> 
> { "name": "lkpsx0120",
>   "rank": 0,
>   "state": "leader",
>   "election_epoch": 1,
>   "quorum": [
> 0],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 4,
>   "fsid": "9244194a-5e10-47ae-9287-507944612f95",
>   "modified": "0.00",
>   "created": "0.00",
>   "mons": [
> { "rank": 0,
>   "name": "lkpsx0120",
>   "addr": "10.15.2.120:6789\/0"}]}}
> 
>
> Here is the ceph.conf file
>
> 
> [global]
> fsid = 9244194a-5e10-47ae-9287-507944612f95
> mon_initial_members = lkpsx0120
> mon_host = 10.15.2.120
> #debug osd = 20
> #debug ms = 1
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_crush_chooseleaf_type = 1
> osd_pool_default_size = 2
> public_network = 10.15.2.0/24
> cluster_network = 10.15.4.0/24
> rbd_cache = true
> rbd_cache_size = 67108864
> rbd_cache_max_dirty = 50331648
> rbd_cache_target_dirty = 33554432
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = true
> 
>
> Here is the decompiled crush map:
>
> 
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
> device 28 osd.28
> device 29 osd.29
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host lkpsx0120 {
> id -2 # do not change unnecessarily
> # weight 54.600
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 5.460
> item osd.1 weight 5.460
> item osd.2 weight 

Re: [ceph-users] Deprecating ext4 support

2016-04-14 Thread Samuel Just
It doesn't seem like it would be wise to run such systems on top of rbd.
-Sam

On Thu, Apr 14, 2016 at 11:05 AM, Jianjian Huo  wrote:
> On Wed, Apr 13, 2016 at 6:06 AM, Sage Weil  wrote:
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Who needs to have exactly the same data in two separate objects
>>> (replicas)? Ceph needs it because "consistency"?, but the app (VM
>>> filesystem) is fine with whatever version because the flush didn't
>>> happen (if it did the contents would be the same).
>>
>> While we're talking/thinking about this, here's a simple example of why
>> the simple solution (let the replicas be out of sync), which seems
>> reasonable at first, can blow up in your face.
>>
>> If a disk block contains A and you write B over the top of it and then
>> there is a failure (e.g. power loss before you issue a flush), it's okay
>> for the disk to contain either A or B.  In a replicated system, let's say
>> 2x mirroring (call them R1 and R2), you might end up with B on R1 and A
>> on R2.  If you don't immediately clean it up, then at some point down the
>> line you might switch from reading R1 to reading R2 and the disk block
>> will go "back in time" (previously you read B, now you read A).  A
>> single disk/replica will never do that, and applications can break.
>>
>> For example, if the block in question is a journal block, we might see B
>> the first time (valid journal!), the do a bunch of work and
>> journal/write new stuff to the blocks that follow.  Then we lose
>> power again, lose R1, replay the journal, read A from R2, and stop journal
>> replay early... missing out on all the new stuff.  This can easily corrupt
>> a file system or database or whatever else.
>
> If data is critical, applications use their own replicas, MySQL,
> Cassandra, MongoDB... if above scenario happens and one replica is out
> of sync, they use quorum like protocol to guarantee reading the latest
> data, and repair those out-of-sync replicas. so eventual consistency
> in storage is acceptable for them?
>
> Jianjian
>>
>> It might sound unlikely, but keep in mind that writes to these
>> all-important metadata and commit blocks are extremely frequent.  It's the
>> kind of thing you can usually get away with, until you don't, and then you
>> have a very bad day...
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread Samuel Just
What's the kernel version?
-Sam

On Tue, Mar 29, 2016 at 1:33 PM, German Anders <gand...@despegar.com> wrote:
> On the host:
>
> # ceph --cluster cephIB --version
> ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
>
> # rbd --version
> ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
>
> If I run the command without root or sudo the command failed with a
> Permission denied
>
> German
>
> 2016-03-29 17:24 GMT-03:00 Samuel Just <sj...@redhat.com>:
>>
>> Or you needed to run it as root?
>> -Sam
>>
>> On Tue, Mar 29, 2016 at 1:24 PM, Samuel Just <sj...@redhat.com> wrote:
>> > Sounds like a version/compatibility thing.  Are your rbd clients really
>> > old?
>> > -Sam
>> >
>> > On Tue, Mar 29, 2016 at 1:19 PM, German Anders <gand...@despegar.com>
>> > wrote:
>> >> I've just upgrade to jewel, and the scrubbing seems to been
>> >> corrected... but
>> >> now I'm not able to map an rbd on a host (before I was able to),
>> >> basically
>> >> I'm getting this error msg:
>> >>
>> >> rbd: sysfs write failed
>> >> rbd: map failed: (5) Input/output error
>> >>
>> >> # rbd --cluster cephIB create host01 --size 102400 --pool
>> >> cinder-volumes -k
>> >> /etc/ceph/cephIB.client.cinder.keyring
>> >> # rbd --cluster cephIB map host01 --pool cinder-volumes -k
>> >> /etc/ceph/cephIB.client.cinder.keyring
>> >> rbd: sysfs write failed
>> >> rbd: map failed: (5) Input/output error
>> >>
>> >> Any ideas? on the /etc/ceph directory on the host I've:
>> >>
>> >> -rw-r--r-- 1 ceph ceph  92 Nov 17 15:45 rbdmap
>> >> -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
>> >> -rw-r--r-- 1 ceph ceph  37 Dec 15 15:12 virsh-secret
>> >> -rw-r--r-- 1 ceph ceph   0 Dec 15 15:12 virsh-secret-set
>> >> -rw-r--r-- 1 ceph ceph  37 Dec 21 14:53 virsh-secretIB
>> >> -rw-r--r-- 1 ceph ceph   0 Dec 21 14:53 virsh-secret-setIB
>> >> -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
>> >> -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
>> >> -rw-r--r-- 1 ceph ceph  72 Dec 23 09:51 ceph.client.cinder.keyring
>> >> -rw-r--r-- 1 ceph ceph  63 Mar 28 09:03 cephIB.client.cinder.keyring
>> >> -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
>> >> -rw--- 1 ceph ceph  63 Mar 29 16:11 cephIB.client.admin.keyring
>> >>
>> >> Thanks in advance,
>> >>
>> >> Best,
>> >>
>> >> German
>> >>
>> >> 2016-03-29 14:45 GMT-03:00 German Anders <gand...@despegar.com>:
>> >>>
>> >>> Sure, also the scrubbing is happening on all the osds :S
>> >>>
>> >>> # ceph --cluster cephIB daemon osd.4 config diff
>> >>> {
>> >>> "diff": {
>> >>> "current": {
>> >>> "admin_socket": "\/var\/run\/ceph\/cephIB-osd.4.asok",
>> >>> "auth_client_required": "cephx",
>> >>> "filestore_fd_cache_size": "10240",
>> >>> "filestore_journal_writeahead": "true",
>> >>> "filestore_max_sync_interval": "10",
>> >>> "filestore_merge_threshold": "40",
>> >>> "filestore_op_threads": "20",
>> >>> "filestore_queue_max_ops": "10",
>> >>> "filestore_split_multiple": "8",
>> >>> "fsid": "a4bce51b-4d6b-4394-9737-3e4d9f5efed2",
>> >>> "internal_safe_to_start_threads": "true",
>> >>> "keyring": "\/var\/lib\/ceph\/osd\/cephIB-4\/keyring",
>> >>> "leveldb_log": "",
>> >>> "log_file": "\/var\/log\/ceph\/cephIB-osd.4.log",
>> >>> "log_to_stderr": "false",
>> >>> "mds_data": "\/var\/lib\/ceph\/mds\/cephIB-4",
>> >>> "mon_cluster_log_file":
>> >>> "default=\/var\/log\/ceph\/cephIB.$channel.log
>> >>> cluste

Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread Samuel Just
Or you needed to run it as root?
-Sam

On Tue, Mar 29, 2016 at 1:24 PM, Samuel Just <sj...@redhat.com> wrote:
> Sounds like a version/compatibility thing.  Are your rbd clients really old?
> -Sam
>
> On Tue, Mar 29, 2016 at 1:19 PM, German Anders <gand...@despegar.com> wrote:
>> I've just upgrade to jewel, and the scrubbing seems to been corrected... but
>> now I'm not able to map an rbd on a host (before I was able to), basically
>> I'm getting this error msg:
>>
>> rbd: sysfs write failed
>> rbd: map failed: (5) Input/output error
>>
>> # rbd --cluster cephIB create host01 --size 102400 --pool cinder-volumes -k
>> /etc/ceph/cephIB.client.cinder.keyring
>> # rbd --cluster cephIB map host01 --pool cinder-volumes -k
>> /etc/ceph/cephIB.client.cinder.keyring
>> rbd: sysfs write failed
>> rbd: map failed: (5) Input/output error
>>
>> Any ideas? on the /etc/ceph directory on the host I've:
>>
>> -rw-r--r-- 1 ceph ceph  92 Nov 17 15:45 rbdmap
>> -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
>> -rw-r--r-- 1 ceph ceph  37 Dec 15 15:12 virsh-secret
>> -rw-r--r-- 1 ceph ceph   0 Dec 15 15:12 virsh-secret-set
>> -rw-r--r-- 1 ceph ceph  37 Dec 21 14:53 virsh-secretIB
>> -rw-r--r-- 1 ceph ceph   0 Dec 21 14:53 virsh-secret-setIB
>> -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
>> -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
>> -rw-r--r-- 1 ceph ceph  72 Dec 23 09:51 ceph.client.cinder.keyring
>> -rw-r--r-- 1 ceph ceph  63 Mar 28 09:03 cephIB.client.cinder.keyring
>> -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
>> -rw--- 1 ceph ceph  63 Mar 29 16:11 cephIB.client.admin.keyring
>>
>> Thanks in advance,
>>
>> Best,
>>
>> German
>>
>> 2016-03-29 14:45 GMT-03:00 German Anders <gand...@despegar.com>:
>>>
>>> Sure, also the scrubbing is happening on all the osds :S
>>>
>>> # ceph --cluster cephIB daemon osd.4 config diff
>>> {
>>> "diff": {
>>> "current": {
>>> "admin_socket": "\/var\/run\/ceph\/cephIB-osd.4.asok",
>>> "auth_client_required": "cephx",
>>> "filestore_fd_cache_size": "10240",
>>> "filestore_journal_writeahead": "true",
>>> "filestore_max_sync_interval": "10",
>>> "filestore_merge_threshold": "40",
>>> "filestore_op_threads": "20",
>>> "filestore_queue_max_ops": "10",
>>> "filestore_split_multiple": "8",
>>> "fsid": "a4bce51b-4d6b-4394-9737-3e4d9f5efed2",
>>> "internal_safe_to_start_threads": "true",
>>> "keyring": "\/var\/lib\/ceph\/osd\/cephIB-4\/keyring",
>>> "leveldb_log": "",
>>> "log_file": "\/var\/log\/ceph\/cephIB-osd.4.log",
>>> "log_to_stderr": "false",
>>> "mds_data": "\/var\/lib\/ceph\/mds\/cephIB-4",
>>> "mon_cluster_log_file":
>>> "default=\/var\/log\/ceph\/cephIB.$channel.log
>>> cluster=\/var\/log\/ceph\/cephIB.log",
>>> "mon_data": "\/var\/lib\/ceph\/mon\/cephIB-4",
>>> "mon_debug_dump_location":
>>> "\/var\/log\/ceph\/cephIB-osd.4.tdump",
>>> "mon_host": "172.23.16.1,172.23.16.2,172.23.16.3",
>>> "mon_initial_members": "cibm01, cibm02, cibm03",
>>> "osd_data": "\/var\/lib\/ceph\/osd\/cephIB-4",
>>> "osd_journal": "\/var\/lib\/ceph\/osd\/cephIB-4\/journal",
>>> "osd_op_threads": "8",
>>> "rgw_data": "\/var\/lib\/ceph\/radosgw\/cephIB-4",
>>> "setgroup": "ceph",
>>> "setuser": "ceph"
>>> },
>>> "defaults": {
>>> "admin_socket": "\/var\/run\/ceph\/ceph-osd.4.asok",
>>> "auth_client_required": "cephx, none",
>>> "filestore_fd_cache_size": "128",
>

Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread Samuel Just
;: "2",
>> "fsid": "----",
>> "internal_safe_to_start_threads": "false",
>> "keyring":
>> "\/etc\/ceph\/ceph.osd.4.keyring,\/etc\/ceph\/ceph.keyring,\/etc\/ceph\/keyring,\/etc\/ceph\/keyring.bin",
>> "leveldb_log": "\/dev\/null",
>> "log_file": "\/var\/log\/ceph\/ceph-osd.4.log",
>> "log_to_stderr": "true",
>> "mds_data": "\/var\/lib\/ceph\/mds\/ceph-4",
>> "mon_cluster_log_file":
>> "default=\/var\/log\/ceph\/ceph.$channel.log
>> cluster=\/var\/log\/ceph\/ceph.log",
>> "mon_data": "\/var\/lib\/ceph\/mon\/ceph-4",
>> "mon_debug_dump_location":
>> "\/var\/log\/ceph\/ceph-osd.4.tdump",
>> "mon_host": "",
>> "mon_initial_members": "",
>> "osd_data": "\/var\/lib\/ceph\/osd\/ceph-4",
>> "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-4\/journal",
>> "osd_op_threads": "2",
>> "rgw_data": "\/var\/lib\/ceph\/radosgw\/ceph-4",
>> "setgroup": "",
>> "setuser": ""
>> }
>> },
>> "unknown": []
>> }
>>
>>
>> Thanks a lot!
>>
>> Best,
>>
>>
>> German
>>
>> 2016-03-29 14:10 GMT-03:00 Samuel Just <sj...@redhat.com>:
>>>
>>> That seems to be scrubbing pretty often.  Can you attach a config diff
>>> from osd.4 (ceph daemon osd.4 config diff)?
>>> -Sam
>>>
>>> On Tue, Mar 29, 2016 at 9:30 AM, German Anders <gand...@despegar.com>
>>> wrote:
>>> > Hi All,
>>> >
>>> > I've maybe a simple question, I've setup a new cluster with Infernalis
>>> > release, there's no IO going on at the cluster level and I'm receiving
>>> > a lot
>>> > of these messages:
>>> >
>>> > 2016-03-29 12:22:07.462818 mon.0 [INF] pgmap v158062: 8192 pgs: 8192
>>> > active+clean; 20617 MB data, 46164 MB used, 52484 GB / 52529 GB avail
>>> > 2016-03-29 12:22:08.176684 osd.13 [INF] 0.d38 scrub starts
>>> > 2016-03-29 12:22:08.179841 osd.13 [INF] 0.d38 scrub ok
>>> > 2016-03-29 12:21:59.526355 osd.9 [INF] 0.8a6 scrub starts
>>> > 2016-03-29 12:21:59.529582 osd.9 [INF] 0.8a6 scrub ok
>>> > 2016-03-29 12:22:03.004107 osd.4 [INF] 0.38b scrub starts
>>> > 2016-03-29 12:22:03.007220 osd.4 [INF] 0.38b scrub ok
>>> > 2016-03-29 12:22:03.617706 osd.21 [INF] 0.525 scrub starts
>>> > 2016-03-29 12:22:03.621073 osd.21 [INF] 0.525 scrub ok
>>> > 2016-03-29 12:22:06.527264 osd.9 [INF] 0.8a6 scrub starts
>>> > 2016-03-29 12:22:06.529150 osd.9 [INF] 0.8a6 scrub ok
>>> > 2016-03-29 12:22:07.005628 osd.4 [INF] 0.38b scrub starts
>>> > 2016-03-29 12:22:07.009776 osd.4 [INF] 0.38b scrub ok
>>> > 2016-03-29 12:22:07.618191 osd.21 [INF] 0.525 scrub starts
>>> > 2016-03-29 12:22:07.621363 osd.21 [INF] 0.525 scrub ok
>>> >
>>> >
>>> > I mean, all the time, and AFAIK these is because the scrub operation is
>>> > like
>>> > an fsck on the object level, so this make me think that it's not a
>>> > normal
>>> > situation. Is there any command that I can run in order to check this?
>>> >
>>> > # ceph --cluster cephIB health detail
>>> > HEALTH_OK
>>> >
>>> >
>>> > Thanks in advance,
>>> >
>>> > Best,
>>> >
>>> > German
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread Samuel Just
That seems to be scrubbing pretty often.  Can you attach a config diff
from osd.4 (ceph daemon osd.4 config diff)?
-Sam

On Tue, Mar 29, 2016 at 9:30 AM, German Anders  wrote:
> Hi All,
>
> I've maybe a simple question, I've setup a new cluster with Infernalis
> release, there's no IO going on at the cluster level and I'm receiving a lot
> of these messages:
>
> 2016-03-29 12:22:07.462818 mon.0 [INF] pgmap v158062: 8192 pgs: 8192
> active+clean; 20617 MB data, 46164 MB used, 52484 GB / 52529 GB avail
> 2016-03-29 12:22:08.176684 osd.13 [INF] 0.d38 scrub starts
> 2016-03-29 12:22:08.179841 osd.13 [INF] 0.d38 scrub ok
> 2016-03-29 12:21:59.526355 osd.9 [INF] 0.8a6 scrub starts
> 2016-03-29 12:21:59.529582 osd.9 [INF] 0.8a6 scrub ok
> 2016-03-29 12:22:03.004107 osd.4 [INF] 0.38b scrub starts
> 2016-03-29 12:22:03.007220 osd.4 [INF] 0.38b scrub ok
> 2016-03-29 12:22:03.617706 osd.21 [INF] 0.525 scrub starts
> 2016-03-29 12:22:03.621073 osd.21 [INF] 0.525 scrub ok
> 2016-03-29 12:22:06.527264 osd.9 [INF] 0.8a6 scrub starts
> 2016-03-29 12:22:06.529150 osd.9 [INF] 0.8a6 scrub ok
> 2016-03-29 12:22:07.005628 osd.4 [INF] 0.38b scrub starts
> 2016-03-29 12:22:07.009776 osd.4 [INF] 0.38b scrub ok
> 2016-03-29 12:22:07.618191 osd.21 [INF] 0.525 scrub starts
> 2016-03-29 12:22:07.621363 osd.21 [INF] 0.525 scrub ok
>
>
> I mean, all the time, and AFAIK these is because the scrub operation is like
> an fsck on the object level, so this make me think that it's not a normal
> situation. Is there any command that I can run in order to check this?
>
> # ceph --cluster cephIB health detail
> HEALTH_OK
>
>
> Thanks in advance,
>
> Best,
>
> German
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Ok, like I said, most files with _long at the end are *not orphaned*.
The generation number also is *not* an indication of whether the file
is orphaned -- some of the orphaned files will have 
as the generation number and others won't.  For each long filename
object in a pg you would have to:
1) Pull the long name out of the attr
2) Parse the hash out of the long name
3) Turn that into a directory path
4) Determine whether the file is at the right place in the path
5) If not, remove it (or echo it to be checked)

You probably want to wait for someone to get around to writing a
branch for ceph-objectstore-tool.  Should happen in the next week or
two.
-Sam

On Wed, Mar 16, 2016 at 1:36 PM, Jeffrey McDonald  wrote:
> Hi Sam,
>
> I've written a script but i'm a little leary of unleasing it until I find a
> few more cases to test.   The script successfully removed the file mentioned
> above.
> I took the next pg which was marked inconsistent and ran the following
> command over those pg directory structures:
>
> find . -name "*_long" -exec xattr -p user.cephos.lfn3 {} +  | grep -v
> 
>
> I didn't find any files that "orphaned" by this command.   All of these
> files should have "_long" and the grep should pull out the invalid
> generation, correct?
>
> I'm looking wider but in the next pg marked inconsistent I didn't find any
> orphans.
>
> Thanks,
> Jeff
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Oh, it's getting a stat mismatch.  I think what happened is that on
one of the earlier repairs it reset the stats to the wrong value (the
orphan was causing the primary to scan two objects twice, which
matches the stat mismatch I see here).  A pg repair repair will clear
that up.
-Sam

On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Thanks Sam.
>
> Since I have prepared a script for this, I decided to go ahead with the
> checks.(patience isn't one of my extended attributes)
>
> I've got a file that searches the full erasure encoded spaces and does your
> checklist below.   I have operated only on one PG so far, the 70.459 one
> that we've been discussing.There was only the one file that I found to
> be out of place--the one we already discussed/found and it has been removed.
>
> The pg is still marked as inconsistent.   I've scrubbed it a couple of times
> now and what I've seen is:
>
> 2016-03-17 09:29:53.202818 7f2e816f8700  0 log_channel(cluster) log [INF] :
> 70.459 deep-scrub starts
> 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459 deep-scrub 1 errors
> 2016-03-17 09:44:23.592302 7f2e816f8700  0 log_channel(cluster) log [INF] :
> 70.459 deep-scrub starts
> 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
> 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
> 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log [ERR] :
> 70.459 deep-scrub 1 errors
>
>
> Should the scrub be sufficient to remove the inconsistent flag?   I took the
> osd offline during the repairs.I've looked at files in all of the osds
> in the placement group and I'm not finding any more problem files.The
> vast majority of files do not have the user.cephos.lfn3 attribute.There
> are 22321 objects that I seen and only about 230 have the user.cephos.lfn3
> file attribute.   The files will have other attributes, just not
> user.cephos.lfn3.
>
> Regards,
> Jeff
>
>
> On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Ok, like I said, most files with _long at the end are *not orphaned*.
>> The generation number also is *not* an indication of whether the file
>> is orphaned -- some of the orphaned files will have 
>> as the generation number and others won't.  For each long filename
>> object in a pg you would have to:
>> 1) Pull the long name out of the attr
>> 2) Parse the hash out of the long name
>> 3) Turn that into a directory path
>> 4) Determine whether the file is at the right place in the path
>> 5) If not, remove it (or echo it to be checked)
>>
>> You probably want to wait for someone to get around to writing a
>> branch for ceph-objectstore-tool.  Should happen in the next week or
>> two.
>> -Sam
>>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
x 2 root root 16384 Feb 10 00:01 DIR_C
> drwxr-xr-x 2 root root 16384 Mar  4 10:50 DIR_7
> drwxr-xr-x 2 root root 16384 Mar  4 16:46 DIR_A
> drwxr-xr-x 2 root root 16384 Mar  5 02:37 DIR_2
> drwxr-xr-x 2 root root 16384 Mar  5 17:39 DIR_4
> drwxr-xr-x 2 root root 16384 Mar  8 16:50 DIR_F
> drwxr-xr-x 2 root root 16384 Mar 15 15:51 DIR_8
> drwxr-xr-x 2 root root 16384 Mar 15 21:18 DIR_D
> drwxr-xr-x 2 root root 16384 Mar 15 22:25 DIR_0
> drwxr-xr-x 2 root root 16384 Mar 15 22:35 DIR_9
> drwxr-xr-x 2 root root 16384 Mar 15 22:56 DIR_E
> drwxr-xr-x 2 root root 16384 Mar 15 23:21 DIR_1
> drwxr-xr-x 2 root root 12288 Mar 16 00:07 DIR_B
> drwxr-xr-x 2 root root 16384 Mar 16 00:34 DIR_5
>
> I assume that this file is an issue as well..and needs to be removed.
>
>
> then, in the directory where the file should be, I have the same file:
>
> root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D/DIR_E#
> ls -ltr | grep -v __head_
> total 64840
> -rw-r--r-- 1 root root 1048576 Jan 23 21:49
> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>
> In the directory DIR_E here (from above), there is only one file without a
> __head_ in the pathname -- the file aboveShould I be deleting both these
> _long files without the __head_ in DIR_E and in one above .../DIR_E?
>
> Since there is no directory structure HASH in these files, is that the
> indication that it is an orphan?
>
> Thanks,
> Jeff
>
>
>
>
> On Tue, Mar 15, 2016 at 8:38 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Ah, actually, I think there will be duplicates only around half the
>> time -- either the old link or the new link could be orphaned
>> depending on which xfs decides to list first.  Only if the old link is
>> orphaned will it match the name of the object once it's recreated.  I
>> should be able to find time to put together a branch in the next week
>> or two if you want to wait.  It's still probably worth trying removing
>> that object in 70.459.
>> -Sam
>>
>> On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just <sj...@redhat.com> wrote:
>> > The bug is entirely independent of hardware issues -- entirely a ceph
>> > bug.  xfs doesn't let us specify an ordering when reading a directory,
>> > so we have to keep directory sizes small.  That means that when one of
>> > those pg collection subfolders has 320 files in it, we split it into
>> > up to 16 smaller directories.  Overwriting or removing an ec object
>> > requires us to rename the old version out of the way in case we need
>> > to roll back (that's the generation number I mentioned above).  For
>> > crash safety, this involves first creating a link to the new name,
>> > then removing the old one.  Both the old and new link will be in the
>> > same subdirectory.  If creating the new link pushes the directory to
>> > 320 files then we do a split while both links are present.  If the
>> > file in question is using the special long filename handling, then a
>> > bug in the resulting link juggling causes us to orphan the old version
>> > of the file.  Your cluster seems to have an unusual number of objects
>> > with very long names, which is why it is so visible on your cluster.
>> >
>> > There are critical pool sizes where the PGs will all be close to one
>> > of those limits.  It's possible you are not close to one of those
>> > limits.  It's also possible you are nearing one now.  In any case, the
>> > remapping gave the orphaned files an opportunity to cause trouble, but
>> > they don't appear due to remapping.
>> > -Sam
>> >
>> > On Tue, Mar 15, 2016 at 5:41 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> > wrote:
>> >> One more question.did we hit the bug because we had hardware issues
>> >> during the remapping or would it have happened regardless of the
>> >> hardware
>> >> issues?   e.g. I'm not planning to add any additional hardware soon,
>> >> but
>> >> would the bug pop again on an (unpatched) system not subject to any
>> >> remapping?
>> >>
>> >> thanks,
>> >> jeff
>> >>
>> >> On Tue, Mar 15, 2016 at 7:27 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> [back on list]
>> >>>
>> >>> ceph-objectstore-tool has a whole bunch o

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Yep, thanks for all the help tracking down the root cause!
-Sam

On Thu, Mar 17, 2016 at 10:50 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Great, I just recovered the first placement group from this error.   To be
> sure, I  ran a deep-scrub and that comes back clean.
>
> Thanks for all your help.
> Regards,
> Jeff
>
> On Thu, Mar 17, 2016 at 11:58 AM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Oh, it's getting a stat mismatch.  I think what happened is that on
>> one of the earlier repairs it reset the stats to the wrong value (the
>> orphan was causing the primary to scan two objects twice, which
>> matches the stat mismatch I see here).  A pg repair repair will clear
>> that up.
>> -Sam
>>
>> On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>> wrote:
>> > Thanks Sam.
>> >
>> > Since I have prepared a script for this, I decided to go ahead with the
>> > checks.(patience isn't one of my extended attributes)
>> >
>> > I've got a file that searches the full erasure encoded spaces and does
>> > your
>> > checklist below.   I have operated only on one PG so far, the 70.459 one
>> > that we've been discussing.There was only the one file that I found
>> > to
>> > be out of place--the one we already discussed/found and it has been
>> > removed.
>> >
>> > The pg is still marked as inconsistent.   I've scrubbed it a couple of
>> > times
>> > now and what I've seen is:
>> >
>> > 2016-03-17 09:29:53.202818 7f2e816f8700  0 log_channel(cluster) log
>> > [INF] :
>> > 70.459 deep-scrub starts
>> > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
>> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
>> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
>> > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459 deep-scrub 1 errors
>> > 2016-03-17 09:44:23.592302 7f2e816f8700  0 log_channel(cluster) log
>> > [INF] :
>> > 70.459 deep-scrub starts
>> > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones,
>> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
>> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes.
>> > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log
>> > [ERR] :
>> > 70.459 deep-scrub 1 errors
>> >
>> >
>> > Should the scrub be sufficient to remove the inconsistent flag?   I took
>> > the
>> > osd offline during the repairs.I've looked at files in all of the
>> > osds
>> > in the placement group and I'm not finding any more problem files.
>> > The
>> > vast majority of files do not have the user.cephos.lfn3 attribute.
>> > There
>> > are 22321 objects that I seen and only about 230 have the
>> > user.cephos.lfn3
>> > file attribute.   The files will have other attributes, just not
>> > user.cephos.lfn3.
>> >
>> > Regards,
>> > Jeff
>> >
>> >
>> > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> Ok, like I said, most files with _long at the end are *not orphaned*.
>> >> The generation number also is *not* an indication of whether the file
>> >> is orphaned -- some of the orphaned files will have 
>> >> as the generation number and others won't.  For each long filename
>> >> object in a pg you would have to:
>> >> 1) Pull the long name out of the attr
>> >> 2) Parse the hash out of the long name
>> >> 3) Turn that into a directory path
>> >> 4) Determine whether the file is at the right place in the path
>> >> 5) If not, remove it (or echo it to be checked)
>> >>
>> >> You probably want to wait for someone to get around to writing a
>> >> branch for ceph-objectstore-tool.  Should happen in the next week or
>> >> two.
>> >> -Sam
>> >>
>> >
>> > --
>> >
>> > Jeffrey McDonald, PhD
>> > Assistant Director for HPC Operations
>> > Minnesota Supercomputing Institute
>> > University of Minnesota Twin Cities
>> > 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
>> > 117 Pleasant St SE   phone: +1 612 625-6905
>> > Minneapolis, MN 55455fax:   +1 612 624-8861
>> >
>> >
>
>
>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-19 Thread Samuel Just
Basically, the lookup process is:

try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/DIR_7...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/...doesn't exist
try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/...does exist, object must be here

If DIR_E did not exist, then it would check DIR_9/DIR_5/DIR_4/DIR_D
and so on.  The hash is always 32 bit (8 hex digits) -- baked into the
rados object distribution algorithms.  When DIR_E hits the threshhold
(320 iirc), the objects (files) in that directory will be moved one
more directory deeper.  An object with hash 79CED459 would then be in
DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/.

Basically, the depth of the tree is dynamic.  The file will be in the
deepest existing path that matches the hash (might even be different
between replicas, the tree structure is purely internal to the
filestore).
-Sam

On Wed, Mar 16, 2016 at 10:46 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> OK, I think I have it now.   I do have one more question, in this case, the
> hash indicates the directory structure but how do I know from the hash how
> many levels I should go down.If the hash is a 32-bit hex integer, *how
> do I know how many should be included as part of the hash for the directory
> structure*?
>
> e.g. our example: the hash is 79CED459 and the directory is then the last
> five taken in reverse order, what happens if there are only 4 levels of
> hierarchy?I only have this one example so far.is the 79C of the hash
> constant?   Would the hash pick up another hex character if the pg splits
> again?
>
> Thanks,
> Jeff
>
> On Wed, Mar 16, 2016 at 10:24 AM, Samuel Just <sj...@redhat.com> wrote:
>>
>> There is a directory structure hash, it's just that it's at the end of
>> the name and you'll have to check the xattr I mentioned to find it.
>>
>> I think that file is actually the one we are talking about removing.
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> user.cephos.lfn3:
>>
>> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0
>>
>> Notice that the user.cephosd.lfn3 attr has the full name, and it
>> *does* have a hash 79CED459 (you referred to it as a directory hash I
>> think, but it's actually the hash we used to place it on this osd to
>> begin with).
>>
>> In specifically this case, you shouldn't find any files in the
>> DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories
>> (so all hash values should hash to one of those).
>>
>> The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's
>> the actual object file, don't remove that.  If you look at the attr:
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> user.cephos.lfn3:
>>
>> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0
>>
>> The hash is 79CED459, which means that (assuming
>> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the
>> right place.
>>
>> The ENOENT return
>>
>> 2016-03-07 16:11:41.828332 7ff30cdad700 10
>> filestore(/var/lib/ceph/osd/ceph-307) remove
>>
>> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
>> = -2
>> 2016-03-07 21:44:02.197676 7fe96b56f700 10
>> filestore(/var/lib/ceph/osd/ceph-307) remove
>>
>> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
>> = -2
>>
>> actually was a symptom in this case, but, in

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-15 Thread Samuel Just
Ah, actually, I think there will be duplicates only around half the
time -- either the old link or the new link could be orphaned
depending on which xfs decides to list first.  Only if the old link is
orphaned will it match the name of the object once it's recreated.  I
should be able to find time to put together a branch in the next week
or two if you want to wait.  It's still probably worth trying removing
that object in 70.459.
-Sam

On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just <sj...@redhat.com> wrote:
> The bug is entirely independent of hardware issues -- entirely a ceph
> bug.  xfs doesn't let us specify an ordering when reading a directory,
> so we have to keep directory sizes small.  That means that when one of
> those pg collection subfolders has 320 files in it, we split it into
> up to 16 smaller directories.  Overwriting or removing an ec object
> requires us to rename the old version out of the way in case we need
> to roll back (that's the generation number I mentioned above).  For
> crash safety, this involves first creating a link to the new name,
> then removing the old one.  Both the old and new link will be in the
> same subdirectory.  If creating the new link pushes the directory to
> 320 files then we do a split while both links are present.  If the
> file in question is using the special long filename handling, then a
> bug in the resulting link juggling causes us to orphan the old version
> of the file.  Your cluster seems to have an unusual number of objects
> with very long names, which is why it is so visible on your cluster.
>
> There are critical pool sizes where the PGs will all be close to one
> of those limits.  It's possible you are not close to one of those
> limits.  It's also possible you are nearing one now.  In any case, the
> remapping gave the orphaned files an opportunity to cause trouble, but
> they don't appear due to remapping.
> -Sam
>
> On Tue, Mar 15, 2016 at 5:41 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> One more question.did we hit the bug because we had hardware issues
>> during the remapping or would it have happened regardless of the hardware
>> issues?   e.g. I'm not planning to add any additional hardware soon, but
>> would the bug pop again on an (unpatched) system not subject to any
>> remapping?
>>
>> thanks,
>> jeff
>>
>> On Tue, Mar 15, 2016 at 7:27 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> [back on list]
>>>
>>> ceph-objectstore-tool has a whole bunch of machinery for modifying an
>>> offline objectstore.  It would be the easiest place to put it -- you
>>> could add a
>>>
>>> ceph-objectstore-tool --op filestore-repair-orphan-links [--dry-run] ...
>>>
>>> command which would mount the filestore in a special mode and iterate
>>> over all collections and repair them.  If you want to go that route,
>>> we'd be happy to help you get it written.  Once it fixes your cluster,
>>> we'd then be able to merge and backport it in case anyone else hits
>>> it.
>>>
>>> You'd probably be fine doing it while the OSD is live...but as a rule
>>> I usually prefer to do my osd surgery offline.  Journal doesn't matter
>>> here, the orphaned files are basically invisible to the filestore
>>> (except when doing a collection scan for scrub) since they are in the
>>> wrong directory.
>>>
>>> I don't think the orphans are necessarily going to be 0 size.  There
>>> might be quirk of how radosgw creates these objects that always causes
>>> them to be created 0 size than then overwritten with a writefull -- if
>>> that's true it might be the case that you would only see 0 size ones.
>>> -Sam
>>>
>>> On Tue, Mar 15, 2016 at 4:02 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> wrote:
>>> > Thanks,  I can try to write a tool to do this.   Does
>>> > ceph-objectstore-tool
>>> > provide a framework?
>>> >
>>> > Can I safely delete the files while the OSD is alive or should I take it
>>> > offline?   Any concerns about the journal?
>>> >
>>> > Are there any other properties of the orphans, e.g. will the orphans
>>> > always
>>> > be size 0?
>>> >
>>> > Thanks!
>>> > Jeff
>>> >
>>> > On Tue, Mar 15, 2016 at 5:35 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> Ok, a branch merged to master which should fix this
>>> >> (https://github.com/ceph/ceph/pull/8136).  It'll be backported in due
>>> >> course.  

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-15 Thread Samuel Just
The bug is entirely independent of hardware issues -- entirely a ceph
bug.  xfs doesn't let us specify an ordering when reading a directory,
so we have to keep directory sizes small.  That means that when one of
those pg collection subfolders has 320 files in it, we split it into
up to 16 smaller directories.  Overwriting or removing an ec object
requires us to rename the old version out of the way in case we need
to roll back (that's the generation number I mentioned above).  For
crash safety, this involves first creating a link to the new name,
then removing the old one.  Both the old and new link will be in the
same subdirectory.  If creating the new link pushes the directory to
320 files then we do a split while both links are present.  If the
file in question is using the special long filename handling, then a
bug in the resulting link juggling causes us to orphan the old version
of the file.  Your cluster seems to have an unusual number of objects
with very long names, which is why it is so visible on your cluster.

There are critical pool sizes where the PGs will all be close to one
of those limits.  It's possible you are not close to one of those
limits.  It's also possible you are nearing one now.  In any case, the
remapping gave the orphaned files an opportunity to cause trouble, but
they don't appear due to remapping.
-Sam

On Tue, Mar 15, 2016 at 5:41 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> One more question.did we hit the bug because we had hardware issues
> during the remapping or would it have happened regardless of the hardware
> issues?   e.g. I'm not planning to add any additional hardware soon, but
> would the bug pop again on an (unpatched) system not subject to any
> remapping?
>
> thanks,
> jeff
>
> On Tue, Mar 15, 2016 at 7:27 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> [back on list]
>>
>> ceph-objectstore-tool has a whole bunch of machinery for modifying an
>> offline objectstore.  It would be the easiest place to put it -- you
>> could add a
>>
>> ceph-objectstore-tool --op filestore-repair-orphan-links [--dry-run] ...
>>
>> command which would mount the filestore in a special mode and iterate
>> over all collections and repair them.  If you want to go that route,
>> we'd be happy to help you get it written.  Once it fixes your cluster,
>> we'd then be able to merge and backport it in case anyone else hits
>> it.
>>
>> You'd probably be fine doing it while the OSD is live...but as a rule
>> I usually prefer to do my osd surgery offline.  Journal doesn't matter
>> here, the orphaned files are basically invisible to the filestore
>> (except when doing a collection scan for scrub) since they are in the
>> wrong directory.
>>
>> I don't think the orphans are necessarily going to be 0 size.  There
>> might be quirk of how radosgw creates these objects that always causes
>> them to be created 0 size than then overwritten with a writefull -- if
>> that's true it might be the case that you would only see 0 size ones.
>> -Sam
>>
>> On Tue, Mar 15, 2016 at 4:02 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> wrote:
>> > Thanks,  I can try to write a tool to do this.   Does
>> > ceph-objectstore-tool
>> > provide a framework?
>> >
>> > Can I safely delete the files while the OSD is alive or should I take it
>> > offline?   Any concerns about the journal?
>> >
>> > Are there any other properties of the orphans, e.g. will the orphans
>> > always
>> > be size 0?
>> >
>> > Thanks!
>> > Jeff
>> >
>> > On Tue, Mar 15, 2016 at 5:35 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> Ok, a branch merged to master which should fix this
>> >> (https://github.com/ceph/ceph/pull/8136).  It'll be backported in due
>> >> course.  The problem is that that patch won't clean orphaned files
>> >> that already exist.
>> >>
>> >> Let me explain a bit about what the orphaned files look like.  The
>> >> problem is files with object names that result in escaped filenames
>> >> longer than the max filename ceph will create (~250 iirc).  Normally,
>> >> the name of the file is an escaped and sanitized version of the object
>> >> name:
>> >>
>> >>
>> >>
>> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0
>> >>
>> >> corresponds to an object like
>> >>
>> >>
>> >>
>> >> c1dcd459/default.32567

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-15 Thread Samuel Just
[back on list]

ceph-objectstore-tool has a whole bunch of machinery for modifying an
offline objectstore.  It would be the easiest place to put it -- you
could add a

ceph-objectstore-tool --op filestore-repair-orphan-links [--dry-run] ...

command which would mount the filestore in a special mode and iterate
over all collections and repair them.  If you want to go that route,
we'd be happy to help you get it written.  Once it fixes your cluster,
we'd then be able to merge and backport it in case anyone else hits
it.

You'd probably be fine doing it while the OSD is live...but as a rule
I usually prefer to do my osd surgery offline.  Journal doesn't matter
here, the orphaned files are basically invisible to the filestore
(except when doing a collection scan for scrub) since they are in the
wrong directory.

I don't think the orphans are necessarily going to be 0 size.  There
might be quirk of how radosgw creates these objects that always causes
them to be created 0 size than then overwritten with a writefull -- if
that's true it might be the case that you would only see 0 size ones.
-Sam

On Tue, Mar 15, 2016 at 4:02 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Thanks,  I can try to write a tool to do this.   Does ceph-objectstore-tool
> provide a framework?
>
> Can I safely delete the files while the OSD is alive or should I take it
> offline?   Any concerns about the journal?
>
> Are there any other properties of the orphans, e.g. will the orphans always
> be size 0?
>
> Thanks!
> Jeff
>
> On Tue, Mar 15, 2016 at 5:35 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Ok, a branch merged to master which should fix this
>> (https://github.com/ceph/ceph/pull/8136).  It'll be backported in due
>> course.  The problem is that that patch won't clean orphaned files
>> that already exist.
>>
>> Let me explain a bit about what the orphaned files look like.  The
>> problem is files with object names that result in escaped filenames
>> longer than the max filename ceph will create (~250 iirc).  Normally,
>> the name of the file is an escaped and sanitized version of the object
>> name:
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0
>>
>> corresponds to an object like
>>
>>
>> c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70
>>
>> the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash
>> starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/
>>
>> It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the
>> longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not
>> exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full,
>> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file
>> would be moved into it).
>>
>> When the escaped filename gets too long, we truncate the filename, and
>> then append a hash and a number yielding a name like:
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>>
>> The _long at the end is always present with files like this.
>> fa202ec9b4b3b217275a is the hash of the filename.  The 0 indicates
>> that it's the 0th file with this prefix and this hash -- if there are
>> hash collisions with the same prefix, you'll see _1_ and _2_ and so on
>> to distinguish them (very very unlikely).  When the filename has been
>> truncated as with this one, you will find the full file name in the
>> attrs (attr user.cephosd.lfn3):
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> user.cephos.lfn3:
>>
>> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0
>>
>> Let's look at one of the orphaned files (the one with the same
>> file-name as the previous one, actually):
>>
>>
>> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-15 Thread Samuel Just
Ok, a branch merged to master which should fix this
(https://github.com/ceph/ceph/pull/8136).  It'll be backported in due
course.  The problem is that that patch won't clean orphaned files
that already exist.

Let me explain a bit about what the orphaned files look like.  The
problem is files with object names that result in escaped filenames
longer than the max filename ceph will create (~250 iirc).  Normally,
the name of the file is an escaped and sanitized version of the object
name:

./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0

corresponds to an object like

c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70

the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash
starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/

It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the
longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not
exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full,
DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file
would be moved into it).

When the escaped filename gets too long, we truncate the filename, and
then append a hash and a number yielding a name like:

./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long

The _long at the end is always present with files like this.
fa202ec9b4b3b217275a is the hash of the filename.  The 0 indicates
that it's the 0th file with this prefix and this hash -- if there are
hash collisions with the same prefix, you'll see _1_ and _2_ and so on
to distinguish them (very very unlikely).  When the filename has been
truncated as with this one, you will find the full file name in the
attrs (attr user.cephosd.lfn3):

./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
user.cephos.lfn3:
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0

Let's look at one of the orphaned files (the one with the same
file-name as the previous one, actually):

./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
user.cephos.lfn3:
default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0

This one has the same filename as the previous object, but is an
orphan.  What makes it an orphan is that it has hash 79CED459, but is
in ./DIR_9/DIR_5/DIR_4/DIR_D even though
./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E exists (objects-files are always at
the farthest directory from the root matching their hash).  All of the
orphans will be long-file-name objects (but most long-file-name
objects are fine and are neither orphans nor have duplicates -- it's a
fairly low occurrence bug).  In your case, I think *all* of the
orphans will probably happen to have files with duplicate names in the
correct directory -- though might not if the object had actually been
deleted since the bug happened.  When there are duplicates, the full
object names will either be the same or differ by the generation
number at the end (_0 vs 3189d_0) in this case.

Once the orphaned files are cleaned up, your cluster should be back to
normal.  If you want to wait, someone might get time to build a patch
for ceph-objectstore-tool to automate this.  You can try removing the
orphan we identified in pg 70.459 and re-scrubbing to confirm that
that fixes the pg.
-Sam

On Wed, Mar 9, 2016 at 6:58 AM, Jeffrey McDonald  wrote:
> Hi, I went back to the mon logs to see if I could illicit any additional
> information about this PG.
> Prior to 1/27/16, the deep-scrub on this OSD passes(then I see obsolete
> rollback objects found):
>
> ceph.log.4.gz:2016-01-20 09:43:36.195640 osd.307 10.31.0.67:6848/127170 538
> : cluster [INF] 70.459 deep-scrub ok
> ceph.log.4.gz:2016-01-27 09:51:49.952459 osd.307 10.31.0.67:6848/127170 583
> : cluster [INF] 70.459 deep-scrub starts
> ceph.log.4.gz:2016-01-27 

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
That doesn't sound related.  What is it?
-Sam

On Tue, Mar 8, 2016 at 12:15 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> One other oddity I've found is that ceph left 51 GB of data on each of the
> OSDs on the retired hardware.Is that by design or could it indicate some
> other problems?The PGs there seem to now be remapped elsewhere.
>
> Regards,
> Jeff
>
>
> On Tue, Mar 8, 2016 at 2:09 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> The pgs are not actually inconsistent (that is, I think that all of
>> the real objects are present and healthy).  I think each of those pgs
>> has one of these duplicate pairs confusing scrub (and also pg removal
>> -- hence your ENOTEMPTY bug).  Once we figure out what's going on,
>> you'll have to clean them up manually.  Do not repair any of these.  I
>> suggest that you simply disable scrub and ignore the inconsistent flag
>> until we have an idea of what is going on.
>> -Sam
>>
>> On Tue, Mar 8, 2016 at 12:06 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> wrote:
>> > I restarted the OSDs with the 'unfound' objects and now I have none, but
>> > I
>> > have 43 inconsistent PGs that I need to repair.I only see unfound
>> > files
>> > once issue the 'pg repair'.How do I clear out the inconsistent
>> > states?
>> >
>> >
>> > ceph -s
>> > cluster 5221cc73-869e-4c20-950f-18824ddd6692
>> >  health HEALTH_ERR
>> > 43 pgs inconsistent
>> > 3507 scrub errors
>> > noout flag(s) set
>> >  monmap e9: 3 mons at
>> >
>> > {cephmon1=10.32.16.93:6789/0,cephmon2=10.32.16.85:6789/0,cephmon3=10.32.16.89:6789/0}
>> > election epoch 112718, quorum 0,1,2
>> > cephmon2,cephmon3,cephmon1
>> >  mdsmap e11408: 1/1/1 up {0=0=up:active}
>> >  osdmap e279630: 449 osds: 449 up, 422 in
>> > flags noout
>> >   pgmap v26505719: 7788 pgs, 21 pools, 251 TB data, 88784 kobjects
>> > 412 TB used, 2777 TB / 3190 TB avail
>> > 7731 active+clean
>> >   43 active+clean+inconsistent
>> >7 active+clean+scrubbing+deep
>> >7 active+clean+scrubbing
>> >
>> > Jeff
>> >
>> > On Tue, Mar 8, 2016 at 2:00 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> Yeah, that procedure should have isolated any filesystem issues.  Are
>> >> there still unfound objects?
>> >> -sam
>> >>
>> >
>> > --
>> >
>> > Jeffrey McDonald, PhD
>> > Assistant Director for HPC Operations
>> > Minnesota Supercomputing Institute
>> > University of Minnesota Twin Cities
>> > 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
>> > 117 Pleasant St SE   phone: +1 612 625-6905
>> > Minneapolis, MN 55455fax:   +1 612 624-8861
>> >
>> >
>
>
>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
Yep

On Tue, Mar 8, 2016 at 12:08 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Will do, I'm in the midst of the xfs filesystem check so it will be a bit
> before I have the filesystem back.
> jeff
>
> On Tue, Mar 8, 2016 at 2:05 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> There are 3 other example pairs on that osd.  Can you gather the same
>> information about these as well?
>>
>>
>> ./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_D/DIR_F/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._873db704e96860fddf29_0_long
>>
>> ./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._873db704e96860fddf29_0_long
>>
>>
>> ./70.ebs1_head/DIR_B/DIR_E/DIR_0/DIR_6/DIR_C/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._f23131e55994b3cd542a_0_long
>>
>> ./70.ebs1_head/DIR_B/DIR_E/DIR_0/DIR_6/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._f23131e55994b3cd542a_0_long
>>
>>
>> ./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_9/DIR_C/default.724733.17\u\ushadow\uprostate\srnaseq\s41fde786-cbfb-4c11-8696-fe20f90f062a\sUNCID\u2256480.79922276-6e00-4f17-851f-554f52f71520.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u2\uACTTGA.tar.gz.2~TWfBUlEkk4EPH4u\uNkMjmz65CRSkJA3._215ce1442b16dc173b77_0_long
>>
>> ./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_9/default.724733.17\u\ushadow\uprostate\srnaseq\s41fde786-cbfb-4c11-8696-fe20f90f062a\sUNCID\u2256480.79922276-6e00-4f17-851f-554f52f71520.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u2\uACTTGA.tar.gz.2~TWfBUlEkk4EPH4u\uNkMjmz65CRSkJA3._215ce1442b16dc173b77_0_long
>>
>> On Tue, Mar 8, 2016 at 12:00 PM, Samuel Just <sj...@redhat.com> wrote:
>> > Yeah, that procedure should have isolated any filesystem issues.  Are
>> > there still unfound objects?
>> > -sam
>> >
>> > On Tue, Mar 8, 2016 at 11:58 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>> > wrote:
>> >> Yes, I used the crushmap, set them to 0, then let ceph migrate/remap
>> >> them
>> >> over.   I controlled the tempo of the move by allowing fewer backfill
>> >> threads.
>> >>
>> >> Jeff
>> >>
>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
The pgs are not actually inconsistent (that is, I think that all of
the real objects are present and healthy).  I think each of those pgs
has one of these duplicate pairs confusing scrub (and also pg removal
-- hence your ENOTEMPTY bug).  Once we figure out what's going on,
you'll have to clean them up manually.  Do not repair any of these.  I
suggest that you simply disable scrub and ignore the inconsistent flag
until we have an idea of what is going on.
-Sam

On Tue, Mar 8, 2016 at 12:06 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> I restarted the OSDs with the 'unfound' objects and now I have none, but I
> have 43 inconsistent PGs that I need to repair.I only see unfound files
> once issue the 'pg repair'.How do I clear out the inconsistent states?
>
>
> ceph -s
> cluster 5221cc73-869e-4c20-950f-18824ddd6692
>  health HEALTH_ERR
> 43 pgs inconsistent
> 3507 scrub errors
> noout flag(s) set
>  monmap e9: 3 mons at
> {cephmon1=10.32.16.93:6789/0,cephmon2=10.32.16.85:6789/0,cephmon3=10.32.16.89:6789/0}
> election epoch 112718, quorum 0,1,2 cephmon2,cephmon3,cephmon1
>  mdsmap e11408: 1/1/1 up {0=0=up:active}
>  osdmap e279630: 449 osds: 449 up, 422 in
> flags noout
>   pgmap v26505719: 7788 pgs, 21 pools, 251 TB data, 88784 kobjects
> 412 TB used, 2777 TB / 3190 TB avail
> 7731 active+clean
>   43 active+clean+inconsistent
>7 active+clean+scrubbing+deep
>    7 active+clean+scrubbing
>
> Jeff
>
> On Tue, Mar 8, 2016 at 2:00 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Yeah, that procedure should have isolated any filesystem issues.  Are
>> there still unfound objects?
>> -sam
>>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library   email: jeffrey.mcdon...@msi.umn.edu
> 117 Pleasant St SE   phone: +1 612 625-6905
> Minneapolis, MN 55455fax:   +1 612 624-8861
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
There are 3 other example pairs on that osd.  Can you gather the same
information about these as well?

./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_D/DIR_F/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._873db704e96860fddf29_0_long
./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._873db704e96860fddf29_0_long

./70.ebs1_head/DIR_B/DIR_E/DIR_0/DIR_6/DIR_C/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._f23131e55994b3cd542a_0_long
./70.ebs1_head/DIR_B/DIR_E/DIR_0/DIR_6/default.724733.17\u\ushadow\uprostate\srnaseq\s17b8049d-df3b-4891-875c-b2a077f2af7a\sUNCID\u2256540.b38350db-b102-4a38-9edb-0089ca95840b.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u1\uGATCAG.tar.gz.2~xP5Un7rh\uPfntQTg0FIDqxSILV61nk1._f23131e55994b3cd542a_0_long

./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_9/DIR_C/default.724733.17\u\ushadow\uprostate\srnaseq\s41fde786-cbfb-4c11-8696-fe20f90f062a\sUNCID\u2256480.79922276-6e00-4f17-851f-554f52f71520.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u2\uACTTGA.tar.gz.2~TWfBUlEkk4EPH4u\uNkMjmz65CRSkJA3._215ce1442b16dc173b77_0_long
./70.3bs3_head/DIR_B/DIR_3/DIR_0/DIR_9/default.724733.17\u\ushadow\uprostate\srnaseq\s41fde786-cbfb-4c11-8696-fe20f90f062a\sUNCID\u2256480.79922276-6e00-4f17-851f-554f52f71520.130723\uUNC9-SN296\u0386\uBC2E4WACXX\u2\uACTTGA.tar.gz.2~TWfBUlEkk4EPH4u\uNkMjmz65CRSkJA3._215ce1442b16dc173b77_0_long

On Tue, Mar 8, 2016 at 12:00 PM, Samuel Just <sj...@redhat.com> wrote:
> Yeah, that procedure should have isolated any filesystem issues.  Are
> there still unfound objects?
> -sam
>
> On Tue, Mar 8, 2016 at 11:58 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> Yes, I used the crushmap, set them to 0, then let ceph migrate/remap them
>> over.   I controlled the tempo of the move by allowing fewer backfill
>> threads.
>>
>> Jeff
>>
>> On Tue, Mar 8, 2016 at 1:56 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> By "migrated", you mean you marked them out one at a time and let ceph
>>> recover them over to new nodes?
>>> -Sam
>>>
>>> On Tue, Mar 8, 2016 at 11:55 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> wrote:
>>> > No its not...historical reasons.ceph[1-3] were the nodes that
>>> > were
>>> > retired.   ceph0[1-9] are all new hardware.
>>> > jeff
>>> >
>>> > On Tue, Mar 8, 2016 at 1:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> ceph3 is not the same host as ceph03?
>>> >> -Sam
>>> >>
>>> >> On Tue, Mar 8, 2016 at 11:48 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >> wrote:
>>> >> > Hi Sam,
>>> >> >
>>> >> > 1) Are those two hardlinks to the same file? No:
>>> >> >
>>> >> > # find . -name '*fa202ec9b4b3b217275a*' -exec ls -ltr {} +
>>> >> > -rw-r--r-- 1 root root   0 Jan 23 21:49
>>> >> >
>>> >> >
>>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>>> >> > -rw-r--r-- 1 root root 1048576 Jan 23 21:49
>>> >> >
>>> >> >
>>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>>> >> >
>>> >> > one has a zero size.
>>> >> >
>>> >> >  find . -name '*fa202ec9b4b3b217275a*' -exec lsattr {} +
>>> >> > 
>>> >> >
>>> >> >
>>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
Yeah, that procedure should have isolated any filesystem issues.  Are
there still unfound objects?
-sam

On Tue, Mar 8, 2016 at 11:58 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Yes, I used the crushmap, set them to 0, then let ceph migrate/remap them
> over.   I controlled the tempo of the move by allowing fewer backfill
> threads.
>
> Jeff
>
> On Tue, Mar 8, 2016 at 1:56 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> By "migrated", you mean you marked them out one at a time and let ceph
>> recover them over to new nodes?
>> -Sam
>>
>> On Tue, Mar 8, 2016 at 11:55 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>> wrote:
>> > No its not...historical reasons.ceph[1-3] were the nodes that
>> > were
>> > retired.   ceph0[1-9] are all new hardware.
>> > jeff
>> >
>> > On Tue, Mar 8, 2016 at 1:52 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> ceph3 is not the same host as ceph03?
>> >> -Sam
>> >>
>> >> On Tue, Mar 8, 2016 at 11:48 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >> wrote:
>> >> > Hi Sam,
>> >> >
>> >> > 1) Are those two hardlinks to the same file? No:
>> >> >
>> >> > # find . -name '*fa202ec9b4b3b217275a*' -exec ls -ltr {} +
>> >> > -rw-r--r-- 1 root root   0 Jan 23 21:49
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> > -rw-r--r-- 1 root root 1048576 Jan 23 21:49
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> >
>> >> > one has a zero size.
>> >> >
>> >> >  find . -name '*fa202ec9b4b3b217275a*' -exec lsattr {} +
>> >> > 
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> > 
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> >
>> >> > 2) What size is/are it/they?
>> >> >
>> >> > The first has size 0, the second has size 1048576
>> >> >
>> >> > 3) Can you tarball it/them up with their xattrs and get it to me?
>> >> > yes,  find . -name '*fa202ec9b4 find . -name '*fa202ec9b4b3b217275a*'
>> >> > -exec
>> >> > tar zcvp --xattrs -f /tmp/suspectfiles.tgz  {} +
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\\u\\ushadow\\uprostate\\srnaseq\\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\\sUNCID\\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\\uUNC14-SN744\\u0400\\uAC3LWGACXX\\u7\\uGAGTGG.tar.gz.2~\\u1r\\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\\u\\ushadow\\uprostate\\srnaseq\\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\\sUNCID\\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\\uUNC14-SN744\\u0400\\uAC3LWGACXX\\u7\\uGAGTGG.tar.gz.2~\\u1r\\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >> > b3b217275a*' -exec tar zcvp --xattrs -f /tmp/suspectfiles.tgz  {} +
>> >> >
>> >> > Files are located at:
>> >> > https://drive.google.com/open?id=0Bzz8TrxFvfemYkI1WkdsQ19ScFk
>> >> >
>> >> >
>> >> > find . -name '*fa202ec9b4b3b217275a*' -exec xattr -l {} +
>> >> >
>> >> >
>> >> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.72473

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
By "migrated", you mean you marked them out one at a time and let ceph
recover them over to new nodes?
-Sam

On Tue, Mar 8, 2016 at 11:55 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> No its not...historical reasons.ceph[1-3] were the nodes that were
> retired.   ceph0[1-9] are all new hardware.
> jeff
>
> On Tue, Mar 8, 2016 at 1:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> ceph3 is not the same host as ceph03?
>> -Sam
>>
>> On Tue, Mar 8, 2016 at 11:48 AM, Jeffrey McDonald <jmcdo...@umn.edu>
>> wrote:
>> > Hi Sam,
>> >
>> > 1) Are those two hardlinks to the same file? No:
>> >
>> > # find . -name '*fa202ec9b4b3b217275a*' -exec ls -ltr {} +
>> > -rw-r--r-- 1 root root   0 Jan 23 21:49
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> > -rw-r--r-- 1 root root 1048576 Jan 23 21:49
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >
>> > one has a zero size.
>> >
>> >  find . -name '*fa202ec9b4b3b217275a*' -exec lsattr {} +
>> > 
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> > 
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >
>> > 2) What size is/are it/they?
>> >
>> > The first has size 0, the second has size 1048576
>> >
>> > 3) Can you tarball it/them up with their xattrs and get it to me?
>> > yes,  find . -name '*fa202ec9b4 find . -name '*fa202ec9b4b3b217275a*'
>> > -exec
>> > tar zcvp --xattrs -f /tmp/suspectfiles.tgz  {} +
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\\u\\ushadow\\uprostate\\srnaseq\\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\\sUNCID\\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\\uUNC14-SN744\\u0400\\uAC3LWGACXX\\u7\\uGAGTGG.tar.gz.2~\\u1r\\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\\u\\ushadow\\uprostate\\srnaseq\\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\\sUNCID\\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\\uUNC14-SN744\\u0400\\uAC3LWGACXX\\u7\\uGAGTGG.tar.gz.2~\\u1r\\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>> > b3b217275a*' -exec tar zcvp --xattrs -f /tmp/suspectfiles.tgz  {} +
>> >
>> > Files are located at:
>> > https://drive.google.com/open?id=0Bzz8TrxFvfemYkI1WkdsQ19ScFk
>> >
>> >
>> > find . -name '*fa202ec9b4b3b217275a*' -exec xattr -l {} +
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> > user.cephos.lfn3:
>> >
>> > default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
>> > user.cephos.spill_out:
>> >    30 00  0.
>> >
>> >
>> > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-8804

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
F 59 D4 CE 79 00 00 00 00Y..y
> 0100   00 46 00 00 00 00 00 00 00 06 03 1C 00 00 00 46.F.F
> 0110   00 00 00 00 00 00 00 FF FF FF FF 00 00 00 00 00
> 0120   00 00 00 FF FF FF FF FF FF FF FF 00 00 00 00 9C
> 0130   18 03 00 00 00 00 00 EC FD 01 00 00 00 00 00 00
> 0140   00 00 00 00 00 00 00 02 02 15 00 00 00 08 01 DF
> 0150   0A 00 00 00 00 00 89 FE A7 01 00 00 00 00 00 00
> 0160   00 00 00 00 00 00 00 00 00 00 D4 49 A4 56 9F B6...I.V..
> 0170   08 05 02 02 15 00 00 00 00 00 00 00 00 00 00 00
> 0180   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0190   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 01A0   00 00 78 C6 2E 00 00 00 00 00 00 00 00 00 00 00..x.
> 01B0   00 00 00 34 00 00 00 D4 49 A4 56 7C A9 0F 06 FF...4I.V|
> 01C0   FF FF FF FF FF FF FF   ...
>
> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
> user.ceph.snapset:
>    02 02 19 00 00 00 00 00 00 00 00 00 00 00 01 00
> 0010   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ...
>
> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
> user.ceph.hinfo_key:
>    01 01 24 00 00 00 00 00 00 00 00 00 00 00 06 00..$.
> 0010   00 00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF
> 0020   FF FF FF FF FF FF FF FF FF FF  ..
>
> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long:
> user.cephos.seq:
>    01 01 10 00 00 00 BE 26 65 00 00 00 00 00 00 00...
> 0010   00 00 03 00 00 00 01   ...
>
>
> 4) Has anything unusual ever happened to the host which osd.307 is on?
> Particularly a power loss?
>
> I don't recall anything.   A couple of times the data center overheated (air
> ) but these nodes are in a water-cooled enclosure and were OK.   What I did
> have is stability issues with the older hardware (ceph1,ceph2,ceph3) where
> there weren't outright power failures but frequent system problems where the
> systems ran out of memory and became wedged.  Its likely that this PG was
> migrated from there.   Would migration preserve this problem?
>
> 5) Can you do an xfs fsck on osd.307's filesystem? Will do.   I will report
> back shortly.
>
> Jeff
>
> On Tue, Mar 8, 2016 at 1:12 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> So, I did turn up something interesting.  There is an object with two
>> files (one in an invalid directory):
>>
>> ~/Downloads [deepthought●] » grep 'fa202ec9b4b3b217275a' dir.filtered
>>
>> ./70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>>
>> ./70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long
>>
>> That file shows up twice, once in DIR_9/DIR_5/DIR_4/DIR_D and once in
>> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E.  The instance of it in
>> DIR_9/DIR_5/DIR_4/DIR_D is causing scrub to return extra objects.
>> Filestore also appears to be unable to delete it:
>>
>> 2016-03-07 21:44:02.193606 7fe96b56f700 15
>> filestore(/var/lib/ceph/osd/ceph-307) remove
>>
>> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
>> 2016-03-07 21:44:02.197676 7fe96b56f700 10
>> filestore(/var/lib/ceph/osd/ceph-307) rem

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-08 Thread Samuel Just
Oh, for the pg with unfound objects, restart the primary, that should fix it.
-Sam

On Tue, Mar 8, 2016 at 6:44 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Resent to ceph-users to be under the message size limit
>
> On Tue, Mar 8, 2016 at 6:16 AM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>
>> OK, this is  done and I've observed the state change of 70.459 from
>> active+clean to active+clean+inconsistent after the first scrub.
>>
>> Files attached:  bash script of commands (setuposddebug.bash), log script
>> from the script (setuposddebug.log), three pg queries, one at the start, one
>> at the end of the first scrub, one at the end of the second scrub.
>>
>> At this point, I now have 27 active+clean+inconsistent PGs.While I'm
>> not too concerned about how they are labeled, clients cannot extract objects
>> which are in the  PGs and are labeled as unfound.Its important for us to
>> maintain user confidence in the system so I need a fix as soon as
>> possible..
>>
>> The log files from the OSDs are here:
>>
>> https://drive.google.com/open?id=0Bzz8TrxFvfembkt2XzlCZFVJZFU
>>
>> Thanks,
>> Jeff
>>
>> On Mon, Mar 7, 2016 at 7:26 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Yep, just as before.  Actually, do it twice (wait for 'scrubbing' to
>>> go away each time).
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 5:25 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> wrote:
>>> > Just to be sure I grab what you need:
>>> >
>>> > 1- set debug logs for the pg 70.459
>>> > 2 - Issue a deep-scrub ceph pg deep-scrub 70.459
>>> > 3- stop once the 70.459 pg goes inconsistent?
>>> >
>>> > Thanks,
>>> > Jeff
>>> >
>>> >
>>> > On Mon, Mar 7, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> Hmm, I'll look into this a bit more tomorrow.  Can you get the tree
>>> >> structure of the 70.459 pg directory on osd.307 (find . will do fine).
>>> >> -Sam
>>> >>
>>> >> On Mon, Mar 7, 2016 at 4:50 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >> wrote:
>>> >> > 307 is on ceph03.
>>> >> > Jeff
>>> >> >
>>> >> > On Mon, Mar 7, 2016 at 6:48 PM, Samuel Just <sj...@redhat.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Which node is osd.307 on?
>>> >> >> -Sam
>>> >> >>
>>> >> >> On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com>
>>> >> >> wrote:
>>> >> >> > ' I didn't see the errors in the tracker on the new nodes, but
>>> >> >> > they
>>> >> >> > were only receiving new data, not migrating it.' -- What do you
>>> >> >> > mean
>>> >> >> > by that?
>>> >> >> > -Sam
>>> >> >> >
>>> >> >> > On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald
>>> >> >> > <jmcdo...@umn.edu>
>>> >> >> > wrote:
>>> >> >> >> The filesystem is xfs everywhere, there are nine hosts.   The
>>> >> >> >> two
>>> >> >> >> new
>>> >> >> >> ceph
>>> >> >> >> nodes 08, 09 have a new kernel.I didn't see the errors in
>>> >> >> >> the
>>> >> >> >> tracker on
>>> >> >> >> the new nodes, but they were only receiving new data, not
>>> >> >> >> migrating
>>> >> >> >> it.
>>> >> >> >> Jeff
>>> >> >> >>
>>> >> >> >> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> >> 22:08:27
>>> >> >> >> UTC
>>> >> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> >> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> >> 22:08:27
>>> >> >> >> UTC
>>> >> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> >> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> &

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Nevermind on http://tracker.ceph.com/issues/14766 , OSD::remove_dir
uses the right collection_list_partial.
-Sam

On Mon, Mar 7, 2016 at 5:26 PM, Samuel Just <sj...@redhat.com> wrote:
> Yep, just as before.  Actually, do it twice (wait for 'scrubbing' to
> go away each time).
> -Sam
>
> On Mon, Mar 7, 2016 at 5:25 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> Just to be sure I grab what you need:
>>
>> 1- set debug logs for the pg 70.459
>> 2 - Issue a deep-scrub ceph pg deep-scrub 70.459
>> 3- stop once the 70.459 pg goes inconsistent?
>>
>> Thanks,
>> Jeff
>>
>>
>> On Mon, Mar 7, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Hmm, I'll look into this a bit more tomorrow.  Can you get the tree
>>> structure of the 70.459 pg directory on osd.307 (find . will do fine).
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 4:50 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>> > 307 is on ceph03.
>>> > Jeff
>>> >
>>> > On Mon, Mar 7, 2016 at 6:48 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> Which node is osd.307 on?
>>> >> -Sam
>>> >>
>>> >> On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >> > ' I didn't see the errors in the tracker on the new nodes, but they
>>> >> > were only receiving new data, not migrating it.' -- What do you mean
>>> >> > by that?
>>> >> > -Sam
>>> >> >
>>> >> > On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >> > wrote:
>>> >> >> The filesystem is xfs everywhere, there are nine hosts.   The two
>>> >> >> new
>>> >> >> ceph
>>> >> >> nodes 08, 09 have a new kernel.I didn't see the errors in the
>>> >> >> tracker on
>>> >> >> the new nodes, but they were only receiving new data, not migrating
>>> >> >> it.
>>> >> >> Jeff
>>> >> >>
>>> >> >> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC
>>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC
>>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC
>>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri
>>> >> >> Feb
>>> >> >> 26
>>> >> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> >> 22:08:27
>>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> >> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri
>>> >> >> Feb
>>> >> >> 26
>>> >> >> 22:02:58 UTC 2016 x

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Yep, just as before.  Actually, do it twice (wait for 'scrubbing' to
go away each time).
-Sam

On Mon, Mar 7, 2016 at 5:25 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> Just to be sure I grab what you need:
>
> 1- set debug logs for the pg 70.459
> 2 - Issue a deep-scrub ceph pg deep-scrub 70.459
> 3- stop once the 70.459 pg goes inconsistent?
>
> Thanks,
> Jeff
>
>
> On Mon, Mar 7, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Hmm, I'll look into this a bit more tomorrow.  Can you get the tree
>> structure of the 70.459 pg directory on osd.307 (find . will do fine).
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 4:50 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> > 307 is on ceph03.
>> > Jeff
>> >
>> > On Mon, Mar 7, 2016 at 6:48 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> Which node is osd.307 on?
>> >> -Sam
>> >>
>> >> On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com> wrote:
>> >> > ' I didn't see the errors in the tracker on the new nodes, but they
>> >> > were only receiving new data, not migrating it.' -- What do you mean
>> >> > by that?
>> >> > -Sam
>> >> >
>> >> > On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >> > wrote:
>> >> >> The filesystem is xfs everywhere, there are nine hosts.   The two
>> >> >> new
>> >> >> ceph
>> >> >> nodes 08, 09 have a new kernel.I didn't see the errors in the
>> >> >> tracker on
>> >> >> the new nodes, but they were only receiving new data, not migrating
>> >> >> it.
>> >> >> Jeff
>> >> >>
>> >> >> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC
>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC
>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC
>> >> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri
>> >> >> Feb
>> >> >> 26
>> >> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> >> 22:08:27
>> >> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri
>> >> >> Feb
>> >> >> 26
>> >> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >> >>
>> >> >>
>> >> >> On Mon, Mar 7, 2016 at 6:39 PM, Samuel Just <sj...@redhat.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> What filesystem and kernel are you running on the osds?  This (and
>> >> >>> your other bug, actually) could be explained by some kind of weird
>> >> >>> kernel readdir behavior.
>> >> >>

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
On the plus side, I think I figured out http://tracker.ceph.com/issues/14766.
-Sam

On Mon, Mar 7, 2016 at 4:52 PM, Samuel Just <sj...@redhat.com> wrote:
> Hmm, I'll look into this a bit more tomorrow.  Can you get the tree
> structure of the 70.459 pg directory on osd.307 (find . will do fine).
> -Sam
>
> On Mon, Mar 7, 2016 at 4:50 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> 307 is on ceph03.
>> Jeff
>>
>> On Mon, Mar 7, 2016 at 6:48 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Which node is osd.307 on?
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com> wrote:
>>> > ' I didn't see the errors in the tracker on the new nodes, but they
>>> > were only receiving new data, not migrating it.' -- What do you mean
>>> > by that?
>>> > -Sam
>>> >
>>> > On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> > wrote:
>>> >> The filesystem is xfs everywhere, there are nine hosts.   The two new
>>> >> ceph
>>> >> nodes 08, 09 have a new kernel.I didn't see the errors in the
>>> >> tracker on
>>> >> the new nodes, but they were only receiving new data, not migrating it.
>>> >> Jeff
>>> >>
>>> >> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>>> >> UTC
>>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>>> >> UTC
>>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>>> >> UTC
>>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb
>>> >> 26
>>> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>>> >> 22:08:27
>>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> >> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb
>>> >> 26
>>> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >>
>>> >> On Mon, Mar 7, 2016 at 6:39 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>>
>>> >>> What filesystem and kernel are you running on the osds?  This (and
>>> >>> your other bug, actually) could be explained by some kind of weird
>>> >>> kernel readdir behavior.
>>> >>> -Sam
>>> >>>
>>> >>> On Mon, Mar 7, 2016 at 4:36 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>> > Hmm, so much for that theory, still looking.  If you can produce
>>> >>> > another set of logs (as before) from scrubbing that pg, it might
>>> >>> > help.
>>> >>> > -Sam
>>> >>> >
>>> >>> > On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >>> > wrote:
>>> >>> >> they're all the same.see attached.
>>> >>> >>
>>> >>> >> On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Have you confirmed th

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Hmm, I'll look into this a bit more tomorrow.  Can you get the tree
structure of the 70.459 pg directory on osd.307 (find . will do fine).
-Sam

On Mon, Mar 7, 2016 at 4:50 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> 307 is on ceph03.
> Jeff
>
> On Mon, Mar 7, 2016 at 6:48 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Which node is osd.307 on?
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com> wrote:
>> > ' I didn't see the errors in the tracker on the new nodes, but they
>> > were only receiving new data, not migrating it.' -- What do you mean
>> > by that?
>> > -Sam
>> >
>> > On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> > wrote:
>> >> The filesystem is xfs everywhere, there are nine hosts.   The two new
>> >> ceph
>> >> nodes 08, 09 have a new kernel.I didn't see the errors in the
>> >> tracker on
>> >> the new nodes, but they were only receiving new data, not migrating it.
>> >> Jeff
>> >>
>> >> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> >> UTC
>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> >> UTC
>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> >> UTC
>> >> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb
>> >> 26
>> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2
>> >> 22:08:27
>> >> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> >> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb
>> >> 26
>> >> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >>
>> >> On Mon, Mar 7, 2016 at 6:39 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> What filesystem and kernel are you running on the osds?  This (and
>> >>> your other bug, actually) could be explained by some kind of weird
>> >>> kernel readdir behavior.
>> >>> -Sam
>> >>>
>> >>> On Mon, Mar 7, 2016 at 4:36 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>> > Hmm, so much for that theory, still looking.  If you can produce
>> >>> > another set of logs (as before) from scrubbing that pg, it might
>> >>> > help.
>> >>> > -Sam
>> >>> >
>> >>> > On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >>> > wrote:
>> >>> >> they're all the same.see attached.
>> >>> >>
>> >>> >> On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Have you confirmed the versions?
>> >>> >>> -Sam
>> >>> >>>
>> >>> >>> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald
>> >>> >>> <jmcdo...@umn.edu>
>> >>> >>> wrote:
>> >>> >>> > I have one other very strange event happening, I've opened a
>> >>> >>> > ticket
>> >>> >>> > on
>> >>> >>> > it:
>> &

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Which node is osd.307 on?
-Sam

On Mon, Mar 7, 2016 at 4:43 PM, Samuel Just <sj...@redhat.com> wrote:
> ' I didn't see the errors in the tracker on the new nodes, but they
> were only receiving new data, not migrating it.' -- What do you mean
> by that?
> -Sam
>
> On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> The filesystem is xfs everywhere, there are nine hosts.   The two new ceph
>> nodes 08, 09 have a new kernel.I didn't see the errors in the tracker on
>> the new nodes, but they were only receiving new data, not migrating it.
>> Jeff
>>
>> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
>> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
>> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
>> 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26
>> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26
>> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> On Mon, Mar 7, 2016 at 6:39 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> What filesystem and kernel are you running on the osds?  This (and
>>> your other bug, actually) could be explained by some kind of weird
>>> kernel readdir behavior.
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 4:36 PM, Samuel Just <sj...@redhat.com> wrote:
>>> > Hmm, so much for that theory, still looking.  If you can produce
>>> > another set of logs (as before) from scrubbing that pg, it might help.
>>> > -Sam
>>> >
>>> > On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> > wrote:
>>> >> they're all the same.see attached.
>>> >>
>>> >> On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>>
>>> >>> Have you confirmed the versions?
>>> >>> -Sam
>>> >>>
>>> >>> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >>> wrote:
>>> >>> > I have one other very strange event happening, I've opened a ticket
>>> >>> > on
>>> >>> > it:
>>> >>> > http://tracker.ceph.com/issues/14766
>>> >>> >
>>> >>> > During this migration, OSD failed probably over 400 times while
>>> >>> > moving
>>> >>> > data
>>> >>> > around.   We move the empty directories and restarted the OSDs.    I
>>> >>> > can't
>>> >>> > say if this is related--I have no reason to suspect it is.
>>> >>> >
>>> >>> > Jeff
>>> >>> >
>>> >>> > On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> What could cause this kind of unexpected behaviour?
>>> >>> >> Any assumption??
>>> >>> >> Sorry for interrupting you.
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> S
>>> >>> >>
>>> >>> >> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com>
>>> >>> >> wrote:
>>> >>> >> > Hmm, at the end of the log, the pg is still inconsistent.

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
' I didn't see the errors in the tracker on the new nodes, but they
were only receiving new data, not migrating it.' -- What do you mean
by that?
-Sam

On Mon, Mar 7, 2016 at 4:42 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> The filesystem is xfs everywhere, there are nine hosts.   The two new ceph
> nodes 08, 09 have a new kernel.I didn't see the errors in the tracker on
> the new nodes, but they were only receiving new data, not migrating it.
> Jeff
>
> ceph2: Linux ceph2 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
> 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph1: Linux ceph1 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
> 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph3: Linux ceph3 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC
> 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph03: Linux ceph03 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph01: Linux ceph01 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph02: Linux ceph02 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph06: Linux ceph06 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph05: Linux ceph05 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph04: Linux ceph04 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph08: Linux ceph08 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26
> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> ceph07: Linux ceph07 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> ceph09: Linux ceph09 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26
> 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
>
> On Mon, Mar 7, 2016 at 6:39 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> What filesystem and kernel are you running on the osds?  This (and
>> your other bug, actually) could be explained by some kind of weird
>> kernel readdir behavior.
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 4:36 PM, Samuel Just <sj...@redhat.com> wrote:
>> > Hmm, so much for that theory, still looking.  If you can produce
>> > another set of logs (as before) from scrubbing that pg, it might help.
>> > -Sam
>> >
>> > On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> > wrote:
>> >> they're all the same.see attached.
>> >>
>> >> On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> Have you confirmed the versions?
>> >>> -Sam
>> >>>
>> >>> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >>> wrote:
>> >>> > I have one other very strange event happening, I've opened a ticket
>> >>> > on
>> >>> > it:
>> >>> > http://tracker.ceph.com/issues/14766
>> >>> >
>> >>> > During this migration, OSD failed probably over 400 times while
>> >>> > moving
>> >>> > data
>> >>> > around.   We move the empty directories and restarted the OSDs.I
>> >>> > can't
>> >>> > say if this is related--I have no reason to suspect it is.
>> >>> >
>> >>> > Jeff
>> >>> >
>> >>> > On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> What could cause this kind of unexpected behaviour?
>> >>> >> Any assumption??
>> >>> >> Sorry for interrupting you.
>> >>> >>
>> >>> >> Cheers,
>> >>> >> S
>> >>> >>
>> >>> >> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com>
>> >>> >> wrote:
>> >>> >> > Hmm, at the end of the log, the pg is still inconsistent.  Can
>> >>> >> > you
>> >>> >> > attach a ceph pg query on that pg?
>> >>> >> > -Sam
>> >>> >> >
>> >>> >> > On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com>
>> >>> >> > wrote:
>> >>> >> >> If so, that strongly suggests that the pg was actually never
>> &

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
I'd rather you just scrubbed the same pg with the same osds and the
same debugging.
-Sam

On Mon, Mar 7, 2016 at 4:40 PM, Samuel Just <sj...@redhat.com> wrote:
> Yes, the unfound objects are due to a bug in the repair command.  I
> suggest you don't repair anything, actually.  I don't think any of the
> pgs are actually inconsistent.  The log I have here is definitely a
> case of two objects showing up as inconsistent which are actually
> fine.
> -Sam
>
> On Mon, Mar 7, 2016 at 4:38 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> I have three more pgs now showing up as inconsistent.   Should I turn on
>> debug for all OSDs to capture the transition from active+clean ->
>> inconsistent?Do I understand correction that the repair cause the
>> 'unfound' objects because of a bug in the repair command?
>> Regards,
>> Jeff
>>
>>
>>
>> On Mon, Mar 7, 2016 at 6:36 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Hmm, so much for that theory, still looking.  If you can produce
>>> another set of logs (as before) from scrubbing that pg, it might help.
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>> > they're all the same.see attached.
>>> >
>>> > On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >>
>>> >> Have you confirmed the versions?
>>> >> -Sam
>>> >>
>>> >> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >> wrote:
>>> >> > I have one other very strange event happening, I've opened a ticket
>>> >> > on
>>> >> > it:
>>> >> > http://tracker.ceph.com/issues/14766
>>> >> >
>>> >> > During this migration, OSD failed probably over 400 times while
>>> >> > moving
>>> >> > data
>>> >> > around.   We move the empty directories and restarted the OSDs.I
>>> >> > can't
>>> >> > say if this is related--I have no reason to suspect it is.
>>> >> >
>>> >> > Jeff
>>> >> >
>>> >> > On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> What could cause this kind of unexpected behaviour?
>>> >> >> Any assumption??
>>> >> >> Sorry for interrupting you.
>>> >> >>
>>> >> >> Cheers,
>>> >> >> S
>>> >> >>
>>> >> >> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com>
>>> >> >> wrote:
>>> >> >> > Hmm, at the end of the log, the pg is still inconsistent.  Can you
>>> >> >> > attach a ceph pg query on that pg?
>>> >> >> > -Sam
>>> >> >> >
>>> >> >> > On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com>
>>> >> >> > wrote:
>>> >> >> >> If so, that strongly suggests that the pg was actually never
>>> >> >> >> inconsistent in the first place and that the bug is in scrub
>>> >> >> >> itself
>>> >> >> >> presumably getting confused about an object during a write.  The
>>> >> >> >> next
>>> >> >> >> step would be to get logs like the above from a pg as it scrubs
>>> >> >> >> transitioning from clean to inconsistent.  If it's really a race
>>> >> >> >> between scrub and a write, it's probably just non-deterministic,
>>> >> >> >> you
>>> >> >> >> could set logging on a set of osds and continuously scrub any pgs
>>> >> >> >> which only map to those osds until you reproduce the problem.
>>> >> >> >> -Sam
>>> >> >> >>
>>> >> >> >> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com>
>>> >> >> >> wrote:
>>> >> >> >>> So after the scrub, it came up clean?  The inconsistent/missing
>>> >> >> >>> objects reappeared?
>>> >> >> >>> -Sam
>>> >> >> >>>
>>> >> >> >>> On Mon

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Yes, the unfound objects are due to a bug in the repair command.  I
suggest you don't repair anything, actually.  I don't think any of the
pgs are actually inconsistent.  The log I have here is definitely a
case of two objects showing up as inconsistent which are actually
fine.
-Sam

On Mon, Mar 7, 2016 at 4:38 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> I have three more pgs now showing up as inconsistent.   Should I turn on
> debug for all OSDs to capture the transition from active+clean ->
> inconsistent?Do I understand correction that the repair cause the
> 'unfound' objects because of a bug in the repair command?
> Regards,
> Jeff
>
>
>
> On Mon, Mar 7, 2016 at 6:36 PM, Samuel Just <sj...@redhat.com> wrote:
>>
>> Hmm, so much for that theory, still looking.  If you can produce
>> another set of logs (as before) from scrubbing that pg, it might help.
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> > they're all the same.see attached.
>> >
>> > On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>
>> >> Have you confirmed the versions?
>> >> -Sam
>> >>
>> >> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >> wrote:
>> >> > I have one other very strange event happening, I've opened a ticket
>> >> > on
>> >> > it:
>> >> > http://tracker.ceph.com/issues/14766
>> >> >
>> >> > During this migration, OSD failed probably over 400 times while
>> >> > moving
>> >> > data
>> >> > around.   We move the empty directories and restarted the OSDs.I
>> >> > can't
>> >> > say if this is related--I have no reason to suspect it is.
>> >> >
>> >> > Jeff
>> >> >
>> >> > On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> What could cause this kind of unexpected behaviour?
>> >> >> Any assumption??
>> >> >> Sorry for interrupting you.
>> >> >>
>> >> >> Cheers,
>> >> >> S
>> >> >>
>> >> >> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com>
>> >> >> wrote:
>> >> >> > Hmm, at the end of the log, the pg is still inconsistent.  Can you
>> >> >> > attach a ceph pg query on that pg?
>> >> >> > -Sam
>> >> >> >
>> >> >> > On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com>
>> >> >> > wrote:
>> >> >> >> If so, that strongly suggests that the pg was actually never
>> >> >> >> inconsistent in the first place and that the bug is in scrub
>> >> >> >> itself
>> >> >> >> presumably getting confused about an object during a write.  The
>> >> >> >> next
>> >> >> >> step would be to get logs like the above from a pg as it scrubs
>> >> >> >> transitioning from clean to inconsistent.  If it's really a race
>> >> >> >> between scrub and a write, it's probably just non-deterministic,
>> >> >> >> you
>> >> >> >> could set logging on a set of osds and continuously scrub any pgs
>> >> >> >> which only map to those osds until you reproduce the problem.
>> >> >> >> -Sam
>> >> >> >>
>> >> >> >> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com>
>> >> >> >> wrote:
>> >> >> >>> So after the scrub, it came up clean?  The inconsistent/missing
>> >> >> >>> objects reappeared?
>> >> >> >>> -Sam
>> >> >> >>>
>> >> >> >>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald
>> >> >> >>> <jmcdo...@umn.edu>
>> >> >> >>> wrote:
>> >> >> >>>> Hi Sam,
>> >> >> >>>>
>> >> >> >>>> I've done as you requested:
>> >> >> >>>>
>> >> >> >>>> pg 70.459 is active+clean+inconsistent, acting
>> >> >> >>>> [307,210,273,1

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
What filesystem and kernel are you running on the osds?  This (and
your other bug, actually) could be explained by some kind of weird
kernel readdir behavior.
-Sam

On Mon, Mar 7, 2016 at 4:36 PM, Samuel Just <sj...@redhat.com> wrote:
> Hmm, so much for that theory, still looking.  If you can produce
> another set of logs (as before) from scrubbing that pg, it might help.
> -Sam
>
> On Mon, Mar 7, 2016 at 4:34 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>> they're all the same.see attached.
>>
>> On Mon, Mar 7, 2016 at 6:31 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Have you confirmed the versions?
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>> > I have one other very strange event happening, I've opened a ticket on
>>> > it:
>>> > http://tracker.ceph.com/issues/14766
>>> >
>>> > During this migration, OSD failed probably over 400 times while moving
>>> > data
>>> > around.   We move the empty directories and restarted the OSDs.I
>>> > can't
>>> > say if this is related--I have no reason to suspect it is.
>>> >
>>> > Jeff
>>> >
>>> > On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com>
>>> > wrote:
>>> >>
>>> >> What could cause this kind of unexpected behaviour?
>>> >> Any assumption??
>>> >> Sorry for interrupting you.
>>> >>
>>> >> Cheers,
>>> >> S
>>> >>
>>> >> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com> wrote:
>>> >> > Hmm, at the end of the log, the pg is still inconsistent.  Can you
>>> >> > attach a ceph pg query on that pg?
>>> >> > -Sam
>>> >> >
>>> >> > On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com> wrote:
>>> >> >> If so, that strongly suggests that the pg was actually never
>>> >> >> inconsistent in the first place and that the bug is in scrub itself
>>> >> >> presumably getting confused about an object during a write.  The
>>> >> >> next
>>> >> >> step would be to get logs like the above from a pg as it scrubs
>>> >> >> transitioning from clean to inconsistent.  If it's really a race
>>> >> >> between scrub and a write, it's probably just non-deterministic, you
>>> >> >> could set logging on a set of osds and continuously scrub any pgs
>>> >> >> which only map to those osds until you reproduce the problem.
>>> >> >> -Sam
>>> >> >>
>>> >> >> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com>
>>> >> >> wrote:
>>> >> >>> So after the scrub, it came up clean?  The inconsistent/missing
>>> >> >>> objects reappeared?
>>> >> >>> -Sam
>>> >> >>>
>>> >> >>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>>> >> >>> wrote:
>>> >> >>>> Hi Sam,
>>> >> >>>>
>>> >> >>>> I've done as you requested:
>>> >> >>>>
>>> >> >>>> pg 70.459 is active+clean+inconsistent, acting
>>> >> >>>> [307,210,273,191,132,450]
>>> >> >>>>
>>> >> >>>> # for i in 307 210 273 191 132 450 ; do
>>> >> >>>>> ceph tell osd.$i injectargs  '--debug-osd 20 --debug-filestore 20
>>> >> >>>>> --debug-ms 1'
>>> >> >>>>> done
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> # date
>>> >> >>>> Mon Mar  7 16:03:38 CST 2016
>>> >&

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Have you confirmed the versions?
-Sam

On Mon, Mar 7, 2016 at 4:29 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
> I have one other very strange event happening, I've opened a ticket on it:
> http://tracker.ceph.com/issues/14766
>
> During this migration, OSD failed probably over 400 times while moving data
> around.   We move the empty directories and restarted the OSDs.I can't
> say if this is related--I have no reason to suspect it is.
>
> Jeff
>
> On Mon, Mar 7, 2016 at 5:31 PM, Shinobu Kinjo <shinobu...@gmail.com> wrote:
>>
>> What could cause this kind of unexpected behaviour?
>> Any assumption??
>> Sorry for interrupting you.
>>
>> Cheers,
>> S
>>
>> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com> wrote:
>> > Hmm, at the end of the log, the pg is still inconsistent.  Can you
>> > attach a ceph pg query on that pg?
>> > -Sam
>> >
>> > On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com> wrote:
>> >> If so, that strongly suggests that the pg was actually never
>> >> inconsistent in the first place and that the bug is in scrub itself
>> >> presumably getting confused about an object during a write.  The next
>> >> step would be to get logs like the above from a pg as it scrubs
>> >> transitioning from clean to inconsistent.  If it's really a race
>> >> between scrub and a write, it's probably just non-deterministic, you
>> >> could set logging on a set of osds and continuously scrub any pgs
>> >> which only map to those osds until you reproduce the problem.
>> >> -Sam
>> >>
>> >> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>> So after the scrub, it came up clean?  The inconsistent/missing
>> >>> objects reappeared?
>> >>> -Sam
>> >>>
>> >>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald <jmcdo...@umn.edu>
>> >>> wrote:
>> >>>> Hi Sam,
>> >>>>
>> >>>> I've done as you requested:
>> >>>>
>> >>>> pg 70.459 is active+clean+inconsistent, acting
>> >>>> [307,210,273,191,132,450]
>> >>>>
>> >>>> # for i in 307 210 273 191 132 450 ; do
>> >>>>> ceph tell osd.$i injectargs  '--debug-osd 20 --debug-filestore 20
>> >>>>> --debug-ms 1'
>> >>>>> done
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>> >>>>
>> >>>>
>> >>>> # date
>> >>>> Mon Mar  7 16:03:38 CST 2016
>> >>>>
>> >>>>
>> >>>> # ceph pg deep-scrub 70.459
>> >>>> instructing pg 70.459 on osd.307 to deep-scrub
>> >>>>
>> >>>>
>> >>>>
>> >>>> Scrub finished around
>> >>>>
>> >>>> # date
>> >>>> Mon Mar  7 16:13:03 CST 2016
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> I've tar'd+gziped the files which can be downloaded from here.   The
>> >>>> logs
>> >>>> start a minute or two after today at 16:00.
>> >>>>
>> >>>>
>> >>>> https://drive.google.com/folderview?id=0Bzz8TrxFvfema2NQUmotd1BOTnM=sharing
>> >>>>
>> >>>>
>> >>>> Oddly(to me anyways), this pg is now active+clean:
>> >>>>
>> >>>> # ceph pg dump  | grep 70.459
>> >>>> dumped all in format plain
>> >>>> 70.459 21377 0 0 0 0 64515446306 3088 3088 active+clean 2016-03-07
>> >>>> 16:26:57.796537 279563'212832 279602:628151 [307,210,273,191,132,450]
>> >>>> 307
>> >>>> [307,210,273,191,132,450] 307 279563'212832 2016-03-07
>> >>>> 16:12:30.741984
>> >>>> 279563'212832 2016-03-07 16:12:30.741984
>> >>>>
>> >>>>
>> >>>>
>> >>>> 

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Jeffrey: can you confirm through the admin socket the versions running
on each of those osds and include the output in your reply?  I have a
theory about what's causing the objects to be erroneously reported as
inconsistent, but it requires that osd.307 be running a different
version.
-Sam

On Mon, Mar 7, 2016 at 3:34 PM, Samuel Just <sj...@redhat.com> wrote:
> Well, the fact that different objects are being selected as
> inconsistent strongly suggests that the objects are not actually
> inconsistent.  Thus, at the moment my assumption is a bug in scrub...
> -Sam
>
> On Mon, Mar 7, 2016 at 3:31 PM, Shinobu Kinjo <shinobu...@gmail.com> wrote:
>> What could cause this kind of unexpected behaviour?
>> Any assumption??
>> Sorry for interrupting you.
>>
>> Cheers,
>> S
>>
>> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com> wrote:
>>> Hmm, at the end of the log, the pg is still inconsistent.  Can you
>>> attach a ceph pg query on that pg?
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com> wrote:
>>>> If so, that strongly suggests that the pg was actually never
>>>> inconsistent in the first place and that the bug is in scrub itself
>>>> presumably getting confused about an object during a write.  The next
>>>> step would be to get logs like the above from a pg as it scrubs
>>>> transitioning from clean to inconsistent.  If it's really a race
>>>> between scrub and a write, it's probably just non-deterministic, you
>>>> could set logging on a set of osds and continuously scrub any pgs
>>>> which only map to those osds until you reproduce the problem.
>>>> -Sam
>>>>
>>>> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com> wrote:
>>>>> So after the scrub, it came up clean?  The inconsistent/missing
>>>>> objects reappeared?
>>>>> -Sam
>>>>>
>>>>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>> I've done as you requested:
>>>>>>
>>>>>> pg 70.459 is active+clean+inconsistent, acting [307,210,273,191,132,450]
>>>>>>
>>>>>> # for i in 307 210 273 191 132 450 ; do
>>>>>>> ceph tell osd.$i injectargs  '--debug-osd 20 --debug-filestore 20
>>>>>>> --debug-ms 1'
>>>>>>> done
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>>
>>>>>>
>>>>>> # date
>>>>>> Mon Mar  7 16:03:38 CST 2016
>>>>>>
>>>>>>
>>>>>> # ceph pg deep-scrub 70.459
>>>>>> instructing pg 70.459 on osd.307 to deep-scrub
>>>>>>
>>>>>>
>>>>>>
>>>>>> Scrub finished around
>>>>>>
>>>>>> # date
>>>>>> Mon Mar  7 16:13:03 CST 2016
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I've tar'd+gziped the files which can be downloaded from here.   The logs
>>>>>> start a minute or two after today at 16:00.
>>>>>>
>>>>>> https://drive.google.com/folderview?id=0Bzz8TrxFvfema2NQUmotd1BOTnM=sharing
>>>>>>
>>>>>>
>>>>>> Oddly(to me anyways), this pg is now active+clean:
>>>>>>
>>>>>> # ceph pg dump  | grep 70.459
>>>>>> dumped all in format plain
>>>>>> 70.459 21377 0 0 0 0 64515446306 3088 3088 active+clean 2016-03-07
>>>>>> 16:26:57.796537 279563'212832 279602:628151 [307,210,273,191,132,450] 307
>>>>>> [307,210,273,191,132,450] 307 279563'212832 2016-03-07 16:12:30.741984
>>>>>> 279563'212832 2016-03-07 16:12:30.741984
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Jeff
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 7, 2016 at 

Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-07 Thread Samuel Just
Well, the fact that different objects are being selected as
inconsistent strongly suggests that the objects are not actually
inconsistent.  Thus, at the moment my assumption is a bug in scrub...
-Sam

On Mon, Mar 7, 2016 at 3:31 PM, Shinobu Kinjo <shinobu...@gmail.com> wrote:
> What could cause this kind of unexpected behaviour?
> Any assumption??
> Sorry for interrupting you.
>
> Cheers,
> S
>
> On Tue, Mar 8, 2016 at 8:19 AM, Samuel Just <sj...@redhat.com> wrote:
>> Hmm, at the end of the log, the pg is still inconsistent.  Can you
>> attach a ceph pg query on that pg?
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sj...@redhat.com> wrote:
>>> If so, that strongly suggests that the pg was actually never
>>> inconsistent in the first place and that the bug is in scrub itself
>>> presumably getting confused about an object during a write.  The next
>>> step would be to get logs like the above from a pg as it scrubs
>>> transitioning from clean to inconsistent.  If it's really a race
>>> between scrub and a write, it's probably just non-deterministic, you
>>> could set logging on a set of osds and continuously scrub any pgs
>>> which only map to those osds until you reproduce the problem.
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sj...@redhat.com> wrote:
>>>> So after the scrub, it came up clean?  The inconsistent/missing
>>>> objects reappeared?
>>>> -Sam
>>>>
>>>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald <jmcdo...@umn.edu> wrote:
>>>>> Hi Sam,
>>>>>
>>>>> I've done as you requested:
>>>>>
>>>>> pg 70.459 is active+clean+inconsistent, acting [307,210,273,191,132,450]
>>>>>
>>>>> # for i in 307 210 273 191 132 450 ; do
>>>>>> ceph tell osd.$i injectargs  '--debug-osd 20 --debug-filestore 20
>>>>>> --debug-ms 1'
>>>>>> done
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>>
>>>>>
>>>>> # date
>>>>> Mon Mar  7 16:03:38 CST 2016
>>>>>
>>>>>
>>>>> # ceph pg deep-scrub 70.459
>>>>> instructing pg 70.459 on osd.307 to deep-scrub
>>>>>
>>>>>
>>>>>
>>>>> Scrub finished around
>>>>>
>>>>> # date
>>>>> Mon Mar  7 16:13:03 CST 2016
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I've tar'd+gziped the files which can be downloaded from here.   The logs
>>>>> start a minute or two after today at 16:00.
>>>>>
>>>>> https://drive.google.com/folderview?id=0Bzz8TrxFvfema2NQUmotd1BOTnM=sharing
>>>>>
>>>>>
>>>>> Oddly(to me anyways), this pg is now active+clean:
>>>>>
>>>>> # ceph pg dump  | grep 70.459
>>>>> dumped all in format plain
>>>>> 70.459 21377 0 0 0 0 64515446306 3088 3088 active+clean 2016-03-07
>>>>> 16:26:57.796537 279563'212832 279602:628151 [307,210,273,191,132,450] 307
>>>>> [307,210,273,191,132,450] 307 279563'212832 2016-03-07 16:12:30.741984
>>>>> 279563'212832 2016-03-07 16:12:30.741984
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Jeff
>>>>>
>>>>>
>>>>> On Mon, Mar 7, 2016 at 4:11 PM, Samuel Just <sj...@redhat.com> wrote:
>>>>>>
>>>>>> I think the unfound object on repair is fixed by
>>>>>> d51806f5b330d5f112281fbb95ea6addf994324e (not in hammer yet).  I
>>>>>> opened http://tracker.ceph.com/issues/15002 for the backport and to
>>>>>> make sure it's covered in ceph-qa-suite.  No idea at this time why the
>>>>>> objects are disappearing though.
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Mar 7, 2016 at 1:57 PM, Samuel Just <sj...@redhat.com> wrote:
>>>>>> > The one just scrubbed and now inconsistent.
>

  1   2   3   >