Re: [ceph-users] disk usage reported incorrectly
Fix is on its way too... See https://github.com/ceph/ceph/pull/28978

On 7/17/2019 8:55 PM, Paul Mezzanini wrote:
> Oh my. That's going to hurt with 788 OSDs. Time for some creative shell
> scripts and stepping through the nodes. I'll report back.
>
> --
> Paul Mezzanini
> Sr Systems Administrator / Engineer, Research Computing
> Information & Technology Services
> Finance & Administration
> Rochester Institute of Technology
> o:(585) 475-3245 | pfm...@rit.edu
>
> CONFIDENTIALITY NOTE: The information transmitted, including attachments,
> is intended only for the person(s) or entity to which it is addressed and
> may contain confidential and/or privileged material. Any review,
> retransmission, dissemination or other use of, or taking of any action in
> reliance upon this information by persons or entities other than the
> intended recipient is prohibited. If you received this in error, please
> contact the sender and destroy any copies of this information.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] disk usage reported incorrectly
Oh my. That's going to hurt with 788 OSDs. Time for some creative shell scripts and stepping through the nodes. I'll report back.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

From: Igor Fedotov
Sent: Wednesday, July 17, 2019 11:33 AM
To: Paul Mezzanini; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] disk usage reported incorrectly

> Forgot to provide a workaround... If that's the case then you need to
> repair each OSD with the corresponding command in ceph-objectstore-tool...
>
> Thanks,
> Igor.
Re: [ceph-users] disk usage reported incorrectly
Forgot to provide a workaround... If that's the case then you need to repair each OSD with the corresponding command in ceph-objectstore-tool...

Thanks,
Igor.

On 7/17/2019 6:29 PM, Paul Mezzanini wrote:
> Sometime after our upgrade to Nautilus our disk usage statistics went off
> the rails. [...]
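Igor doesn't name the exact command here, so as a sketch only: stepping through the OSDs on one node could be scripted roughly as below. The `--op repair` operation and the default data path are assumptions to verify against your Ceph release; the tool needs the OSD stopped, hence the stop/start around it. The loop only prints the steps so they can be reviewed before anything is executed.

```shell
# Sketch only: generate the per-OSD repair steps for review, then run
# them by hand (or pipe to sh once verified). Assumptions: systemd-managed
# OSDs, the default /var/lib/ceph/osd/ data path, and that
# `ceph-objectstore-tool --op repair` is the command Igor refers to --
# check the exact op name against your Ceph version before executing.
for id in 12 13 14; do   # substitute the OSD ids present on this node
  echo "systemctl stop ceph-osd@${id}"
  echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${id} --op repair"
  echo "systemctl start ceph-osd@${id}"
done
```

With 788 OSDs, letting the cluster settle between OSDs (e.g. waiting for HEALTH_OK) before moving on would keep the impact of each stop small.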
Re: [ceph-users] disk usage reported incorrectly
Hi Paul,

There was a post from Sage named "Pool stats issue with upgrades to nautilus" recently. Perhaps that's the case, if you added a new OSD or repaired an existing one...

Thanks,
Igor

On 7/17/2019 6:29 PM, Paul Mezzanini wrote:
> Sometime after our upgrade to Nautilus our disk usage statistics went off
> the rails. [...]
[ceph-users] disk usage reported incorrectly
Sometime after our upgrade to Nautilus our disk usage statistics went off the rails. I can't tell you exactly when it broke, but I know that after the initial upgrade it worked at least for a bit.

Correct numbers should be something similar to this (copy/pasted from the autoscale-status report):

POOL             SIZE
cephfs_metadata  327.1G
cold-ec          98.36T
ceph-bulk-3r     142.6T
cephfs_data      31890G
ceph-hot-2r      5276G
kgcoe-cinder     103.2T
rbd              3098

Instead, we now show:

POOL             SIZE
cephfs_metadata  362.9G  (correct)
cold-ec          607.2G  (wrong)
ceph-bulk-3r     5186G   (wrong)
cephfs_data      1654G   (wrong)
ceph-hot-2r      5884G   (correct, I think)
kgcoe-cinder     5761G   (wrong)
rbd              128.0k

`ceph fs status` reports similar numbers. cold-ec, ceph-hot-2r and cephfs_data are all CephFS data pools, and cephfs_metadata is, unsurprisingly, CephFS metadata. The remaining pools are all used for RBD.

Interestingly, the `ceph df` output for raw storage looks correct for each drive class while the pool usage is wrong:

RAW STORAGE:
    CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
    hdd        6.3 PiB  5.2 PiB  1.1 PiB  1.1 PiB   17.08
    nvme       175 TiB  161 TiB  14 TiB   14 TiB    7.82
    nvme-meta  14 TiB   11 TiB   2.2 TiB  2.5 TiB   18.45
    TOTAL      6.5 PiB  5.4 PiB  1.1 PiB  1.1 PiB   16.84

POOLS:
    POOL             ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    kgcoe-cinder     24  1.9 TiB  29.49M   5.6 TiB  0.32   582 TiB
    ceph-bulk-3r     32  1.7 TiB  88.28M   5.1 TiB  0.29   582 TiB
    cephfs_data      35  518 GiB  135.68M  1.6 TiB  0.09   582 TiB
    cephfs_metadata  36  363 GiB  5.63M    363 GiB  3.35   3.4 TiB
    rbd              37  931 B    5        128 KiB  0      582 TiB
    ceph-hot-2r      50  5.7 TiB  18.63M   5.7 TiB  3.72   74 TiB
    cold-ec          51  417 GiB  105.23M  607 GiB  0.02   2.1 PiB

Everything is on "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)" and kernel 5.0.21 or 5.0.9. I'm actually doing the patching now to pull the ceph cluster up to 5.0.21, same as the clients. I'm not really sure where to dig into this one. Everything is working fine except disk usage reporting. This also completely blows up the autoscaler.

I feel like the question is obvious but I'll state it anyway: how do I get this issue resolved?

Thanks
-paul

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu