Re: [ceph-users] [forwarded via lists.ceph.com] Re: MDS Crashing 14.2.1

2019-05-17 Thread Sergey Malinin
I've had a similar problem twice (with Mimic), and in both cases I ended up
backing up and restoring to a fresh fs. Did you run an MDS scrub after recovery? In my
experience, recovering duplicate inodes is not a trivial process: in one case my MDS
kept crashing on unlink() in certain directories, and in the other case newly created
fs entries would not pass MDS scrub due to linkage errors.


May 17, 2019 3:40 PM, "Adam Tygart"  wrote:

> I followed the docs from here:
> http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> 
> I exported the journals as a backup for both ranks. I was running 2
> active MDS daemons at the time.
> 
> cephfs-journal-tool --rank=combined:0 journal export
> cephfs-journal-0-201905161412.bin
> cephfs-journal-tool --rank=combined:1 journal export
> cephfs-journal-1-201905161412.bin
> 
> I recovered the Dentries on both ranks
> cephfs-journal-tool --rank=combined:0 event recover_dentries summary
> cephfs-journal-tool --rank=combined:1 event recover_dentries summary
> 
> I reset the journals of both ranks:
> cephfs-journal-tool --rank=combined:1 journal reset
> cephfs-journal-tool --rank=combined:0 journal reset
> 
> Then I reset the session table
> cephfs-table-tool all reset session
> 
> Once that was done, reboot all machines that were talking to cephfs
> (or at least unmount/remount).
> 
> On Fri, May 17, 2019 at 2:30 AM  wrote:
> 
>> Hi
>> Can you tell me the detail recovery cmd ?
>> 
>> I just started learning cephfs ,I would be grateful.
>> 
>> From: Adam Tygart
>> To: Ceph Users
>> Date: 2019/05/17 09:04
>> Subject: [forwarded via lists.ceph.com] Re: [ceph-users] MDS Crashing 14.2.1
>> Sender: "ceph-users"
>> 
>> 
>> I ended up backing up the journals of the MDS ranks, running recover_dentries for
>> both of them, and resetting the journals and session table. It is back up. The
>> recover_dentries stage didn't show any errors, so I'm not even sure why the MDS
>> was asserting about duplicate inodes.
>> 
>> --
>> Adam
>> 
>> On Thu, May 16, 2019, 13:52 Adam Tygart  wrote:
>> Hello all,
>> 
>> The rank 0 mds is still asserting. Is this duplicate inode situation
>> one that I should be considering using the cephfs-journal-tool to
>> export, recover dentries and reset?
>> 
>> Thanks,
>> Adam
>> 
>> On Thu, May 16, 2019 at 12:51 AM Adam Tygart  wrote:
>> 
>> Hello all,
>> 
>> I've got a 30 node cluster serving up lots of CephFS data.
>> 
>> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
>> this week.
>> 
>> We've been running 2 MDS daemons in an active-active setup. Tonight
>> one of the metadata daemons crashed with the following several times:
>> 
>> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
>> In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
>> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_hack_allow_loading_invalid_metadata"))
>> 
>> I made a quick decision to move to a single MDS because I saw
>> set_primary_parent, and I thought it might be related to auto
>> balancing between the metadata servers.
>> 
>> This caused one MDS to fail, the other crashed, and now rank 0 loads,
>> goes active and then crashes with the following:
>> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
>> In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
>> 258: FAILED ceph_assert(!p)
>> 
>> It now looks like we somehow have a duplicate inode in the MDS journal?
>> 
>> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
>> then became rank one after the crash and attempted drop to one active
>> MDS
>> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
>> and crashed
>> 
>> Anyone have any thoughts on this?
>> 
>> Thanks,
>> Adam
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 

Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-01 Thread Sergey Malinin
These steps correspond pretty well to
http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
Were you able to replay the journal manually with no issues? IIRC,
"cephfs-journal-tool recover_dentries" would hit OOM just like the MDS did while
replaying, and this has already been discussed on this list.
April 2, 2019 1:37 AM, "Pickett, Neale T" <ne...@lanl.gov> wrote:
Here is what I wound up doing to fix this: 
* Bring down all MDSes so they stop flapping 
* Back up journal (as seen in previous message) 
* Apply journal manually 
* Reset journal manually 
* Clear session table 
* Clear other tables (not sure I needed to do this) 
* Mark FS down 
* Mark the rank 0 MDS as failed 
* Reset the FS (yes, I really mean it) 
* Restart MDSes 
* Finally get some sleep
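For reference, a rough command-level sketch of the steps above (the fs name, rank, and backup
path are placeholders, and exact flags vary between releases -- treat this as an outline rather
than a verified runbook):

systemctl stop ceph-mds.target                                         # stop all MDSes so they stop flapping
cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin       # back up the journal
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary  # apply the journal manually
cephfs-journal-tool --rank=<fs_name>:0 journal reset                   # reset the journal
cephfs-table-tool all reset session                                    # clear the session table
cephfs-table-tool all reset snap                                       # optionally clear the other tables
cephfs-table-tool all reset inode
ceph fs set <fs_name> cluster_down true                                # mark the fs down ('down true' on newer releases)
ceph mds fail <fs_name>:0                                              # mark the rank 0 MDS as failed
ceph fs reset <fs_name> --yes-i-really-mean-it                         # reset the fs
systemctl start ceph-mds.target                                        # restart the MDSes
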
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and many small files

2019-04-01 Thread Sergey Malinin
I haven't had any issues with a 4k allocation size in a cluster holding 189M files.
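(For context: the allocation size is the BlueStore min_alloc_size, which only takes effect when
an OSD is created, so existing OSDs must be redeployed to change it. A minimal ceph.conf sketch
with illustrative 4K values:)

[osd]
# applied at OSD creation time only; existing OSDs must be redeployed to pick this up
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096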

April 1, 2019 2:04 PM, "Paul Emmerich"  wrote:

> I'm not sure about the real-world impacts of a lower min alloc size or
> the rationale behind the default values for HDDs (64KB) and SSDs (16KB).
> 
> Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Resizing a cache tier rbd

2019-03-27 Thread Sergey Malinin
March 27, 2019 1:09 PM, "Fyodor Ustinov" <u...@ufm.su> wrote:
> Tiering - deprecated? Where can I read more about this?
Looks like it was deprecated in Red Hat Ceph Storage in 2016:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/thread.html#13867
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS performance improved in 13.2.5?

2019-03-20 Thread Sergey Malinin
Hello,
Yesterday I upgraded from 13.2.2 to 13.2.5 and so far I have seen nothing but
significant improvements in MDS operations. Needless to say I'm happy, but I
didn't notice anything related in the release notes. Am I missing something,
possibly new configuration settings?

Screenshots below:
https://prnt.sc/n0qzfp
https://prnt.sc/n0qzd5

And yes, the Ceph nodes and clients had their kernels upgraded to v5.0.3.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic and cephfs

2019-02-26 Thread Sergey Malinin
I've been using a fresh 13.2.2 install in production for 4 months now without any
issues.


February 25, 2019 10:17 PM, "Andras Pataki"  
wrote:

> Hi ceph users,
> 
> As I understand, cephfs in Mimic had significant issues up to and 
> including version 13.2.2.  With some critical patches in Mimic 13.2.4, 
> is cephfs now production quality in Mimic?  Are there folks out there 
> using it in a production setting?  If so, could you share your 
> experience with it (as compared to Luminous)?
> 
> Thanks,
> 
> Andras
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
With cppool you got a bunch of useless zero-sized objects because, unlike
"export", cppool does not copy the omap data which actually holds all the inode
info.
I suggest truncating the journals only as a way to reduce downtime, followed by
an immediate backup of the available files to a fresh fs. After resetting the
journals, the part of your fs covered by unflushed "UPDATE" entries *will* become
inconsistent. The MDS may start to segfault occasionally, but that can be avoided
by forcing read-only mode (in this mode the MDS journal will not flush, so you
will need extra disk space).
If you want to get the original fs recovered and fully functional, you need to
somehow replay the journal (I'm unsure whether the cephfs-data-scan tool operates
on journal entries).
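For reference, a minimal sketch of the metadata pool backup being discussed (pool and file names
are placeholders); unlike "rados cppool", "rados export" serializes the omap data as well:

rados -p cephfs_metadata export /backup/cephfs_metadata.bin           # preserves omap (inode/dentry info)
rados -p cephfs_metadata_restore import /backup/cephfs_metadata.bin   # restore into a pre-created pool later if needed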



> On 6.11.2018, at 03:43, Rhian Resnick  wrote:
> 
> Workload is mixed. 
> 
> We ran a rados cpool to backup the metadata pool. 
> 
> So you're thinking that truncating the journal and purge queue (we are on Luminous) 
> plus a reset could bring us online, missing just data from that day (mostly from 
> when the issue started)?
> 
> If so we could continue our scan into our recovery partition and give it a 
> try tomorrow after discussions with our recovery team. 
> 
> 
> 
> 
> On Mon, Nov 5, 2018 at 7:40 PM Sergey Malinin  <mailto:h...@newmail.com>> wrote:
> What was your recent workload? There are chances not to lose much if it was 
> mostly read ops. If such, you must backup your metadata pool via "rados 
> export" in order to preserve omap data, then try truncating journals (along 
> with purge queue if supported by your ceph version), wiping session table, 
> and resetting the fs.
> 
> 
>> On 6.11.2018, at 03:26, Rhian Resnick > <mailto:xan...@sepiidae.com>> wrote:
>> 
>> That was our original plan. So we migrated to bigger disks and have space 
>> but recover dentry uses up all our memory (128 GB) and crashes out. 
>> 
>> On Mon, Nov 5, 2018 at 7:23 PM Sergey Malinin > <mailto:h...@newmail.com>> wrote:
>> I had the same problem with multi-mds. I solved it by freeing up a little 
>> space on OSDs, doing "recover dentries", truncating the journal, and then 
>> "fs reset". After that I was able to revert to single-active MDS and kept on 
>> running for a year until it failed on 13.2.2 upgrade :))
>> 
>> 
>>> On 6.11.2018, at 03:18, Rhian Resnick >> <mailto:xan...@sepiidae.com>> wrote:
>>> 
>>> Our metadata pool went from 700 MB to 1 TB in size in a few hours. It used all 
>>> the space on the OSDs and now 2 ranks report damage. The recovery tools on the 
>>> journal fail as they run out of memory, leaving us with the option of 
>>> truncating the journal and losing data, or recovering using the scan tools. 
>>> 
>>> Any ideas on solutions are welcome. I posted all the logs and cluster 
>>> design previously but am happy to do so again. We are not desperate but we 
>>> are hurting with this long downtime. 
>>> 
>>> On Mon, Nov 5, 2018 at 7:05 PM Sergey Malinin >> <mailto:h...@newmail.com>> wrote:
>>> What kind of damage have you had? Maybe it is worth trying to get MDS to 
>>> start and backup valuable data instead of doing long running recovery?
>>> 
>>> 
>>>> On 6.11.2018, at 02:59, Rhian Resnick >>> <mailto:xan...@sepiidae.com>> wrote:
>>>> 
>>>> Sounds like I get to have some fun tonight. 
>>>> 
>>>> On Mon, Nov 5, 2018, 6:39 PM Sergey Malinin >>> <mailto:h...@newmail.com> wrote:
>>>> inode linkage (i.e. folder hierarchy) and file names are stored in omap 
>>>> data of objects in metadata pool. You can write a script that would 
>>>> traverse through all the metadata pool to find out file names correspond 
>>>> to objects in data pool and fetch required files via 'rados get' command.
>>>> 
>>>> > On 6.11.2018, at 02:26, Sergey Malinin >>> > <mailto:h...@newmail.com>> wrote:
>>>> > 
>>>> > Yes, 'rados -h'.
>>>> > 
>>>> > 
>>>> >> On 6.11.2018, at 02:25, Rhian Resnick >>> >> <mailto:xan...@sepiidae.com>> wrote:
>>>> >> 
>>>> >> Does a tool exist to recover files from a cephfs data partition? We are 
>>>> >> rebuilding metadata but have a user who needs data asap.
>>>> >> ___
>>>> >> ceph-users mailing list
>>>> >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>> >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>>> > 
>>>> 
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
What was your recent workload? There is a chance you won't lose much if it was
mostly read ops. If so, back up your metadata pool via "rados export"
in order to preserve the omap data, then try truncating the journals (along with the
purge queue if supported by your Ceph version), wiping the session table, and resetting
the fs.


> On 6.11.2018, at 03:26, Rhian Resnick  wrote:
> 
> That was our original plan. So we migrated to bigger disks and have space but 
> recover dentry uses up all our memory (128 GB) and crashes out. 
> 
> On Mon, Nov 5, 2018 at 7:23 PM Sergey Malinin  <mailto:h...@newmail.com>> wrote:
> I had the same problem with multi-mds. I solved it by freeing up a little 
> space on OSDs, doing "recover dentries", truncating the journal, and then "fs 
> reset". After that I was able to revert to single-active MDS and kept on 
> running for a year until it failed on 13.2.2 upgrade :))
> 
> 
>> On 6.11.2018, at 03:18, Rhian Resnick > <mailto:xan...@sepiidae.com>> wrote:
>> 
>> Our metadata pool went from 700 MB to 1 TB in size in a few hours. It used all 
>> the space on the OSDs and now 2 ranks report damage. The recovery tools on the 
>> journal fail as they run out of memory, leaving us with the option of 
>> truncating the journal and losing data, or recovering using the scan tools. 
>> 
>> Any ideas on solutions are welcome. I posted all the logs and cluster 
>> design previously but am happy to do so again. We are not desperate but we 
>> are hurting with this long downtime. 
>> 
>> On Mon, Nov 5, 2018 at 7:05 PM Sergey Malinin > <mailto:h...@newmail.com>> wrote:
>> What kind of damage have you had? Maybe it is worth trying to get MDS to 
>> start and backup valuable data instead of doing long running recovery?
>> 
>> 
>>> On 6.11.2018, at 02:59, Rhian Resnick >> <mailto:xan...@sepiidae.com>> wrote:
>>> 
>>> Sounds like I get to have some fun tonight. 
>>> 
>>> On Mon, Nov 5, 2018, 6:39 PM Sergey Malinin >> <mailto:h...@newmail.com> wrote:
>>> inode linkage (i.e. folder hierarchy) and file names are stored in omap 
>>> data of objects in metadata pool. You can write a script that would 
>>> traverse through all the metadata pool to find out file names correspond to 
>>> objects in data pool and fetch required files via 'rados get' command.
>>> 
>>> > On 6.11.2018, at 02:26, Sergey Malinin >> > <mailto:h...@newmail.com>> wrote:
>>> > 
>>> > Yes, 'rados -h'.
>>> > 
>>> > 
>>> >> On 6.11.2018, at 02:25, Rhian Resnick >> >> <mailto:xan...@sepiidae.com>> wrote:
>>> >> 
>>> >> Does a tool exist to recover files from a cephfs data partition? We are 
>>> >> rebuilding metadata but have a user who needs data asap.
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>> > 
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
I had the same problem with multi-mds. I solved it by freeing up a little space 
on OSDs, doing "recover dentries", truncating the journal, and then "fs reset". 
After that I was able to revert to single-active MDS and kept on running for a 
year until it failed on 13.2.2 upgrade :))


> On 6.11.2018, at 03:18, Rhian Resnick  wrote:
> 
> Our metadata pool went from 700 MB to 1 TB in size in a few hours. It used all 
> the space on the OSDs and now 2 ranks report damage. The recovery tools on the 
> journal fail as they run out of memory, leaving us with the option of truncating 
> the journal and losing data, or recovering using the scan tools. 
> 
> Any ideas on solutions are welcome. I posted all the logs and cluster 
> design previously but am happy to do so again. We are not desperate but we 
> are hurting with this long downtime. 
> 
> On Mon, Nov 5, 2018 at 7:05 PM Sergey Malinin  <mailto:h...@newmail.com>> wrote:
> What kind of damage have you had? Maybe it is worth trying to get MDS to 
> start and backup valuable data instead of doing long running recovery?
> 
> 
>> On 6.11.2018, at 02:59, Rhian Resnick > <mailto:xan...@sepiidae.com>> wrote:
>> 
>> Sounds like I get to have some fun tonight. 
>> 
>> On Mon, Nov 5, 2018, 6:39 PM Sergey Malinin > <mailto:h...@newmail.com> wrote:
>> inode linkage (i.e. folder hierarchy) and file names are stored in omap data 
>> of objects in metadata pool. You can write a script that would traverse 
>> through all the metadata pool to find out file names correspond to objects 
>> in data pool and fetch required files via 'rados get' command.
>> 
>> > On 6.11.2018, at 02:26, Sergey Malinin > > <mailto:h...@newmail.com>> wrote:
>> > 
>> > Yes, 'rados -h'.
>> > 
>> > 
>> >> On 6.11.2018, at 02:25, Rhian Resnick > >> <mailto:xan...@sepiidae.com>> wrote:
>> >> 
>> >> Does a tool exist to recover files from a cephfs data partition? We are 
>> >> rebuilding metadata but have a user who needs data asap.
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>> > 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
What kind of damage have you had? Maybe it is worth trying to get MDS to start 
and backup valuable data instead of doing long running recovery?


> On 6.11.2018, at 02:59, Rhian Resnick  wrote:
> 
> Sounds like I get to have some fun tonight. 
> 
> On Mon, Nov 5, 2018, 6:39 PM Sergey Malinin  <mailto:h...@newmail.com> wrote:
> inode linkage (i.e. folder hierarchy) and file names are stored in omap data 
> of objects in metadata pool. You can write a script that would traverse 
> through all the metadata pool to find out file names correspond to objects in 
> data pool and fetch required files via 'rados get' command.
> 
> > On 6.11.2018, at 02:26, Sergey Malinin  > <mailto:h...@newmail.com>> wrote:
> > 
> > Yes, 'rados -h'.
> > 
> > 
> >> On 6.11.2018, at 02:25, Rhian Resnick  >> <mailto:xan...@sepiidae.com>> wrote:
> >> 
> >> Does a tool exist to recover files from a cephfs data partition? We are 
> >> rebuilding metadata but have a user who needs data asap.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> > 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
Inode linkage (i.e. the folder hierarchy) and file names are stored in the omap data of
objects in the metadata pool. You can write a script that traverses the entire
metadata pool to find out which file names correspond to which objects in the data pool,
and then fetches the required files via the 'rados get' command.
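A rough sketch of what such a script builds on (untested; pool names and the inode number are
placeholders, and the object naming assumed here is the usual <inode-hex>.<index> scheme):

# dentry (file) names of a directory live as omap keys of that directory's object
# in the metadata pool; the root directory is inode 0x1, i.e. object 1.00000000
rados -p cephfs_metadata listomapkeys 1.00000000
# a dentry's omap value encodes the inode number it points to
rados -p cephfs_metadata getomapval 1.00000000 somefile_head
# file data lives in the data pool as <inode-hex>.<chunk-index>; fetch the first chunk
rados -p cephfs_data get 10000000abc.00000000 /tmp/somefile.part0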

> On 6.11.2018, at 02:26, Sergey Malinin  wrote:
> 
> Yes, 'rados -h'.
> 
> 
>> On 6.11.2018, at 02:25, Rhian Resnick  wrote:
>> 
>> Does a tool exist to recover files from a cephfs data partition? We are 
>> rebuilding metadata but have a user who needs data asap.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover files from cephfs data pool

2018-11-05 Thread Sergey Malinin
Yes, 'rados -h'.


> On 6.11.2018, at 02:25, Rhian Resnick  wrote:
> 
> Does a tool exist to recover files from a cephfs data partition? We are 
> rebuilding metadata but have a user who needs data asap.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] io-schedulers

2018-11-05 Thread Sergey Malinin
Using "noop" makes sense only with SSD/NVMe drives. "noop" is a simple FIFO, and
using it with HDDs can result in unexpected blocking of useful IO when the queue is
flooded with a burst of IO requests, such as a background purge, which effectively
becomes foreground work in that case.
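(For reference, a quick sketch of checking and switching the scheduler on an OSD's data disk;
sdX is a placeholder and the runtime change does not persist across reboots:)

cat /sys/block/sdX/queue/scheduler          # shows e.g.: noop [deadline] cfq
echo noop > /sys/block/sdX/queue/scheduler  # switch at runtime; use a udev rule to persist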


> On 5.11.2018, at 23:39, Jack  wrote:
> 
> We simply use the "noop" scheduler on our nand-based ceph cluster
> 
> 
> On 11/05/2018 09:33 PM, solarflow99 wrote:
>> I'm interested to know about this too.
>> 
>> 
>> On Mon, Nov 5, 2018 at 10:45 AM Bastiaan Visser  wrote:
>> 
>>> 
>>> There are lots of rumors around about the benefit of changing
>>> io-schedulers for OSD disks.
>>> Even some benchmarks can be found, but they are all more than a few years
>>> old.
>>> Since ceph is moving forward with quite a pace, i am wondering what the
>>> common practice is to use as io-scheduler on OSD's.
>>> 
>>> And since blk-mq is around these days, are the multi-queue schedules
>>> already being used in production clusters?
>>> 
>>> Regards,
>>> Bastiaan
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] io-schedulers

2018-11-05 Thread Sergey Malinin
It depends on the store backend. BlueStore has its own scheduler which works
properly only with CFQ, while filestore configuration is narrowed down to setting the
OSD IO threads' scheduling class and priority, just like using the 'ionice' utility.
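(A sketch of the filestore-era knobs I believe are being referred to here -- they set the
scheduling class/priority of the OSD disk thread, analogous to ionice, and take effect only
with CFQ on the underlying disk:)

[osd]
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7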


> On 5.11.2018, at 21:45, Bastiaan Visser  wrote:
> 
> 
> There are lots of rumors around about the benefit of changing io-schedulers 
> for OSD disks.
> Even some benchmarks can be found, but they are all more than a few years 
> old. 
> Since ceph is moving forward with quite a pace, i am wondering what the 
> common practice is to use as io-scheduler on OSD's.
> 
> And since blk-mq is around these days, are the multi-queue schedules already 
> being used in production clusters? 
> 
> Regards,
> Bastiaan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] speeding up ceph

2018-11-05 Thread Sergey Malinin
scan_extents should not saturate the links much because it doesn't read entire
objects; it only performs a rados stat() call on them, which returns just a few
bytes of data.
You can follow the scan progress by monitoring pool stats via 'rados df'
or daemon metrics.
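(For example, something along these lines -- the pool name is a placeholder:)

watch -n 5 'rados df'              # per-pool object counts and read/write totals
ceph osd pool stats cephfs_data    # current client IO rates generated by the scan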


> On 5.11.2018, at 20:02, Rhian Resnick  wrote:
> 
> What type of bandwidth did you see during the recovery process? We are seeing 
> around 2 Mbps on each box running 20 processes each.
> 
> On Mon, Nov 5, 2018 at 11:31 AM Sergey Malinin  <mailto:h...@newmail.com>> wrote:
> Although I was advised not to use caching during recovery, I didn't notice 
> any improvements after disabling it.
> 
> 
> > On 5.11.2018, at 17:32, Rhian Resnick  > <mailto:xan...@sepiidae.com>> wrote:
> > 
> > We are running cephfs-data-scan to rebuild metadata. Would changing the 
> > cache tier mode of our cephfs data partition improve performance? If so 
> > what should we switch to?
> > 
> > Thanks
> > 
> > Rhian
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] speeding up ceph

2018-11-05 Thread Sergey Malinin
Although I was advised not to use caching during recovery, I didn't notice any 
improvements after disabling it.


> On 5.11.2018, at 17:32, Rhian Resnick  wrote:
> 
> We are running cephfs-data-scan to rebuild metadata. Would changing the cache 
> tier mode of our cephfs data partition improve performance? If so what should 
> we switch to?
> 
> Thanks
> 
> Rhian
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-data-scan

2018-11-04 Thread Sergey Malinin
Keep in mind that in order for the workers not to overlap each other you need
to set the total number of workers (worker_m) to nodes*20 and assign each node
its own processing range (worker_n).
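A sketch of how that split might look with 3 nodes running 20 workers each (the data pool name
is a placeholder; flag spelling follows the cephfs-data-scan usage):

# total workers: 3 nodes * 20 = 60; node 0 runs worker_n 0..19,
# node 1 runs 20..39, node 2 runs 40..59 -- one process per worker_n, e.g. on node 0:
cephfs-data-scan scan_extents --worker_n 0 --worker_m 60 cephfs_data
cephfs-data-scan scan_extents --worker_n 1 --worker_m 60 cephfs_data
# ... up to --worker_n 19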
On Nov 4, 2018, 03:43 +0300, Rhian Resnick , wrote:
> Sounds like we are going to restart with 20 threads on each storage node.
>
> > On Sat, Nov 3, 2018 at 8:26 PM Sergey Malinin  wrote:
> > > scan_extents using 8 threads took 82 hours for my cluster holding 120M 
> > > files on 12 OSDs with 1gbps between nodes. I would have gone with lot 
> > > more threads if I had known it only operated on data pool and the only 
> > > problem was network latency. If I recall correctly, each worker used up 
> > > to 800mb ram so beware the OOM killer.
> > > scan_inodes runs several times faster but I don’t remember exact timing.
> > > In your case I believe scan_extents & scan_inodes can be done in a few 
> > > hours by running the tool on each OSD node, but scan_links will be 
> > > painfully slow due to it’s single-threaded nature.
> > > In my case I ended up getting MDS to start and copied all data to a fresh 
> > > filesystem ignoring errors.
> > > On Nov 4, 2018, 02:22 +0300, Rhian Resnick , wrote:
> > > > For a 150TB file system with 40 Million files how many cephfs-data-scan 
> > > > threads should be used? Or what is the expected run time. (we have 160 
> > > > osd with 4TB disks.)
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-data-scan

2018-11-03 Thread Sergey Malinin
scan_extents using 8 threads took 82 hours for my cluster holding 120M files on
12 OSDs with 1Gbps between nodes. I would have gone with a lot more threads if I
had known it only operated on the data pool and the only problem was network
latency. If I recall correctly, each worker used up to 800MB of RAM, so beware the
OOM killer.
scan_inodes runs several times faster, but I don't remember the exact timing.
In your case I believe scan_extents & scan_inodes can be done in a few hours by
running the tool on each OSD node, but scan_links will be painfully slow due to
its single-threaded nature.
In my case I ended up getting the MDS to start and copied all data to a fresh
filesystem, ignoring errors.
On Nov 4, 2018, 02:22 +0300, Rhian Resnick , wrote:
> For a 150TB file system with 40 Million files how many cephfs-data-scan 
> threads should be used? Or what is the expected run time. (we have 160 osd 
> with 4TB disks.)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub errors

2018-10-23 Thread Sergey Malinin
There is an osd_scrub_auto_repair setting which defaults to 'false'.
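(To enable it, something like the following -- a sketch; on 12.2.x it can also be injected at
runtime:)

# ceph.conf on the OSD hosts (restart the OSDs to apply)
[osd]
osd_scrub_auto_repair = true

# or inject at runtime
ceph tell osd.* injectargs '--osd_scrub_auto_repair=true'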


> On 23.10.2018, at 12:12, Dominque Roux  wrote:
> 
> Hi all,
> 
> We lately faced several scrub errors.
> All of them were more or less easily fixed with the ceph pg repair X.Y
> command.
> 
> We're using ceph version 12.2.7 and have SSD and HDD pools.
> 
> Is there a way to prevent our datastore from these kind of errors, or is
> there a way to automate the fix (It would be rather easy to create a
> bash script)
> 
> Thank you very much for your help!
> 
> Best regards,
> 
> Dominique
> 
> -- 
> Your Swiss, Open Source and IPv6 Virtual Machine. Now on
> www.datacenterlight.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Sergey Malinin
It is just a block size and it has no impact on data safety, except that OSDs
need to be redeployed in order for them to create bluefs with the given block size.


> On 21.10.2018, at 19:04, Waterbly, Dan  wrote:
> 
> Thanks Sergey!
> 
> Do you know where I can find details on the repercussions of adjusting this 
> value? Performance (read/writes), for once, not critical for us, data 
> durability and disaster recovery is our focus.
> 
> -Dan
> 
> Get Outlook for iOS
> 
> 
> On Sun, Oct 21, 2018 at 8:37 AM -0700, "Sergey Malinin"  <mailto:h...@newmail.com>> wrote:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html 
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html>
> 
> 
>> On 21.10.2018, at 16:12, Waterbly, Dan > <mailto:dan.water...@sos.wa.gov>> wrote:
>> 
>> Awesome! Thanks Serian!
>> 
>> Do you know where the 64KB comes from? Can that be tuned down for a cluster 
>> holding smaller objects?
>> 
>> Get Outlook for iOS
>> 
>> 
>> On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban" 
>> mailto:cobanser...@gmail.com>> wrote:
>> 
>> you have 24M objects, not 2.4M.
>> Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
>> Add 3x replication to that, it is 4.5TB
>> 
>> On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>> >
>> > Hi Jakub,
>> >
>> > No, my setup seems to be the same as yours. Our system is mainly for 
>> > archiving loads of data. This data has to be stored forever and allow 
>> > reads, albeit seldom considering the number of objects we will store vs 
>> > the number of objects that ever will be requested.
>> >
>> > It just really seems odd that the metadata surrounding the 25M objects is 
>> > so high.
>> >
>> > We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but 
>> > I’d like to know why we are seeing what we are and how it all adds up.
>> >
>> > Thanks!
>> > Dan
>> >
>> > Get Outlook for iOS
>> >
>> >
>> >
>> > On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
>> >
>> >> Hi Dan,
>> >>
>> >> Did you configure block.wal/block.db as separate devices/partition 
>> >> (osd_scenario: non-collocated or lvm for clusters installed using 
>> >> ceph-ansbile playbooks )?
>> >>
>> >> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> >> situation - the sum of block.db partitions' size is displayed as RAW USED 
>> >> in ceph df.
>> >> Perhaps it is not the case for collocated block.db/wal.
>> >>
>> >> Jakub
>> >>
>> >> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
>> >>>
>> >>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These 
>> >>> numbers seem very high to me.
>> >>>
>> >>> Get Outlook for iOS
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
>> >>>
>> >>>> 4.65TiB includes size of wal and db partitions too.
>> >>>> On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>> >>>> >
>> >>>> > Hello,
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>> >>>> > replication).
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > I am confused by the usage ceph df is reporting and am hoping someone 
>> >>>> > can shed some light on this. Here is what I see when I run ceph df
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > GLOBAL:
>> >>>> >
>> >>>> > SIZEAVAIL   RAW USED %RAW USED
>> >>>> >
>> >>>> > 1.02PiB 1.02PiB  4.65TiB  0.44
>> >>>> >
>> >>>> > POOLS:
>> >>>> >
>> >>>> > NAME   ID USED
>> >>>> > %U

Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Sergey Malinin
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html 



> On 21.10.2018, at 16:12, Waterbly, Dan  wrote:
> 
> Awesome! Thanks Serian!
> 
> Do you know where the 64KB comes from? Can that be tuned down for a cluster 
> holding smaller objects?
> 
> Get Outlook for iOS 
> 
> 
> On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban"  > wrote:
> 
> you have 24M objects, not 2.4M.
> Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
> Add 3x replication to that, it is 4.5TB
> 
> On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
> >
> > Hi Jakub,
> >
> > No, my setup seems to be the same as yours. Our system is mainly for 
> > archiving loads of data. This data has to be stored forever and allow 
> > reads, albeit seldom considering the number of objects we will store vs the 
> > number of objects that ever will be requested.
> >
> > It just really seems odd that the metadata surrounding the 25M objects is 
> > so high.
> >
> > We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> > like to know why we are seeing what we are and how it all adds up.
> >
> > Thanks!
> > Dan
> >
> > Get Outlook for iOS
> >
> >
> >
> > On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
> >
> >> Hi Dan,
> >>
> >> Did you configure block.wal/block.db as separate devices/partition 
> >> (osd_scenario: non-collocated or lvm for clusters installed using 
> >> ceph-ansbile playbooks )?
> >>
> >> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
> >> situation - the sum of block.db partitions' size is displayed as RAW USED 
> >> in ceph df.
> >> Perhaps it is not the case for collocated block.db/wal.
> >>
> >> Jakub
> >>
> >> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
> >>>
> >>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These 
> >>> numbers seem very high to me.
> >>>
> >>> Get Outlook for iOS
> >>>
> >>>
> >>>
> >>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
> >>>
>  4.65TiB includes size of wal and db partitions too.
>  On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>  >
>  > Hello,
>  >
>  >
>  >
>  > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>  > replication).
>  >
>  >
>  >
>  > I am confused by the usage ceph df is reporting and am hoping someone 
>  > can shed some light on this. Here is what I see when I run ceph df
>  >
>  >
>  >
>  > GLOBAL:
>  >     SIZE     AVAIL    RAW USED  %RAW USED
>  >     1.02PiB  1.02PiB  4.65TiB   0.44
>  > POOLS:
>  >     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>  >     .rgw.root            1   3.30KiB  0      330TiB     17
>  >     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
>  >     default.rgw.control  3   0B       0      330TiB     8
>  >     default.rgw.meta     4   373B     0      330TiB     3
>  >     default.rgw.log      5   0B       0      330TiB     0
>  >     .rgw.control         6   0B       0      330TiB     8
>  >     .rgw.meta            7   2.18KiB  0      330TiB     12
>  >     .rgw.log             8   0B       0      330TiB     194
>  >     .rgw.buckets.index   9   0B       0      330TiB     2560
>  >
>  >
>  >
>  > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
>  > whole is reporting 4.65TiB? There is nothing else on this cluster as 
>  > it was just installed and configured.
>  >
>  >
>  >
>  > Thank you for your help with this.
>  >
>  >
>  >
>  > -Dan
>  >
>  >
>  >
>  > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
>  > dan.water...@sos.wa.gov
>  >
>  > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>  >
>  >
>  >
>  > ___
>  > ceph-users mailing list
>  > ceph-users@lists.ceph.com
>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Sergey Malinin
To Zheng Yan:
I'm wondering whether 'session reset' implies the below?


> On 18.10.2018, at 02:18, Alfredo Daniel Rezinovsky 
>  wrote:
> 
>rados -p cephfs_metadata rm mds0_openfiles.0

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Sergey Malinin
In my case I was able to bring up the fs successfully after resetting the
session table and journal and scanning links using the cephfs-data-scan tool.
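For reference, that sequence boils down to something like this (fs name and rank are
placeholders; a sketch, not a verified runbook):

cephfs-table-tool all reset session
cephfs-journal-tool --rank=<fs_name>:0 journal reset
cephfs-data-scan scan_links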


> On 18.10.2018, at 02:18, Alfredo Daniel Rezinovsky 
>  wrote:
> 
> Didn't work for me. Downgraded and the MDS won't start.
> 
> I also needed to:
> 
>rados -p cephfs_metadata rm mds0_openfiles.0
> 
> or else mds daemon crashed.
> 
> The crash info didn't show any useful information (for me). I couldn't figure 
> this out without Zheng Yan help.
> 
> 
> On 17/10/18 17:36, Paul Emmerich wrote:
>> CephFS will be offline and show up as "damaged" in ceph -s
>> The fix is to downgrade to 13.2.1 and issue a "ceph fs repaired " 
>> command.
>> 
>> 
>> Paul
>> 
>> Am Mi., 17. Okt. 2018 um 21:53 Uhr schrieb Michael Sudnick
>> :
>>> What exactly are the symptoms of the problem? I use cephfs with 13.2.2 with 
>>> two active MDS daemons and at least on the surface everything looks fine. 
>>> Is there anything I should avoid doing until 13.2.3?
>>> 
>>> On Wed, Oct 17, 2018, 14:10 Patrick Donnelly  wrote:
 On Wed, Oct 17, 2018 at 11:05 AM Alexandre DERUMIER  
 wrote:
> Hi,
> 
> Is it possible to have more infos or announce about this problem ?
> 
> I'm currently waiting to migrate from luminious to mimic, (I need new 
> quota feature for cephfs)
> 
> is it safe to upgrade to 13.2.2 ?
> 
> or better to wait to 13.2.3 ? or install 13.2.1 for now ?
 Upgrading to 13.2.1 would be safe.
 
 --
 Patrick Donnelly
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
> -- 
> Alfredo Daniel Rezinovsky
> Director de Tecnologías de Información y Comunicaciones
> Facultad de Ingeniería - Universidad Nacional de Cuyo
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Sergey Malinin
The problem is caused by an unintentional change to the on-disk format of the MDS purge
queue. If you have upgraded and didn't hit the bug, that probably means your
MDS daemon was deployed after the upgrade; otherwise it wouldn't start.
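(For anyone hitting this: the recovery path mentioned earlier in the thread is to downgrade the
MDS packages to 13.2.1 and then clear the damaged flag on the affected rank -- to my knowledge
the command that does this is "ceph mds repaired", e.g.:)

ceph mds repaired <fs_name>:0    # or just the rank number on single-fs clusters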


> On 18.10.2018, at 02:08, Виталий Филиппов  wrote:
> 
> I mean, does every upgraded installation hit this bug, or do some upgrade 
> without any problem?
> 
>> The problem occurs after upgrade, fresh 13.2.2 installs are not affected.
> 
> -- 
> With best regards,
>  Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Sergey Malinin
The problem occurs after upgrade, fresh 13.2.2 installs are not affected.


> On 17.10.2018, at 23:42, Виталий Филиппов  wrote:
> 
> By the way, does it happen with all installations or only under some 
> conditions?
> 
>> CephFS will be offline and show up as "damaged" in ceph -s
>> The fix is to downgrade to 13.2.1 and issue a "ceph fs repaired " 
>> command.
>> 
>> 
>> Paul
> 
> -- 
> With best regards,
>  Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to debug problem in MDS ?

2018-10-16 Thread Sergey Malinin
Are you running multiple active MDS daemons?
On the MDS host, issue "ceph daemon mds.X config set debug_mds 20" for maximum
logging verbosity.

> On 16.10.2018, at 19:23, Florent B  wrote:
> 
> Hi,
> 
> A few months ago I sent a message to that list about a problem with a
> Ceph + Dovecot setup.
> 
> Bug disappeared and I didn't answer to the thread.
> 
> Now the bug has come again (Luminous up-to-date cluster + Dovecot
> up-to-date + Debian Stretch up-to-date).
> 
> I know how to reproduce it, but it seems very related to my user's
> Dovecot data (few GB) and is related to file locking system (bug occurs
> when I set locking method to "fcntl" or "flock" in Dovecot, but not with
> "dotlock".
> 
> It ends to a unresponsive MDS (100% CPU hang, switching to another MDS
> but always staying at 100% CPU usage). I can't even use the admin socket
> when MDS is hanged.
> 
> I would like to know *exactly* which information do you need to
> investigate that bug ? (which commands, when, how to report large log
> files...)
> 
> Thank you.
> 
> Florent
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread Sergey Malinin
The actual amount of memory used by the VFS page cache is available through 'grep Cached
/proc/meminfo'. slabtop provides information about the inode and dentry caches
and the IO memory buffers (buffer_head).


> On 14.10.2018, at 17:28, jes...@krogh.cc wrote:
> 
>> Try looking in /proc/slabinfo / slabtop during your tests.
> 
> I need a bit of guidance here..  Does the slabinfo cover the VFS page
> cache ? .. I cannot seem to find any traces (sorting by size on
> machines with a huge cache does not really give anything). Perhaps
> I'm holding the screwdriver wrong?
> 
> -- 
> Jesper
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client - page cache being invaildated.

2018-10-14 Thread Sergey Malinin
Try looking in /proc/slabinfo / slabtop during your tests.


> On 14.10.2018, at 15:21, jes...@krogh.cc wrote:
> 
> Hi
> 
> We have a dataset of ~300 GB on CephFS which as being used for computations
> over and over agian .. being refreshed daily or similar.
> 
> When hosting it on NFS after refresh, they are transferred, but from
> there - they would be sitting in the kernel page cache of the client
> until they are refreshed serverside.
> 
> On CephFS it look "similar" but "different". Where the "steady state"
> operation over NFS would give a client/server traffic of < 1MB/s ..
> CephFS contantly pulls 50-100MB/s over the network.  This has
> implications for the clients that end up spending unnessary time waiting
> for IO in the execution.
> 
> This is in a setting where the CephFS client mem look like this:
> 
> $ free -h
>  totalusedfree  shared  buff/cache  
> available
> Mem:   377G 17G340G1.2G 19G   
> 354G
> Swap:  8.8G430M8.4G
> 
> 
> If I just repeatedly run (within a few minute) something that is using the
> files, then
> it is fully served out of client page cache (2GB'ish / s) ..  but it looks
> like
> it is being evicted way faster than in the NFS setting?
> 
> This is not scientific .. but the CMD is a cat /file/on/ceph > /dev/null -
> type on a total of 24GB data in 300'ish files.
> 
> $ free -h; time CMD ; sleep 1800; free -h; time CMD ; free -h; sleep 3600;
> time CMD ;
> 
>  totalusedfree  shared  buff/cache  
> available
> Mem:   377G 16G312G1.2G 48G   
> 355G
> Swap:  8.8G430M8.4G
> 
> real0m8.997s
> user0m2.036s
> sys 0m6.915s
>  totalusedfree  shared  buff/cache  
> available
> Mem:   377G 17G277G1.2G 82G   
> 354G
> Swap:  8.8G430M8.4G
> 
> real3m25.904s
> user0m2.794s
> sys 0m9.028s
>  totalusedfree  shared  buff/cache  
> available
> Mem:   377G 17G283G1.2G 76G   
> 353G
> Swap:  8.8G430M8.4G
> 
> real6m18.358s
> user0m2.847s
> sys 0m10.651s
> 
> 
> Munin graphs of the system confirms that there has been zero memory
> pressure over the period.
> 
> Is there things in the CephFS case that can cause the page-cache to be
> invailated?
> Could less agressive "read-ahead" play a role?
> 
> Other thoughts on what root cause on the different behaviour could be?
> 
> Clients are using 4.15 kernel.. Anyone aware of newer patches in this area
> that could impact ?
> 
> Jesper
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-10 Thread Sergey Malinin
All uncommon tasks can easily be done using basic shell scripting, so I don't
see any practical use for such an interface.


> On 10.10.2018, at 17:19, John Spray  wrote:
> 
> Hi all,
> 
> Since time immemorial, the Ceph CLI has had a mode where when run with
> no arguments, you just get an interactive prompt that lets you run
> commands without "ceph" at the start.
> 
> I recently discovered that we actually broke this in Mimic[1], and it
> seems that nobody noticed!
> 
> So the question is: does anyone actually use this feature?  It's not
> particularly expensive to maintain, but it might be nice to have one
> less path through the code if this is entirely unused.
> 
> Cheers,
> John
> 
> 1. https://github.com/ceph/ceph/pull/24521
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best version and SO for CefhFS

2018-10-10 Thread Sergey Malinin
A standby MDS is required for HA. It can be configured in standby-replay mode for
faster failover. Otherwise, a journal replay is incurred, which can take
somewhat longer.
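(A minimal pre-Nautilus-style ceph.conf sketch for a dedicated standby-replay daemon -- the
daemon name and rank are placeholders:)

[mds.<standby_name>]
mds_standby_replay = true
mds_standby_for_rank = 0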


> On 10.10.2018, at 13:57, Daniel Carrasco  wrote:
> 
> Thanks for your response.
> 
> I'll point in that direction.
> I also need fast recovery in case the MDS dies, so are standby MDS daemons 
> recommended, or is recovery fast enough to be useful?
> 
> Greetings!
> 
> On Wed, Oct 10, 2018 at 12:26, Sergey Malinin (<h...@newmail.com>) wrote:
> 
> 
>> On 10.10.2018, at 10:49, Daniel Carrasco <d.carra...@i2tic.com> wrote:
>> 
>> Which is the best configuration to avoid those MDS problems?
> Single active MDS with lots of RAM.
> 
> 
> 
> -- 
> _
> 
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com <http://www.i2tic.com/>
> _

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best version and SO for CefhFS

2018-10-10 Thread Sergey Malinin


> On 10.10.2018, at 10:49, Daniel Carrasco  wrote:
> 
> Which is the best configuration to avoid those MDS problems?
Single active MDS with lots of RAM.
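(If the RAM is available, the relevant knob is the MDS cache size -- a sketch with an
illustrative 32 GiB value; the default is 1 GiB:)

[mds]
mds_cache_memory_limit = 34359738368   # bytes, i.e. 32 GiB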

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs still core dumping

2018-10-09 Thread Sergey Malinin
scan_links has finished and now I'm able to start MDS with a bunch of 'failed 
to open ino' and 'bad backtrace' log entries, but at least MDS no longer 
segfaults and I can mount the fs.


> On 9.10.2018, at 02:22, Sergey Malinin  wrote:
> 
> I was able to start MDS 13.2.1 when I had imported journal, ran 
> recover_dentries, reset journal, reset session table, and did ceph fs reset.
> However, I got about 1000 errors in log like bad backtrace, loaded dup inode, 
> etc. and it eventually failed on assert(stray_in->inode.nlink >= 1) right 
> after becoming 'active'.
> I'm doing scan_links to give it another try.
> 
> 
>> On 8.10.2018, at 23:43, Alfredo Daniel Rezinovsky > <mailto:alfrenov...@gmail.com>> wrote:
>> 
>> 
>> 
>> On 08/10/18 17:41, Sergey Malinin wrote:
>>> 
>>>> On 8.10.2018, at 23:23, Alfredo Daniel Rezinovsky >>> <mailto:alfrenov...@gmail.com>> wrote:
>>>> 
>>>> I need the data, even if it's read only.
>>> 
>>> After full data scan you should have been able to boot mds 13.2.2 and mount 
>>> the fs.
>> The problem started with the upgrade to 13.2.2. I downgraded to 13.2.1 and 
>> Yan Zhen told.
>> 
>> mds reports problems with the journals, and even reseting the journals MDS 
>> wont start.
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] vfs_ceph ignoring quotas

2018-10-09 Thread Sergey Malinin
cat /path/to/dir | grep rbytes | awk '{print $2}'
assuming the cephfs mount has the dirstat option.

Here is the script I used in my environment:

#!/bin/sh
# Samba "dfree command" helper: prints "<total> <available>" for the share path.
# Expects a .quota file in the share root (quota value in the same KB-style units
# that df -k reports) and a CephFS mount with the dirstat option so that reading
# the directory exposes rbytes.

if [ -f "$1/.quota" ]; then
    TOTAL=`cat "$1/.quota"`
    # rbytes is the recursive byte count of the directory from the dirstat output
    USED=`cat "$1" | grep rbytes | awk '{print $2}'`
    USED=$((USED/1000))
    echo $TOTAL $((TOTAL-USED))
else
    # no quota file: fall back to the size/available of the whole filesystem
    df -k "$1" | tail -1 | awk '{print $2,$4}'
fi

> On 9.10.2018, at 13:50, Felix Stolte  wrote:
> 
> That's bad news, but maybe there is a workaround. Samba offers the 
> opportunity to define a custom df command. If I could extract the current 
> utilization or size of a directory with a quota, I think I should be able to 
> write a little df command. The quota is stored as an extended attribute; 
> where can I get its utilization?
> 
> Regards Felix
> 
> 
> On 10/09/2018 10:59 AM, John Spray wrote:
>> On Tue, Oct 9, 2018 at 9:14 AM Felix Stolte  wrote:
>>> Hi folks,
>>> 
>>> i'm running a luminous cluster on Ubuntu 18.04 an want to share folders
>>> on cephfs with samba using the vfs_ceph. Sharing works fine, but the
>>> quotas I set on the directories is ignored and every share reports its
>>> size as the total size of the cephfs. Anyone got this working? Or is the
>>> vfs lacking quota support?
>> Looks like the Samba VFS is always mounting CephFS at the root[1],
>> including if you're exporting a subdir (nfs ganesha used to be this
>> way too) -- CephFS can only do the "magic" quota-based statfs if it
>> knows it's mounting a particular subdir.
>> 
>> John
>> 
>> 1. 
>> https://github.com/samba-team/samba/blob/master/source3/modules/vfs_ceph.c#L123
>> 
>>> Regards Felix
>>> 
>>> --
>>> Forschungszentrum Jülich GmbH
>>> 52425 Jülich
>>> Sitz der Gesellschaft: Jülich
>>> Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498
>>> Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher
>>> Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>>> Prof. Dr. Sebastian M. Schmidt
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
> Felix Stolte
> IT-Services
> Tel.: +49 2461 61-9243
> Email: f.sto...@fz-juelich.de 
> 
> Forschungszentrum Jülich GmbH
> 52425 Jülich
> Sitz der Gesellschaft: Jülich
> Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher
> Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs still core dumping

2018-10-08 Thread Sergey Malinin
I was able to start MDS 13.2.1 once I had imported the journal, run
recover_dentries, reset the journal, reset the session table, and done a ceph fs reset.
However, I got about 1000 errors in the log, like bad backtrace, loaded dup inode,
etc., and it eventually failed on assert(stray_in->inode.nlink >= 1) right after
becoming 'active'.
I'm running scan_links to give it another try.


> On 8.10.2018, at 23:43, Alfredo Daniel Rezinovsky  
> wrote:
> 
> 
> 
> On 08/10/18 17:41, Sergey Malinin wrote:
>> 
>>> On 8.10.2018, at 23:23, Alfredo Daniel Rezinovsky >> <mailto:alfrenov...@gmail.com>> wrote:
>>> 
>>> I need the data, even if it's read only.
>> 
>> After full data scan you should have been able to boot mds 13.2.2 and mount 
>> the fs.
> The problem started with the upgrade to 13.2.2. I downgraded to 13.2.1 and 
> Yan Zhen told.
> 
> mds reports problems with the journals, and even reseting the journals MDS 
> wont start.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs still core dumping

2018-10-08 Thread Sergey Malinin
... and cephfs-table-tool  reset session ?

> On 9.10.2018, at 01:32, Sergey Malinin  wrote:
> 
> Have you tried to recover dentries and then reset the journal?
> 
> 
>> On 8.10.2018, at 23:43, Alfredo Daniel Rezinovsky > <mailto:alfrenov...@gmail.com>> wrote:
>> 
>> 
>> 
>> On 08/10/18 17:41, Sergey Malinin wrote:
>>> 
>>>> On 8.10.2018, at 23:23, Alfredo Daniel Rezinovsky >>> <mailto:alfrenov...@gmail.com>> wrote:
>>>> 
>>>> I need the data, even if it's read only.
>>> 
>>> After full data scan you should have been able to boot mds 13.2.2 and mount 
>>> the fs.
>> The problem started with the upgrade to 13.2.2. I downgraded to 13.2.1 and 
>> Yan Zhen told.
>> 
>> mds reports problems with the journals, and even reseting the journals MDS 
>> wont start.
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs still core dumping

2018-10-08 Thread Sergey Malinin
Have you tried to recover dentries and then reset the journal?


> On 8.10.2018, at 23:43, Alfredo Daniel Rezinovsky  
> wrote:
> 
> 
> 
> On 08/10/18 17:41, Sergey Malinin wrote:
>> 
>>> On 8.10.2018, at 23:23, Alfredo Daniel Rezinovsky >> <mailto:alfrenov...@gmail.com>> wrote:
>>> 
>>> I need the data, even if it's read only.
>> 
>> After full data scan you should have been able to boot mds 13.2.2 and mount 
>> the fs.
> The problem started with the upgrade to 13.2.2. I downgraded to 13.2.1 and 
> Yan Zhen told.
> 
> mds reports problems with the journals, and even reseting the journals MDS 
> wont start.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs still core dumping

2018-10-08 Thread Sergey Malinin

> On 8.10.2018, at 23:23, Alfredo Daniel Rezinovsky  
> wrote:
> 
> I need the data, even if it's read only.

After a full data scan you should have been able to boot mds 13.2.2 and mount the
fs.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin

> On 8.10.2018, at 16:07, Alfredo Daniel Rezinovsky  
> wrote:
> 
> So I can stop cephfs-data-scan, run the import, downgrade, and then reset 
> the purge queue?

I suggest that you back up the metadata pool so that in case of failure you can 
continue with the data scan from where you stopped.
I've read somewhere that the backup must be done using rados export rather than 
cppool in order to keep omap keys.
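
A minimal sketch of such a backup, assuming the metadata pool is named 
cephfs_metadata and there is enough local disk space for the dump (pool name and 
paths are illustrative):

rados -p cephfs_metadata export /backup/cephfs_metadata.dump
rados -p cephfs_metadata import /backup/cephfs_metadata.dump   # restore, only if ever needed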

> 
> Please remind me of the commands:
> I've been 3 days without sleep, and I don't want to break it further.

Lucky man, I've been struggling with it for almost 2 weeks.



Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin



> On 8.10.2018, at 12:37, Yan, Zheng  wrote:
> 
> On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin  wrote:
>> 
>> What additional steps need to be taken in order to (try to) regain access to 
>> the fs, provided that I backed up the metadata pool, created an alternate metadata 
>> pool, and ran scan_extents, scan_links, scan_inodes, and a somewhat recursive 
>> scrub?
>> After that I only mounted the fs read-only to back up the data.
>> Would anything even work if I had the mds journal and purge queue truncated?
>> 
> 
> Did you back up the whole metadata pool? Did you make any modification
> to the original metadata pool? If you did, what modifications?

I backed up both the journal and the purge queue and used cephfs-journal-tool to 
recover dentries, then reset the journal and purge queue on the original metadata 
pool.
Before proceeding to alternate metadata pool recovery I was able to start the MDS, 
but it soon failed throwing lots of 'loaded dup inode' errors; I'm not sure whether 
that involved changing anything in the pool.
I have left the original metadata pool untouched since then.
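
The purge queue part roughly corresponds to commands like these (a sketch; the 
backup path is illustrative):

cephfs-journal-tool --journal=purge_queue journal export /backup/purge_queue.bin
cephfs-journal-tool --journal=purge_queue journal inspect
cephfs-journal-tool --journal=purge_queue journal reset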


> 
> Yan, Zheng
> 
>> 
>>> On 8.10.2018, at 05:15, Yan, Zheng  wrote:
>>> 
>>> Sorry, this is caused by a wrong backport. Downgrading the mds to 13.2.1 and
>>> marking the mds repaired can resolve this.
>>> 
>>> Yan, Zheng
>>> On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin  wrote:
>>>> 
>>>> Update:
>>>> I discovered http://tracker.ceph.com/issues/24236 and 
>>>> https://github.com/ceph/ceph/pull/22146
>>>> Make sure that it is not relevant in your case before proceeding to 
>>>> operations that modify on-disk data.
>>>> 
>>>> 
>>>> On 6.10.2018, at 03:17, Sergey Malinin  wrote:
>>>> 
>>>> I ended up rescanning the entire fs using the alternate metadata pool approach 
>>>> as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>>>> The process has not completed yet because during the recovery our cluster 
>>>> encountered another problem with OSDs that I got fixed yesterday (thanks 
>>>> to Igor Fedotov @ SUSE).
>>>> The first stage (scan_extents) completed in 84 hours (120M objects in the data 
>>>> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted 
>>>> by the OSD failure, so I have no timing stats, but it seems to be running 2-3 
>>>> times faster than the extents scan.
>>>> As to the root cause -- in my case I recall that during the upgrade I had 
>>>> forgotten to restart 3 OSDs, one of which was holding metadata pool 
>>>> contents, before restarting the MDS daemons, and that seemed to have had an 
>>>> impact on MDS journal corruption, because when I restarted those OSDs, the 
>>>> MDS was able to start up but soon failed throwing lots of 'loaded dup inode' 
>>>> errors.
>>>> 
>>>> 
>>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky  
>>>> wrote:
>>>> 
>>>> Same problem...
>>>> 
>>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>>>> Overall journal integrity: DAMAGED
>>>> Objects missing:
>>>> 0x16c
>>>> Corrupt regions:
>>>> 0x5b00-
>>>> 
>>>> Just after upgrade to 13.2.2
>>>> 
>>>> Did you fix it?
>>>> 
>>>> 
>>>> On 26/09/18 13:05, Sergey Malinin wrote:
>>>> 
>>>> Hello,
>>>> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>>> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
>>>> damaged. Resetting purge_queue does not seem to work well as journal still 
>>>> appears to be damaged.
>>>> Can anybody help?
>>>> 
>>>> mds log:
>>>> 
>>>> -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to 
>>>> version 586 from mon.2
>>>> -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i 
>>>> am now mds.0.583
>>>> -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
>>>> state change up:rejoin --> up:active
>>>> -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
>>>> successful recovery!
>>>> 
>>>>  -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
>>>> Decode error at re

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin
What additional steps need to be taken in order to (try to) regain access to 
the fs, provided that I backed up the metadata pool, created an alternate metadata 
pool, and ran scan_extents, scan_links, scan_inodes, and a somewhat recursive 
scrub?
After that I only mounted the fs read-only to back up the data.
Would anything even work if I had the mds journal and purge queue truncated?
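
For the scrub step, something along these lines can be issued via the MDS admin 
socket (a sketch; the daemon name is a placeholder and the exact flags may vary by 
release):

ceph daemon mds.<active_mds> scrub_path / recursive repair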


> On 8.10.2018, at 05:15, Yan, Zheng  wrote:
> 
> Sorry, this is caused by a wrong backport. Downgrading the mds to 13.2.1 and
> marking the mds repaired can resolve this.
> 
> Yan, Zheng
> On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin  wrote:
>> 
>> Update:
>> I discovered http://tracker.ceph.com/issues/24236 and 
>> https://github.com/ceph/ceph/pull/22146
>> Make sure that it is not relevant in your case before proceeding to 
>> operations that modify on-disk data.
>> 
>> 
>> On 6.10.2018, at 03:17, Sergey Malinin  wrote:
>> 
>> I ended up rescanning the entire fs using the alternate metadata pool approach 
>> as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>> The process has not completed yet because during the recovery our cluster 
>> encountered another problem with OSDs that I got fixed yesterday (thanks to 
>> Igor Fedotov @ SUSE).
>> The first stage (scan_extents) completed in 84 hours (120M objects in the data 
>> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
>> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
>> faster than the extents scan.
>> As to the root cause -- in my case I recall that during the upgrade I had forgotten 
>> to restart 3 OSDs, one of which was holding metadata pool contents, before 
>> restarting the MDS daemons, and that seemed to have had an impact on MDS journal 
>> corruption, because when I restarted those OSDs, the MDS was able to start up 
>> but soon failed throwing lots of 'loaded dup inode' errors.
>> 
>> 
>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky  
>> wrote:
>> 
>> Same problem...
>> 
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>> Overall journal integrity: DAMAGED
>> Objects missing:
>>  0x16c
>> Corrupt regions:
>>  0x5b00-
>> 
>> Just after upgrade to 13.2.2
>> 
>> Did you fix it?
>> 
>> 
>> On 26/09/18 13:05, Sergey Malinin wrote:
>> 
>> Hello,
>> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
>> damaged. Resetting purge_queue does not seem to work well as journal still 
>> appears to be damaged.
>> Can anybody help?
>> 
>> mds log:
>> 
>>  -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to 
>> version 586 from mon.2
>>  -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am 
>> now mds.0.583
>>  -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
>> state change up:rejoin --> up:active
>>  -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
>> successful recovery!
>> 
>>   -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
>> Decode error at read_pos=0x322ec6636
>>   -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
>> set_want_state: up:active -> down:damaged
>>   -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
>> down:damaged seq 137
>>   -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message 
>> to mon.ceph3 at mon:6789/0
>>   -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
>> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
>> 0x563b321ad480 con 0
>> 
>>-3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
>> mon:6789/0 conn(0x563b3213e000 :-1 
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 
>> 29 0x563b321ab880 mdsbeaco
>> n(85106/mds2 down:damaged seq 311 v587) v7
>>-2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
>> mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
>>  129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
>> 000
>>-1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
>> handle_mds_beacon down:damaged seq 311 rtt 0.038261
>> 0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>> 
>

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-07 Thread Sergey Malinin
I was able to start the MDS and mount the fs with broken ownership/permissions and 
8k out of millions of files in lost+found.


> On 7.10.2018, at 02:04, Sergey Malinin  wrote:
> 
> I'm at scan_links now, will post an update once it has finished.
> Have you reset the journal after fs recovery as suggested in the doc?
> 
> quote:
> 
> If the damaged filesystem contains dirty journal data, it may be recovered 
> next with:
> 
> cephfs-journal-tool --rank=:0 event 
> recover_dentries list --alternate-pool recovery
> cephfs-journal-tool --rank recovery-fs:0 journal reset --force
> 
> 
>> On 7.10.2018, at 00:36, Alfredo Daniel Rezinovsky > <mailto:alfrenov...@gmail.com>> wrote:
>> 
>> I did something wrong in the upgrade restart also...
>> 
>> after rescanning with:
>> 
>> cephfs-data-scan scan_extents cephfs_data (with threads)
>> 
>> cephfs-data-scan scan_inodes cephfs_data (with threads)
>> 
>> cephfs-data-scan scan_links
>> 
>> My MDS still crashes and won't replay.
>>  1: (()+0x3ec320) [0x55b0e2bd2320]
>>  2: (()+0x12890) [0x7fc3adce3890]
>>  3: (gsignal()+0xc7) [0x7fc3acddbe97]
>>  4: (abort()+0x141) [0x7fc3acddd801]
>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x250) [0x7fc3ae3cc080]
>>  6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
>>  7: (()+0x21eb27) [0x55b0e2a04b27]
>>  8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, 
>> snapid_t)+0xc0) [0x55b0e2a04d40]
>>  9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned 
>> long, utime_t)+0x91d) [0x55b0e2a6a0fd]
>>  10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) 
>> [0x55b0e2a3ca2f]
>>  11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
>>  12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
>>  13: (Context::complete(int)+0x9) [0x55b0e28e9719]
>>  14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
>>  15: (()+0x76db) [0x7fc3adcd86db]
>>  16: (clone()+0x3f) [0x7fc3acebe88f]
>> 
>> Did you do something else before starting the MDSs again?
>> 
>> On 05/10/18 21:17, Sergey Malinin wrote:
>>> I ended up rescanning the entire fs using the alternate metadata pool approach 
>>> as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>>> The process has not completed yet because during the recovery our cluster 
>>> encountered another problem with OSDs that I got fixed yesterday (thanks to 
>>> Igor Fedotov @ SUSE).
>>> The first stage (scan_extents) completed in 84 hours (120M objects in the data 
>>> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
>>> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
>>> faster than the extents scan.
>>> As to the root cause -- in my case I recall that during the upgrade I had forgotten 
>>> to restart 3 OSDs, one of which was holding metadata pool contents, before 
>>> restarting the MDS daemons, and that seemed to have had an impact on MDS journal 
>>> corruption, because when I restarted those OSDs, the MDS was able to start up 
>>> but soon failed throwing lots of 'loaded dup inode' errors.
>>> 
>>> 
>>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky >>> <mailto:alfrenov...@gmail.com>> wrote:
>>>> 
>>>> Same problem...
>>>> 
>>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>>>> Overall journal integrity: DAMAGED
>>>> Objects missing:
>>>>   0x16c
>>>> Corrupt regions:
>>>>   0x5b00-
>>>> 
>>>> Just after upgrade to 13.2.2
>>>> 
>>>> Did you fix it?
>>>> 
>>>> 
>>>> On 26/09/18 13:05, Sergey Malinin wrote:
>>>>> Hello,
>>>>> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>>>> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
>>>>> damaged. Resetting purge_queue does not seem to work well as journal 
>>>>> still appears to be damaged.
>>>>> Can anybody help?
>>>>> 
>>>>> mds log:
>>>>> 
>>>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
>>>>> to version 586 from mon.2
>>>>>   -788> 2018-09-26 1

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-06 Thread Sergey Malinin
I'm at scan_links now, will post an update once it has finished.
Have you reset the journal after fs recovery as suggested in the doc?

quote:

If the damaged filesystem contains dirty journal data, it may be recovered next 
with:

cephfs-journal-tool --rank=:0 event recover_dentries 
list --alternate-pool recovery
cephfs-journal-tool --rank recovery-fs:0 journal reset --force
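
Filled in with concrete names (illustrative only: here the original filesystem is 
called 'cephfs', with the recovery pool 'recovery' and recovery filesystem 
'recovery-fs' as in the doc), that would look roughly like:

cephfs-journal-tool --rank=cephfs:0 event recover_dentries list --alternate-pool recovery
cephfs-journal-tool --rank recovery-fs:0 journal reset --force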


> On 7.10.2018, at 00:36, Alfredo Daniel Rezinovsky  
> wrote:
> 
> I did something wrong in the upgrade restart also...
> 
> after rescanning with:
> 
> cephfs-data-scan scan_extents cephfs_data (with threads)
> 
> cephfs-data-scan scan_inodes cephfs_data (with threads)
> 
> cephfs-data-scan scan_links
> 
> My MDS still crashes and won't replay.
>  1: (()+0x3ec320) [0x55b0e2bd2320]
>  2: (()+0x12890) [0x7fc3adce3890]
>  3: (gsignal()+0xc7) [0x7fc3acddbe97]
>  4: (abort()+0x141) [0x7fc3acddd801]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x250) [0x7fc3ae3cc080]
>  6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
>  7: (()+0x21eb27) [0x55b0e2a04b27]
>  8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, 
> snapid_t)+0xc0) [0x55b0e2a04d40]
>  9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned 
> long, utime_t)+0x91d) [0x55b0e2a6a0fd]
>  10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) 
> [0x55b0e2a3ca2f]
>  11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
>  12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
>  13: (Context::complete(int)+0x9) [0x55b0e28e9719]
>  14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
>  15: (()+0x76db) [0x7fc3adcd86db]
>  16: (clone()+0x3f) [0x7fc3acebe88f]
> 
> Did you do something else before starting the MDSs again?
> 
> On 05/10/18 21:17, Sergey Malinin wrote:
>> I ended up rescanning the entire fs using the alternate metadata pool approach 
>> as in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>> The process has not completed yet because during the recovery our cluster 
>> encountered another problem with OSDs that I got fixed yesterday (thanks to 
>> Igor Fedotov @ SUSE).
>> The first stage (scan_extents) completed in 84 hours (120M objects in the data 
>> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
>> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
>> faster than the extents scan.
>> As to the root cause -- in my case I recall that during the upgrade I had forgotten 
>> to restart 3 OSDs, one of which was holding metadata pool contents, before 
>> restarting the MDS daemons, and that seemed to have had an impact on MDS journal 
>> corruption, because when I restarted those OSDs, the MDS was able to start up 
>> but soon failed throwing lots of 'loaded dup inode' errors.
>> 
>> 
>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky >> <mailto:alfrenov...@gmail.com>> wrote:
>>> 
>>> Same problem...
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0x16c
>>> Corrupt regions:
>>>   0x5b00-
>>> 
>>> Just after upgrade to 13.2.2
>>> 
>>> Did you fix it?
>>> 
>>> 
>>> On 26/09/18 13:05, Sergey Malinin wrote:
>>>> Hello,
>>>> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>>> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
>>>> damaged. Resetting purge_queue does not seem to work well as journal still 
>>>> appears to be damaged.
>>>> Can anybody help?
>>>> 
>>>> mds log:
>>>> 
>>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
>>>> to version 586 from mon.2
>>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i 
>>>> am now mds.0.583
>>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
>>>> state change up:rejoin --> up:active
>>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
>>>> successful recovery!
>>>> 
>>>>-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue 
>>>> _consume: Decode error at read_pos=0x322ec6636
>>>>-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.

Re: [ceph-users] list admin issues

2018-10-06 Thread Sergey Malinin
Same here. Gmail + own domain.
On Oct 6, 2018, 20:52 +0300, Tren Blackburn , wrote:
> And same here. I'm glad I'm not the only one this is happening to.
>
> t.
>
> On October 6, 2018 at 10:50:10 AM, David C (dcsysengin...@gmail.com) wrote:
> > Same issue here, Gmail user, member of different lists but only get 
> > disabled on ceph-users. Happens about once a month but had three in Sept.
> >
> > > On Sat, 6 Oct 2018, 18:28 Janne Johansson,  wrote:
> > > > On Sat, 6 Oct 2018 at 15:06, Elias Abacioglu
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm bumping this old thread because it's getting annoying. My 
> > > > > membership gets disabled twice a month.
> > > > > Between my two Gmail accounts I'm in more than 25 mailing lists and I 
> > > > > see this behavior only here. Why is only ceph-users affected? 
> > > > > Maybe Christian was on to something, is this intentional?
> > > > > The reality is that there are a lot of ceph-users with Gmail accounts, 
> > > > > so perhaps it wouldn't be so bad to actually try to figure this one 
> > > > > out?
> > > > >
> > > > > So can the maintainers of this list please investigate what actually 
> > > > > gets bounced? Look at my address if you want.
> > > > > I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most 
> > > > > recently.
> > > > > Please help!
> > > >
> > > > Same here.
> > > >
> > > >
> > > > --
> > > > May the most significant bit of your life be positive.


Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-05 Thread Sergey Malinin
Update:
I discovered http://tracker.ceph.com/issues/24236 and 
https://github.com/ceph/ceph/pull/22146
Make sure that it is not relevant in your case before proceeding to operations 
that modify on-disk data.


> On 6.10.2018, at 03:17, Sergey Malinin  wrote:
> 
> I ended up rescanning the entire fs using the alternate metadata pool approach as 
> in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> The process has not completed yet because during the recovery our cluster 
> encountered another problem with OSDs that I got fixed yesterday (thanks to 
> Igor Fedotov @ SUSE).
> The first stage (scan_extents) completed in 84 hours (120M objects in the data 
> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
> the OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
> faster than the extents scan.
> As to the root cause -- in my case I recall that during the upgrade I had forgotten 
> to restart 3 OSDs, one of which was holding metadata pool contents, before 
> restarting the MDS daemons, and that seemed to have had an impact on MDS journal 
> corruption, because when I restarted those OSDs, the MDS was able to start up but 
> soon failed throwing lots of 'loaded dup inode' errors.
> 
> 
>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky > <mailto:alfrenov...@gmail.com>> wrote:
>> 
>> Same problem...
>> 
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>> Overall journal integrity: DAMAGED
>> Objects missing:
>>   0x16c
>> Corrupt regions:
>>   0x5b00-
>> 
>> Just after upgrade to 13.2.2
>> 
>> Did you fix it?
>> 
>> 
>> On 26/09/18 13:05, Sergey Malinin wrote:
>>> Hello,
>>> Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>> After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
>>> damaged. Resetting purge_queue does not seem to work well as journal still 
>>> appears to be damaged.
>>> Can anybody help?
>>> 
>>> mds log:
>>> 
>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
>>> to version 586 from mon.2
>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i 
>>> am now mds.0.583
>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
>>> state change up:rejoin --> up:active
>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
>>> successful recovery!
>>> 
>>>-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
>>> Decode error at read_pos=0x322ec6636
>>>-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
>>> set_want_state: up:active -> down:damaged
>>>-36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
>>> down:damaged seq 137
>>>-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: 
>>> _send_mon_message to mon.ceph3 at mon:6789/0
>>>-34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
>>> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
>>> 0x563b321ad480 con 0
>>> 
>>> -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
>>> mon:6789/0 conn(0x563b3213e000 :-1 
>>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 
>>> 29 0x563b321ab880 mdsbeaco
>>> n(85106/mds2 down:damaged seq 311 v587) v7
>>> -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
>>> mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
>>>  129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
>>> 000
>>> -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
>>> handle_mds_beacon down:damaged seq 311 rtt 0.038261
>>>  0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x322ec65d9-
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal reset
>>> old journal was 13470819801~8463
>>> new journal start will be 13472104448 (1276184 bytes past old end)
>>> writing journal head
>>> done
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.0c8c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0xc8c
>>> Corrupt regions:
>>>   0x32300-
> 



Re: [ceph-users] Inconsistent directory content in cephfs

2018-10-05 Thread Sergey Malinin
Are you sure these mounts (work/06 and work/6c) refer to the same directory?
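
If in doubt, comparing the inode numbers of the two paths quoted below would 
settle it -- a sketch, assuming GNU stat is available on both clients:

stat -c '%i %n' /ceph/sge-tmp/db/work/06
stat -c '%i %n' /ceph/sge-tmp/db/work/6c

The same CephFS directory should report the same inode number from every client.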

> On 5.10.2018, at 13:57, Burkhard Linke 
>  wrote:
> 
> root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
...
> root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l



Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Finally goodness happened!
I applied the PR and ran repair on the OSD left unmodified after the initial 
failure. It went through without any errors, and now I'm able to fuse-mount the 
OSD and export PGs off it using ceph-objectstore-tool. Just in order not to mess 
it up, I haven't started ceph-osd until I have the PGs backed up.
Cheers Igor, you're the best!
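
For the record, those steps look roughly like this (a sketch; the OSD id, PG id and 
paths are illustrative, the OSD daemon must be stopped, and the fuse op requires a 
ceph-objectstore-tool built with FUSE support):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op fuse --mountpoint /mnt/osd-1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op export --pgid 2.1a --file /backup/2.1a.export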


> On 3.10.2018, at 14:39, Igor Fedotov  wrote:
> 
> To fix this specific issue please apply the following PR: 
> https://github.com/ceph/ceph/pull/24339
> 
> This wouldn't fix the original issue, but just in case please try to run repair 
> again. I will need a log if the error is different from the ENOSPC in your latest 
> email.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/3/2018 1:58 PM, Sergey Malinin wrote:
>> Repair has gone farther but failed on something different - this time it 
>> appears to be related to store inconsistency rather than lack of free space. 
>> Emailed log to you, beware: over 2GB uncompressed.
>> 
>> 
>>> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
>>> 
>>> You may want to try new updates from the PR along with disabling flush on 
>>> recovery for rocksdb (avoid_flush_during_recovery parameter).
>>> 
>>> Full cmd line might looks like:
>>> 
>>> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
>>> bin/ceph-bluestore-tool --path  repair
>>> 
>>> 
>>> To be applied for "non-expanded" OSDs where repair didn't pass.
>>> 
>>> Please collect a log during repair...
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>>>> Repair goes through only when LVM volume has been expanded, otherwise it 
>>>> fails with enospc as well as any other operation. However, expanding the 
>>>> volume immediately renders bluefs unmountable with IO error.
>>>> 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very 
>>>> end of bluefs-log-dump), I'm not sure whether corruption occurred before 
>>>> or after volume expansion.
>>>> 
>>>> 
>>>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>>>> 
>>>>> You mentioned repair had worked before, is that correct? What's the 
>>>>> difference now except the applied patch? Different OSD? Anything else?
>>>>> 
>>>>> 
>>>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>>>> 
>>>>>> It didn't work, emailed logs to you.
>>>>>> 
>>>>>> 
>>>>>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>>>>>> 
>>>>>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>>>>>> bluefs_rebalance_txn assignment..
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>>>>>>> PR doesn't seem to have changed since yesterday. Am I missing 
>>>>>>>> something?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>>>>>>>> 
>>>>>>>>> Please update the patch from the PR - it didn't update bluefs extents 
>>>>>>>>> list before.
>>>>>>>>> 
>>>>>>>>> Also please set debug bluestore 20 when re-running repair and collect 
>>>>>>>>> the log.
>>>>>>>>> 
>>>>>>>>> If repair doesn't help - would you send repair and startup logs 
>>>>>>>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Igor
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>>>>>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>>>>>>>>>> backed up OSDs so now I have more room to play.
>>>>>>>>>> I posted log files using ceph-post-file with the following IDs:
>>>>>>>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>>>>>&g

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Update:
I rebuilt ceph-osd with the latest PR and it started, worked for a few minutes, 
and eventually failed on enospc.
After that, ceph-bluestore-tool repair started to fail on enospc again. I was 
unable to collect the ceph-osd log, so I emailed you the most recent repair log.
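
In case it helps anyone else, the rebuild was along these lines (a sketch; the 
local branch name, build targets and cherry-picked commit ids are illustrative, 
the commits being the ones from Igor's PR):

git fetch https://github.com/ceph/ceph.git pull/24353/head:pr-24353
git checkout v13.2.2
git cherry-pick <commit ids from pr-24353>   # resolve conflicts by hand if needed
./do_cmake.sh
cd build && make ceph-osd ceph-bluestore-tool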



> On 3.10.2018, at 13:58, Sergey Malinin  wrote:
> 
> Repair has gone farther but failed on something different - this time it 
> appears to be related to store inconsistency rather than lack of free space. 
> Emailed log to you, beware: over 2GB uncompressed.
> 
> 
>> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
>> 
>> You may want to try new updates from the PR along with disabling flush on 
>> recovery for rocksdb (avoid_flush_during_recovery parameter).
>> 
>> The full cmd line might look like:
>> 
>> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
>> bin/ceph-bluestore-tool --path  repair
>> 
>> 
>> To be applied for "non-expanded" OSDs where repair didn't pass.
>> 
>> Please collect a log during repair...
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>>> Repair goes through only when LVM volume has been expanded, otherwise it 
>>> fails with enospc as well as any other operation. However, expanding the 
>>> volume immediately renders bluefs unmountable with IO error.
>>> 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very 
>>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>>> after volume expansion.
>>> 
>>> 
>>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>>> 
>>>> You mentioned repair had worked before, is that correct? What's the 
>>>> difference now except the applied patch? Different OSD? Anything else?
>>>> 
>>>> 
>>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>>> 
>>>>> It didn't work, emailed logs to you.
>>>>> 
>>>>> 
>>>>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>>>>> 
>>>>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>>>>> bluefs_rebalance_txn assignment..
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>>>>>> PR doesn't seem to have changed since yesterday. Am I missing something?
>>>>>>> 
>>>>>>> 
>>>>>>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>>>>>>> 
>>>>>>>> Please update the patch from the PR - it didn't update bluefs extents 
>>>>>>>> list before.
>>>>>>>> 
>>>>>>>> Also please set debug bluestore 20 when re-running repair and collect 
>>>>>>>> the log.
>>>>>>>> 
>>>>>>>> If repair doesn't help - would you send repair and startup logs 
>>>>>>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Igor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>>>>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>>>>>>>>> backed up OSDs so now I have more room to play.
>>>>>>>>> I posted log files using ceph-post-file with the following IDs:
>>>>>>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>>>>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>>>>>>>>> 
>>>>>>>>>> You did repair for any of this OSDs, didn't you? For all of them?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Would you please provide the log for both types (failed on mount and 
>>>>>>>>>> failed with enospc) of failing OSDs. Prior to collecting please 
>>>>>>>>>> remove existing ones prior and set debug bluestore to 20.
>>>>>>>>>> 
>>>>

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Repair has gone farther but failed on something different - this time it 
appears to be related to store inconsistency rather than lack of free space. 
Emailed log to you, beware: over 2GB uncompressed.


> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
> 
> You may want to try new updates from the PR along with disabling flush on 
> recovery for rocksdb (avoid_flush_during_recovery parameter).
> 
> The full cmd line might look like:
> 
> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
> bin/ceph-bluestore-tool --path  repair
> 
> 
> To be applied for "non-expanded" OSDs where repair didn't pass.
> 
> Please collect a log during repair...
> 
> 
> Thanks,
> 
> Igor
> 
> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>> Repair goes through only when LVM volume has been expanded, otherwise it 
>> fails with enospc as well as any other operation. However, expanding the 
>> volume immediately renders bluefs unmountable with IO error.
>> 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very 
>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>> after volume expansion.
>> 
>> 
>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>> 
>>> You mentioned repair had worked before, is that correct? What's the 
>>> difference now except the applied patch? Different OSD? Anything else?
>>> 
>>> 
>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>> 
>>>> It didn't work, emailed logs to you.
>>>> 
>>>> 
>>>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>>>> 
>>>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>>>> bluefs_rebalance_txn assignment..
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>>>>> PR doesn't seem to have changed since yesterday. Am I missing something?
>>>>>> 
>>>>>> 
>>>>>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>>>>>> 
>>>>>>> Please update the patch from the PR - it didn't update bluefs extents 
>>>>>>> list before.
>>>>>>> 
>>>>>>> Also please set debug bluestore 20 when re-running repair and collect 
>>>>>>> the log.
>>>>>>> 
>>>>>>> If repair doesn't help - would you send repair and startup logs 
>>>>>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>>>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>>>>>>>> backed up OSDs so now I have more room to play.
>>>>>>>> I posted log files using ceph-post-file with the following IDs:
>>>>>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>>>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>>>>>>>> 
>>>>>>>>> You did repair for any of this OSDs, didn't you? For all of them?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Would you please provide the log for both types (failed on mount and 
>>>>>>>>> failed with enospc) of failing OSDs. Prior to collecting please 
>>>>>>>>> remove existing ones prior and set debug bluestore to 20.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>>>>>> I was able to apply patches to mimic, but nothing changed. One osd 
>>>>>>>>>> that I had space expanded on fails with bluefs mount IO error, 
>>>>>>>>>> others keep failing with enospc.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>>>>>>>>>> 
>>>>>>>>>

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Sent download link by email. verbosity=10, over 900M uncompressed.


> On 2.10.2018, at 16:52, Igor Fedotov  wrote:
> 
> May I have a repair log for that "already expanded" OSD?
> 
> 
> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>> Repair goes through only when LVM volume has been expanded, otherwise it 
>> fails with enospc as well as any other operation. However, expanding the 
>> volume immediately renders bluefs unmountable with IO error.
>> 2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very 
>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>> after volume expansion.
>> 
>> 
>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>> 
>>> You mentioned repair had worked before, is that correct? What's the 
>>> difference now except the applied patch? Different OSD? Anything else?
>>> 
>>> 
>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>> 
>>>> It didn't work, emailed logs to you.
>>>> 
>>>> 
>>>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>>>> 
>>>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>>>> bluefs_rebalance_txn assignment..
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>>>>> PR doesn't seem to have changed since yesterday. Am I missing something?
>>>>>> 
>>>>>> 
>>>>>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>>>>>> 
>>>>>>> Please update the patch from the PR - it didn't update bluefs extents 
>>>>>>> list before.
>>>>>>> 
>>>>>>> Also please set debug bluestore 20 when re-running repair and collect 
>>>>>>> the log.
>>>>>>> 
>>>>>>> If repair doesn't help - would you send repair and startup logs 
>>>>>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>>>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>>>>>>>> backed up OSDs so now I have more room to play.
>>>>>>>> I posted log files using ceph-post-file with the following IDs:
>>>>>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>>>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>>>>>>>> 
>>>>>>>>> You did repair for any of this OSDs, didn't you? For all of them?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Would you please provide the log for both types (failed on mount and 
>>>>>>>>> failed with enospc) of failing OSDs. Prior to collecting please 
>>>>>>>>> remove existing ones prior and set debug bluestore to 20.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>>>>>> I was able to apply patches to mimic, but nothing changed. One osd 
>>>>>>>>>> that I had space expanded on fails with bluefs mount IO error, 
>>>>>>>>>> others keep failing with enospc.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>>>>>>>>>> 
>>>>>>>>>>> So you should call repair which rebalances (i.e. allocates 
>>>>>>>>>>> additional space) BlueFS space. Hence allowing OSD to start.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Igor
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>&

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Repair goes through only when the LVM volume has been expanded; otherwise it fails 
with enospc, as does any other operation. However, expanding the volume 
immediately renders bluefs unmountable with an IO error. 
2 of 3 OSDs got their bluefs log corrupted (the bluestore tool segfaults at the very 
end of bluefs-log-dump); I'm not sure whether the corruption occurred before or 
after volume expansion.
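
The expansion itself was done roughly like this (a sketch; the VG/LV names, the 
size increment and the OSD path are illustrative for my LVM-backed OSDs):

lvextend -L +20G /dev/ceph-vg/osd-block-1
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 bluefs-bdev-expand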


> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
> 
> You mentioned repair had worked before, is that correct? What's the 
> difference now except the applied patch? Different OSD? Anything else?
> 
> 
> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
> 
>> It didn't work, emailed logs to you.
>> 
>> 
>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>> 
>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>> bluefs_rebalance_txn assignment..
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>>> PR doesn't seem to have changed since yesterday. Am I missing something?
>>>> 
>>>> 
>>>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>>>> 
>>>>> Please update the patch from the PR - it didn't update bluefs extents 
>>>>> list before.
>>>>> 
>>>>> Also please set debug bluestore 20 when re-running repair and collect the 
>>>>> log.
>>>>> 
>>>>> If repair doesn't help - would you send repair and startup logs directly 
>>>>> to me as I have some issues accessing ceph-post-file uploads.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Igor
>>>>> 
>>>>> 
>>>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>>>>>> backed up OSDs so now I have more room to play.
>>>>>> I posted log files using ceph-post-file with the following IDs:
>>>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>>>> 
>>>>>> 
>>>>>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>>>>>> 
>>>>>>> You did repair for any of this OSDs, didn't you? For all of them?
>>>>>>> 
>>>>>>> 
>>>>>>> Would you please provide the log for both types (failed on mount and 
>>>>>>> failed with enospc) of failing OSDs. Prior to collecting please remove 
>>>>>>> existing ones prior and set debug bluestore to 20.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>>>> I was able to apply patches to mimic, but nothing changed. One osd 
>>>>>>>> that I had space expanded on fails with bluefs mount IO error, others 
>>>>>>>> keep failing with enospc.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>>>>>>>> 
>>>>>>>>> So you should call repair which rebalances (i.e. allocates additional 
>>>>>>>>> space) BlueFS space. Hence allowing OSD to start.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Igor
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>>>>>>>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>>>>>>>>>> deferred due to the nature of this thread (I'm not 100% sure though).
>>>>>>>>>> 
>>>>>>>>>> Here is my PR showing the idea (still untested and perhaps 
>>>>>>>>>> unfinished!!!)
>>>>>>>>>> 
>>>>>>>>>> https://github.com/ceph/ceph/pull/24353
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Igor
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>>>>>>>>>> Can you please confirm whether I 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
It didn't work, emailed logs to you.


> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
> 
> The major change is in get_bluefs_rebalance_txn function, it lacked 
> bluefs_rebalance_txn assignment..
> 
> 
> 
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> PR doesn't seem to have changed since yesterday. Am I missing something?
>> 
>> 
>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>> 
>>> Please update the patch from the PR - it didn't update bluefs extents list 
>>> before.
>>> 
>>> Also please set debug bluestore 20 when re-running repair and collect the 
>>> log.
>>> 
>>> If repair doesn't help - would you send repair and startup logs directly to 
>>> me as I have some issues accessing ceph-post-file uploads.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>>>> Yes, I did repair all OSDs and it finished with 'repair success'. I backed 
>>>> up OSDs so now I have more room to play.
>>>> I posted log files using ceph-post-file with the following IDs:
>>>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>>>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>>>> 
>>>> 
>>>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>>>> 
>>>>> You did repair for any of this OSDs, didn't you? For all of them?
>>>>> 
>>>>> 
>>>>> Would you please provide the log for both types (failed on mount and 
>>>>> failed with enospc) of failing OSDs. Prior to collecting please remove 
>>>>> existing ones prior and set debug bluestore to 20.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>>>> I was able to apply patches to mimic, but nothing changed. One osd that 
>>>>>> I had space expanded on fails with bluefs mount IO error, others keep 
>>>>>> failing with enospc.
>>>>>> 
>>>>>> 
>>>>>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>>>>>> 
>>>>>>> So you should call repair which rebalances (i.e. allocates additional 
>>>>>>> space) BlueFS space. Hence allowing OSD to start.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Igor
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>>>>>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>>>>>>>> deferred due to the nature of this thread (I'm not 100% sure though).
>>>>>>>> 
>>>>>>>> Here is my PR showing the idea (still untested and perhaps 
>>>>>>>> unfinished!!!)
>>>>>>>> 
>>>>>>>> https://github.com/ceph/ceph/pull/24353
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Igor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>>>>>>>> Can you please confirm whether I got this right:
>>>>>>>>> 
>>>>>>>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>>>>>>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>>>>>>>> @@ -9049,22 +9049,17 @@
>>>>>>>>> throttle_bytes.put(costs);
>>>>>>>>>   PExtentVector bluefs_gift_extents;
>>>>>>>>> -  if (bluefs &&
>>>>>>>>> -  after_flush - bluefs_last_balance >
>>>>>>>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>>>>>>>> -bluefs_last_balance = after_flush;
>>>>>>>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>>>>>>>> -assert(r >= 0);
>>>>>>>>> -if (r > 0) {
>>>>>>>>> -  for (auto& p : bluefs_gift_extents) {
>>>>>>>>> -bluefs_extents.insert(p.offset, p.length);
>>>>>>>>> -  }
>>>>>>>>> -  bufferlist bl;
>>>>>>>>> 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
PR doesn't seem to have changed since yesterday. Am I missing something?


> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
> 
> Please update the patch from the PR - it didn't update bluefs extents list 
> before.
> 
> Also please set debug bluestore 20 when re-running repair and collect the log.
> 
> If repair doesn't help - would you send repair and startup logs directly to 
> me as I have some issues accessing ceph-post-file uploads.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>> Yes, I did repair all OSDs and it finished with 'repair success'. I backed 
>> up OSDs so now I have more room to play.
>> I posted log files using ceph-post-file with the following IDs:
>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>> 
>> 
>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>> 
>>> You did repair for any of this OSDs, didn't you? For all of them?
>>> 
>>> 
>>> Would you please provide the log for both types (failed on mount and failed 
>>> with enospc) of failing OSDs. Prior to collecting please remove existing 
>>> ones prior and set debug bluestore to 20.
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>>> I was able to apply patches to mimic, but nothing changed. One osd that I 
>>>> had space expanded on fails with bluefs mount IO error, others keep 
>>>> failing with enospc.
>>>> 
>>>> 
>>>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>>>> 
>>>>> So you should call repair which rebalances (i.e. allocates additional 
>>>>> space) BlueFS space. Hence allowing OSD to start.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Igor
>>>>> 
>>>>> 
>>>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>>>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>>>>>> deferred due to the nature of this thread (I'm not 100% sure though).
>>>>>> 
>>>>>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>>>>>> 
>>>>>> https://github.com/ceph/ceph/pull/24353
>>>>>> 
>>>>>> 
>>>>>> Igor
>>>>>> 
>>>>>> 
>>>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>>>>>> Can you please confirm whether I got this right:
>>>>>>> 
>>>>>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>>>>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>>>>>> @@ -9049,22 +9049,17 @@
>>>>>>> throttle_bytes.put(costs);
>>>>>>>   PExtentVector bluefs_gift_extents;
>>>>>>> -  if (bluefs &&
>>>>>>> -  after_flush - bluefs_last_balance >
>>>>>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>>>>>> -bluefs_last_balance = after_flush;
>>>>>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>>>>>> -assert(r >= 0);
>>>>>>> -if (r > 0) {
>>>>>>> -  for (auto& p : bluefs_gift_extents) {
>>>>>>> -bluefs_extents.insert(p.offset, p.length);
>>>>>>> -  }
>>>>>>> -  bufferlist bl;
>>>>>>> -  encode(bluefs_extents, bl);
>>>>>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>>>> -   << bluefs_extents << std::dec << dendl;
>>>>>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>>>> +  int r = _balance_bluefs_freespace(_gift_extents);
>>>>>>> +  ceph_assert(r >= 0);
>>>>>>> +  if (r > 0) {
>>>>>>> +for (auto& p : bluefs_gift_extents) {
>>>>>>> +  bluefs_extents.insert(p.offset, p.length);
>>>>>>>   }
>>>>>>> +bufferlist bl;
>>>>>>> +encode(bluefs_extents, bl);
>>>>>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>>>> + << bluefs_extents <&l

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Sergey Malinin
Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93
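
For reference, the upload was along these lines (a sketch; the description and 
file path are illustrative):

ceph-post-file -d "bluestore repair log, enospc" /var/log/ceph/ceph-osd.1.log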


> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> 
> You did repair for any of this OSDs, didn't you? For all of them?
> 
> 
> Would you please provide the log for both types (failed on mount and failed 
> with enospc) of failing OSDs. Prior to collecting please remove existing ones 
> prior and set debug bluestore to 20.
> 
> 
> 
> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>> I was able to apply patches to mimic, but nothing changed. One osd that I 
>> had space expanded on fails with bluefs mount IO error, others keep failing 
>> with enospc.
>> 
>> 
>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>> 
>>> So you should call repair which rebalances (i.e. allocates additional 
>>> space) BlueFS space. Hence allowing OSD to start.
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>>>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>>>> deferred due to the nature of this thread (I'm not 100% sure though).
>>>> 
>>>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>>>> 
>>>> https://github.com/ceph/ceph/pull/24353
>>>> 
>>>> 
>>>> Igor
>>>> 
>>>> 
>>>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>>>> Can you please confirm whether I got this right:
>>>>> 
>>>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>>>> @@ -9049,22 +9049,17 @@
>>>>> throttle_bytes.put(costs);
>>>>>   PExtentVector bluefs_gift_extents;
>>>>> -  if (bluefs &&
>>>>> -  after_flush - bluefs_last_balance >
>>>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>>>> -bluefs_last_balance = after_flush;
>>>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>>>> -assert(r >= 0);
>>>>> -if (r > 0) {
>>>>> -  for (auto& p : bluefs_gift_extents) {
>>>>> -bluefs_extents.insert(p.offset, p.length);
>>>>> -  }
>>>>> -  bufferlist bl;
>>>>> -  encode(bluefs_extents, bl);
>>>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>> -   << bluefs_extents << std::dec << dendl;
>>>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>> +  int r = _balance_bluefs_freespace(_gift_extents);
>>>>> +  ceph_assert(r >= 0);
>>>>> +  if (r > 0) {
>>>>> +for (auto& p : bluefs_gift_extents) {
>>>>> +  bluefs_extents.insert(p.offset, p.length);
>>>>>   }
>>>>> +bufferlist bl;
>>>>> +encode(bluefs_extents, bl);
>>>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>>> + << bluefs_extents << std::dec << dendl;
>>>>> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>> }
>>>>>   // cleanup sync deferred keys
>>>>> 
>>>>>> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
>>>>>> 
>>>>>> So you have just a single main device per OSD
>>>>>> 
>>>>>> Then bluestore-tool wouldn't help, it's unable to expand BlueFS 
>>>>>> partition at main device, standalone devices are supported only.
>>>>>> 
>>>>>> Given that you're able to rebuild the code I can suggest to make a patch 
>>>>>> that triggers BlueFS rebalance (see code snippet below) on repairing.
>>>>>>  PExtentVector bluefs_gift_extents;
>>>>>>  int r = _balance_bluefs_freespace(_gift_extents);
>>>>>>  ceph_assert(r >= 0);
>>>>>>  if (r > 0) {
>>>>>>for (auto& p : bluefs_gift_extents) {
>>>>>>  bluefs_extents.insert(p.offset

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
I was able to apply the patches to mimic, but nothing changed. The one OSD whose 
space I had expanded fails with a bluefs mount IO error; the others keep failing 
with enospc.


> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
> 
> So you should call repair which rebalances (i.e. allocates additional space) 
> BlueFS space. Hence allowing OSD to start.
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>> Not exactly. The rebalancing from this kv_sync_thread still might be 
>> deferred due to the nature of this thread (I'm not 100% sure though).
>> 
>> Here is my PR showing the idea (still untested and perhaps unfinished!!!)
>> 
>> https://github.com/ceph/ceph/pull/24353
>> 
>> 
>> Igor
>> 
>> 
>> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>>> Can you please confirm whether I got this right:
>>> 
>>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>>> @@ -9049,22 +9049,17 @@
>>> throttle_bytes.put(costs);
>>>   PExtentVector bluefs_gift_extents;
>>> -  if (bluefs &&
>>> -  after_flush - bluefs_last_balance >
>>> -  cct->_conf->bluestore_bluefs_balance_interval) {
>>> -bluefs_last_balance = after_flush;
>>> -int r = _balance_bluefs_freespace(_gift_extents);
>>> -assert(r >= 0);
>>> -if (r > 0) {
>>> -  for (auto& p : bluefs_gift_extents) {
>>> -bluefs_extents.insert(p.offset, p.length);
>>> -  }
>>> -  bufferlist bl;
>>> -  encode(bluefs_extents, bl);
>>> -  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> -   << bluefs_extents << std::dec << dendl;
>>> -  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> +  int r = _balance_bluefs_freespace(_gift_extents);
>>> +  ceph_assert(r >= 0);
>>> +  if (r > 0) {
>>> +for (auto& p : bluefs_gift_extents) {
>>> +  bluefs_extents.insert(p.offset, p.length);
>>>   }
>>> +bufferlist bl;
>>> +encode(bluefs_extents, bl);
>>> +dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>> + << bluefs_extents << std::dec << dendl;
>>> +synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>> }
>>>   // cleanup sync deferred keys
>>> 
>>>> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
>>>> 
>>>> So you have just a single main device per OSD
>>>> 
>>>> Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition 
>>>> at main device, standalone devices are supported only.
>>>> 
>>>> Given that you're able to rebuild the code I can suggest to make a patch 
>>>> that triggers BlueFS rebalance (see code snippet below) on repairing.
>>>>  PExtentVector bluefs_gift_extents;
>>>>  int r = _balance_bluefs_freespace(_gift_extents);
>>>>  ceph_assert(r >= 0);
>>>>  if (r > 0) {
>>>>for (auto& p : bluefs_gift_extents) {
>>>>  bluefs_extents.insert(p.offset, p.length);
>>>>}
>>>>bufferlist bl;
>>>>encode(bluefs_extents, bl);
>>>>dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
>>>> << bluefs_extents << std::dec << dendl;
>>>>synct->set(PREFIX_SUPER, "bluefs_extents", bl);
>>>>  }
>>>> 
>>>> If it waits I can probably make a corresponding PR tomorrow.
>>>> 
>>>> Thanks,
>>>> Igor
>>>> On 10/1/2018 6:16 PM, Sergey Malinin wrote:
>>>>> I have rebuilt the tool, but none of my OSDs no matter dead or alive have 
>>>>> any symlinks other than 'block' pointing to LVM.
>>>>> I adjusted main device size but it looks like it needs even more space 
>>>>> for db compaction. After executing bluefs-bdev-expand OSD fails to start, 
>>>>> however 'fsck' and 'repair' commands finished successfully.
>>>>> 
>>>>> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
>>>>> 2018-10-01 18:02:39.763 7fc9226c6240  1 
>>>>> bluestore(/var/lib/ce

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Can you please confirm whether I got this right:

--- BlueStore.cc.bak  2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc  2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
   throttle_bytes.put(costs);
 
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
- after_flush - bluefs_last_balance >
- cct->_conf->bluestore_bluefs_balance_interval) {
-   bluefs_last_balance = after_flush;
-   int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-   assert(r >= 0);
-   if (r > 0) {
- for (auto& p : bluefs_gift_extents) {
-   bluefs_extents.insert(p.offset, p.length);
- }
- bufferlist bl;
- encode(bluefs_extents, bl);
- dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-  << bluefs_extents << std::dec << dendl;
- synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+   for (auto& p : bluefs_gift_extents) {
+ bluefs_extents.insert(p.offset, p.length);
}
+   bufferlist bl;
+   encode(bluefs_extents, bl);
+   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+<< bluefs_extents << std::dec << dendl;
+   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
   }
 
   // cleanup sync deferred keys

> On 1.10.2018, at 18:39, Igor Fedotov  wrote:
> 
> So you have just a single main device per OSD
> 
> Then bluestore-tool wouldn't help, it's unable to expand BlueFS partition at 
> main device, standalone devices are supported only.
> 
> Given that you're able to rebuild the code I can suggest to make a patch that 
> triggers BlueFS rebalance (see code snippet below) on repairing.
> PExtentVector bluefs_gift_extents;
> int r = _balance_bluefs_freespace(&bluefs_gift_extents);
> ceph_assert(r >= 0);
> if (r > 0) {
>   for (auto& p : bluefs_gift_extents) {
> bluefs_extents.insert(p.offset, p.length);
>   }
>   bufferlist bl;
>   encode(bluefs_extents, bl);
>   dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
><< bluefs_extents << std::dec << dendl;
>   synct->set(PREFIX_SUPER, "bluefs_extents", bl);
> }
> 
> If it waits I can probably make a corresponding PR tomorrow.
> 
> Thanks,
> Igor
> On 10/1/2018 6:16 PM, Sergey Malinin wrote:
>> I have rebuilt the tool, but none of my OSDs, whether dead or alive, have 
>> any symlinks other than 'block' pointing to LVM.
>> I adjusted main device size but it looks like it needs even more space for 
>> db compaction. After executing bluefs-bdev-expand OSD fails to start, 
>> however 'fsck' and 'repair' commands finished successfully.
>> 
>> 2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
>> 2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _open_alloc opening allocation metadata
>> 2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _open_alloc loaded 285 GiB in 2249899 extents
>> 2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
>> 2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
>> 2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
>> 2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
>> background work
>> 2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
>> [/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
>> 2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
>> 2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
>> 2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
>> /var/lib/ceph/osd/ceph-1/block) close
>> 2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
>> /var/lib/ceph/osd/ceph-1/block) close
>> 2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
>> object store
>> 2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
>> Input/output error
>> 
>> 
>>> On 1.10.2018, at 18:09, Igor Fedotov  wrote:
>>> 
>>> Well, actually you can avoid bluestore-tool rebuild.
>>> 
>>> You'll need to edit the first chunk of blocks.db where labels are stored. 
>>> (Please make a backup first!!!)

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
I have rebuilt the tool, but none of my OSDs, whether dead or alive, have any 
symlinks other than 'block' pointing to LVM.
I adjusted main device size but it looks like it needs even more space for db 
compaction. After executing bluefs-bdev-expand OSD fails to start, however 
'fsck' and 'repair' commands finished successfully.

2018-10-01 18:02:39.755 7fc9226c6240  1 freelist init
2018-10-01 18:02:39.763 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc opening allocation metadata
2018-10-01 18:02:40.907 7fc9226c6240  1 bluestore(/var/lib/ceph/osd/ceph-1) 
_open_alloc loaded 285 GiB in 2249899 extents
2018-10-01 18:02:40.951 7fc9226c6240 -1 bluestore(/var/lib/ceph/osd/ceph-1) 
_reconcile_bluefs_freespace bluefs extra 0x[6d6f00~50c80]
2018-10-01 18:02:40.951 7fc9226c6240  1 stupidalloc 0x0x55d053fb9180 shutdown
2018-10-01 18:02:40.963 7fc9226c6240  1 freelist shutdown
2018-10-01 18:02:40.963 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all 
background work
2018-10-01 18:02:40.967 7fc9226c6240  4 rocksdb: 
[/build/ceph-13.2.2/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2018-10-01 18:02:40.971 7fc9226c6240  1 bluefs umount
2018-10-01 18:02:40.975 7fc9226c6240  1 stupidalloc 0x0x55d053883800 shutdown
2018-10-01 18:02:40.975 7fc9226c6240  1 bdev(0x55d053c32e00 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.267 7fc9226c6240  1 bdev(0x55d053c32a80 
/var/lib/ceph/osd/ceph-1/block) close
2018-10-01 18:02:41.443 7fc9226c6240 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 18:02:41.443 7fc9226c6240 -1  ** ERROR: osd init failed: (5) 
Input/output error


> On 1.10.2018, at 18:09, Igor Fedotov  wrote:
> 
> Well, actually you can avoid bluestore-tool rebuild.
> 
> You'll need to edit the first chunk of blocks.db where labels are stored. 
> (Please make a backup first!!!)
> 
> Size label is stored at offset 0x52 and is 8 bytes long - little-endian 64bit 
> integer encoding. (Please verify that old value at this offset exactly 
> corresponds to you original volume size and/or 'size' label reported by 
> ceph-bluestore-tool).
> 
> So you have to put new DB volume size there. Or you can send the first 4K 
> chunk (e.g. extracted with dd) along with new DB volume size (in bytes) to me 
> and I'll do that for you.
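
A shell sketch of that edit (device path and size are placeholders; back up the
label chunk and verify the old value before writing anything):

DEV=/var/lib/ceph/osd/ceph-1/block.db    # placeholder: volume whose label is being fixed
NEW_SIZE=491305551872                    # placeholder: new volume size in bytes
dd if=$DEV of=/root/label-backup.bin bs=4096 count=1       # backup of the label chunk
dd if=$DEV bs=1 skip=$((0x52)) count=8 2>/dev/null | xxd   # inspect the old value
printf '%016x' $NEW_SIZE | fold -w2 | tac | tr -d '\n' | xxd -r -p |
  dd of=$DEV bs=1 seek=$((0x52)) count=8 conv=notrunc      # write new size, little-endian

Afterwards, ceph-bluestore-tool show-label should report the new 'size'.
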
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 5:32 PM, Igor Fedotov wrote:
>> 
>> 
>> On 10/1/2018 5:03 PM, Sergey Malinin wrote:
>>> Before I received your response, I had already added 20GB to the OSD (by 
>>> expanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool 
>>> bluestore-kv  compact", however it still needs more space.
>>> Is that because I didn't update DB size with set-label-key?
>> In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" 
>> command to commit bluefs volume expansion.
>> Unfortunately the last command doesn't handle "size" label properly. That's 
>> why you might need to backport and rebuild with the mentioned commits.
>> 
>>> What exactly is the label-key that needs to be updated, as I couldn't find 
>>> which one is related to DB:
>>> 
>>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
>>> inferring bluefs devices from bluestore path
>>> {
>>>  "/var/lib/ceph/osd/ceph-1/block": {
>>>  "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
>>>  "size": 471305551872,
>>>  "btime": "2018-07-31 03:06:43.751243",
>>>  "description": "main",
>>>  "bluefs": "1",
>>>  "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
>>>  "kv_backend": "rocksdb",
>>>  "magic": "ceph osd volume v026",
>>>  "mkfs_done": "yes",
>>>  "osd_key": "XXX",
>>>  "ready": "ready",
>>>  "whoami": "1"
>>>  }
>>> }
>> The 'size' label - but your output is for the block (aka slow) device.
>> 
>> It should return labels for db/wal devices as well (block.db and block.wal 
>> symlinks respectively). It works for me in master, can't verify with mimic 
>> at the moment though..
>> Here is output for master:
>> 
>> # bin/ceph-bluestore-tool show-label --path dev/osd0
>> inferring bluefs devices from bluestore path
>> {
>> "dev/osd0/block": {
>> "osd_uuid": "404dcbe

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Before I received your response, I had already added 20GB to the OSD (by 
expanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool 
bluestore-kv  compact", however it still needs more space.
Is that because I didn't update DB size with set-label-key?

What exactly is the label-key that needs to be updated, as I couldn't find 
which one is related to DB:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
{
"/var/lib/ceph/osd/ceph-1/block": {
"osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
"size": 471305551872,
"btime": "2018-07-31 03:06:43.751243",
"description": "main",
"bluefs": "1",
"ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "XXX",
"ready": "ready",
"whoami": "1"
}
}


> On 1.10.2018, at 16:48, Igor Fedotov  wrote:
> 
> This looks like a sort of deadlock when BlueFS needs some additional space to 
> replay the log left after the crash. Which happens during BlueFS open.
> 
> But such a space (at slow device as DB is full) is gifted in background 
> during bluefs rebalance procedure which will occur after the open.
> 
> Hence OSDs stuck in permanent crashing..
> 
> The only way to recover I can suggest for now is to expand DB volumes. You 
> can do that with lvm tools if you have any spare space for that.
> 
> Once resized you'll need ceph-bluestore-tool to indicate volume expansion to 
> BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label 
> with  set-label-key command.
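
A rough sketch of that sequence (LV/VG names and the size are examples for a
separate DB volume; exact ceph-bluestore-tool option spellings may differ between releases):

lvextend -L +20G /dev/ceph-db-vg/osd-1-db                        # grow the DB LV
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block.db -k size -v <new size in bytes>
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1   # verify the 'size' label
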
> 
> The latter is a bit tricky for mimic - you might need to backport 
> https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
> 
> and rebuild ceph-bluestore-tool. Alternatively you can backport 
> https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
> 
> then bluefs expansion and label updates will occur in a single step.
> 
> I'll do these backports in upstream but this will take some time to pass all 
> the procedures and get into official mimic  release.
> 
> Will fire a ticket to fix the original issue as well.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 3:28 PM, Sergey Malinin wrote:
>> These are LVM bluestore NVMe SSDs created with "ceph-volume lvm prepare 
>> --bluestore /dev/nvme0n1p3", i.e. without specifying wal/db devices.
>> OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified 
>> setting is bluestore_cache_kv_max=1073741824
>> 
>> DB/block usage collected by the prometheus module for the 3 failed and 1 
>> surviving OSDs:
>> 
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one 
>> has survived
>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
>> 
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
>> 
>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
>> 
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
>> 
>> 
>> First crashed OSD was doing DB compaction, others crashed shortly after 
>> during backfilling. Workload was "ceph-data-scan scan_inodes" filling 
>> metadata pool located on these OSDs at the rate close to 10k objects/second.
>> Here is the log excerpt of the first crash occurrence:
>> 
>> 2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) 
>> _balance_bluefs_freespace no allocate on 0x800

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
()+0x1b9) [0x56193271c399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b]
 7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, 
rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, 
CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b]
 8: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9)
 [0x561932763da9]
 9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) 
[0x5619325b5c44]
 11: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
 rocksdb::Env::Priority)+0x397) [0x5619325b8557]
 12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5619325b8cd7]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x266) 
[0x5619327a5e36]
 14: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) 
[0x5619327a5fb7]
 15: (()+0xbe733) [0x7fbf2b500733]
 16: (()+0x76db) [0x7fbf2bbf86db]
 17: (clone()+0x3f) [0x7fbf2abbc88f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


> On 1.10.2018, at 15:01, Igor Fedotov  wrote:
> 
> Hi Sergey,
> 
> could you please provide more details on your OSDs ?
> 
> What are sizes for DB/block devices?
> 
> Do you have any modifications in BlueStore config settings?
> 
> Can you share stats you're referring to?
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 12:29 PM, Sergey Malinin wrote:
>> Hello,
>> 3 of 4 NVME OSDs crashed at the same time on assert(0 == "bluefs enospc") 
>> and no longer start.
>> Stats collected just before crash show that ceph_bluefs_db_used_bytes is 
>> 100% used. Although OSDs have over 50% of free space, it is not reallocated 
>> for DB usage.
>> 
>> 2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 
>> 0x10 on bdev 1, free 0x0; fallback to bdev 2
>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 
>> 0x10 on bdev 2, dne
>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 
>> offset: 0x0 length: 0xa8700
>> 2018-10-01 12:18:06.748 7f1d6a04d240 -1 
>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 
>> BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
>> 7f1d6a04d240 time 2018-10-01 12:18:06.746800
>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == 
>> "bluefs enospc")
>> 
>>  ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic 
>> (stable)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x102) [0x7f1d6146f5c2]
>>  2: (()+0x26c787) [0x7f1d6146f787]
>>  3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
>> long)+0x1ab4) [0x5586b22684b4]
>>  4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
>>  5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
>>  6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
>>  7: (rocksdb::BuildTable(std::__cxx11::basic_string> std::char_traits, std::allocator > const&, rocksdb::Env*, 
>> rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, 
>> rocksdb::EnvOptions const&, rock
>> sdb::TableCache*, rocksdb::InternalIterator*, 
>> std::unique_ptr> std::default_delete >, rocksdb::FileMetaData*, 
>> rocksdb::InternalKeyComparator const&, std::vector> rocksdb::IntTblPropCollectorFactory, 
>> std::default_delete >, 
>> std::allocator> std::default_delete > > > co
>> nst*, unsigned int, std::__cxx11::basic_string, 
>> std::allocator > const&, std::vector> std::allocator >, unsigned long, rocksdb::SnapshotChecker*, 
>> rocksdb::Compression
>> Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, 
>> rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, 
>> rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, 
>> unsigned long, rocksdb
>> ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
>>  8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
>> rocksdb::ColumnFamilyData*, rocksdb::MemTable*, 
>> rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457]
>>  9: (rocksdb::DBImpl::RecoverLogFiles(std::vector> std::allocator > const&, unsigned long*, bool)+0x19de) 
>> [0x5586b232373e]
>>  10: (rocksdb::DBImpl::Recover(std::vector> std::allocator > const&, bool, bool, 
>> bool)+0x5d4) [0x5586b23242f4]

[ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-01 Thread Sergey Malinin
Hello,
3 of 4 NVME OSDs crashed at the same time on assert(0 == "bluefs enospc") and 
no longer start.
Stats collected just before crash show that ceph_bluefs_db_used_bytes is 100% 
used. Although OSDs have over 50% of free space, it is not reallocated for DB 
usage.

2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 
0x10 on bdev 1, free 0x0; fallback to bdev 2
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 
0x10 on bdev 2, dne
2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 
offset: 0x0 length: 0xa8700
2018-10-01 12:18:06.748 7f1d6a04d240 -1 
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int 
BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
7f1d6a04d240 time 2018-10-01 12:18:06.746800
/build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs 
enospc")

 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x7f1d6146f5c2]
 2: (()+0x26c787) [0x7f1d6146f787]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1ab4) [0x5586b22684b4]
 4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
 5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
 7: (rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, 
rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, 
rocksdb::EnvOptions const&, rock
sdb::TableCache*, rocksdb::InternalIterator*, 
std::unique_ptr >, rocksdb::FileMetaData*, 
rocksdb::InternalKeyComparator const&, std::vector >, 
std::allocator > > > co
nst*, unsigned int, std::__cxx11::basic_string, 
std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, 
rocksdb::Compression
Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, 
rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, 
rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, 
unsigned long, rocksdb
::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
 8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) 
[0x5586b2321457]
 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool)+0x19de) 
[0x5586b232373e]
 10: (rocksdb::DBImpl::Recover(std::vector > const&, bool, bool, 
bool)+0x5d4) [0x5586b23242f4]
 11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, std::vector >*, rocksdb::DB**, bool)+0x68b) 
[0x5586b232559b]
 12: (rocksdb::DB::Open(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, std::vector > std::allocator >*, rocksdb::DB**)+0x22) 
> > [0x5586b2326e72]
 13: (RocksDBStore::do_open(std::ostream&, bool, 
std::vector 
> const*)+0x170c) [0x5586b220219c]
 14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
 15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
 16: (OSD::init()+0x295) [0x5586b1d673c5]
 17: (main()+0x268d) [0x5586b1c554ed]
 18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
 19: (_start()+0x2a) [0x5586b1d1d7fa]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.



Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
> 
> On 27.09.2018, at 15:04, John Spray  wrote:
> 
> On Thu, Sep 27, 2018 at 11:34 AM Sergey Malinin  wrote:
>> 
>> Can such behaviour be related to data pool cache tiering?
> 
> Yes -- if there's a cache tier in use then deletions in the base pool
> can be delayed and then happen later when the cache entries get
> expired.
> 
> You may find that for a full scan of objects in the system, having a
> cache pool actually slows things down quite a lot, due to the overhead
> of promoting things in and out of the cache as we scan.

The 'forward' cache mode is still reported as dangerous. Is it safe enough to switch 
to forward mode while doing recovery?


> 
> John
> 
>> 
>> 
>>> On 27.09.2018, at 13:14, Sergey Malinin  wrote:
>>> 
>>> I'm trying alternate metadata pool approach. I double checked that MDS 
>>> servers are down and both original and recovery fs are set not joinable.
>>> 
>>> 
>>>> On 27.09.2018, at 13:10, John Spray  wrote:
>>>> 
>>>> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>>>>> 
>>>>> Hello,
>>>>> Does anybody have experience with using cephfs-data-scan tool?
>>>>> Questions I have are how long would it take to scan extents on filesystem 
>>>>> with 120M relatively small files? While running extents scan I noticed 
>>>>> that the number of objects in the data pool is decreasing over time. Is that 
>>>>> normal?
>>>> 
>>>> The scan_extents operation does not do any deletions, so that is
>>>> surprising.  Is it possible that you've accidentially left an MDS
>>>> running?
>>>> 
>>>> John
>>>> 
>>>> John
>>>> 
>>>>> Thanks.
>>> 
>> 



Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
Can such behaviour be related to data pool cache tiering?


> On 27.09.2018, at 13:14, Sergey Malinin  wrote:
> 
> I'm trying alternate metadata pool approach. I double checked that MDS 
> servers are down and both original and recovery fs are set not joinable.
> 
> 
>> On 27.09.2018, at 13:10, John Spray  wrote:
>> 
>> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>>> 
>>> Hello,
>>> Does anybody have experience with using cephfs-data-scan tool?
>>> Questions I have are how long would it take to scan extents on filesystem 
>>> with 120M relatively small files? While running extents scan I noticed that 
>>> the number of objects in the data pool is decreasing over time. Is that normal?
>> 
>> The scan_extents operation does not do any deletions, so that is
>> surprising.  Is it possible that you've accidentially left an MDS
>> running?
>> 
>> John
>> 
>> John
>> 
>>> Thanks.
> 



Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
I'm trying alternate metadata pool approach. I double checked that MDS servers 
are down and both original and recovery fs are set not joinable.


> On 27.09.2018, at 13:10, John Spray  wrote:
> 
> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>> 
>> Hello,
>> Does anybody have experience with using cephfs-data-scan tool?
>> Questions I have are how long would it take to scan extents on filesystem 
>> with 120M relatively small files? While running extents scan I noticed that 
>> the number of objects in the data pool is decreasing over time. Is that normal?
> 
> The scan_extents operation does not do any deletions, so that is
> surprising.  Is it possible that you've accidentially left an MDS
> running?
> 
> John
> 
> John
> 
>> Thanks.



[ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
Hello,
Does anybody have experience with using cephfs-data-scan tool?
Questions I have are how long would it take to scan extents on filesystem with 
120M relatively small files? While running extents scan I noticed that the number 
of objects in the data pool is decreasing over time. Is that normal?
Thanks.


[ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-09-26 Thread Sergey Malinin
Hello,
Followed standard upgrade procedure to upgrade from 13.2.1 to 13.2.2. 
After upgrade MDS cluster is down, mds rank 0 and purge_queue journal are 
damaged. Resetting purge_queue does not seem to work well as journal still 
appears to be damaged.
Can anybody help?

mds log:

  -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to 
version 586 from mon.2
  -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am 
now mds.0.583
  -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state 
change up:rejoin --> up:active
  -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
successful recovery!

   -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
Decode error at read_pos=0x322ec6636
   -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: 
up:active -> down:damaged
   -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
down:damaged seq 137
   -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to 
mon.ceph3 at mon:6789/0
   -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
0x563b321ad480 con 0

-3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
mon:6789/0 conn(0x563b3213e000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 
0x563b321ab880 mdsbeaco
n(85106/mds2 down:damaged seq 311 v587) v7
-2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
000
-1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
handle_mds_beacon down:damaged seq 311 rtt 0.038261
 0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!

# cephfs-journal-tool --journal=purge_queue journal inspect
Overall journal integrity: DAMAGED
Corrupt regions:
  0x322ec65d9-

# cephfs-journal-tool --journal=purge_queue journal reset
old journal was 13470819801~8463
new journal start will be 13472104448 (1276184 bytes past old end)
writing journal head
done

# cephfs-journal-tool --journal=purge_queue journal inspect
2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.0c8c
Overall journal integrity: DAMAGED
Objects missing:
  0xc8c
Corrupt regions:
  0x32300-


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Now you also have PGs in the 'creating' state. Creating PGs is a very 
IO-intensive operation.
To me, nothing special is going on there - recovery + deep scrubbing + creating 
PGs results in an expected degradation of performance.


September 25, 2018 2:32 PM, "by morphin"  wrote:

> 29 creating+down
> 4 stale+creating+down


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Settings that heavily affect recovery performance are:
osd_recovery_sleep
osd_recovery_sleep_[hdd|ssd]

See this for details:
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
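
For example (values are illustrative; 0 removes the sleep entirely, which speeds
up recovery at the cost of more client IO impact):

ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_ssd 0'
# persist the values in ceph.conf under [osd] if they should survive restarts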


September 25, 2018 1:57 PM, "by morphin"  wrote:

> Thank you for answer
> 
> What do you think the conf for speed the recover?
>


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Just let it recover.

  data:
pools:   1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage:   34 TiB used, 577 TiB / 611 TiB avail
pgs: 94.873% pgs not active
 48475/17901254 objects degraded (0.271%)
 1/8950627 objects unfound (0.000%)
 2631 peering
 637  activating
 562  down
 159  active+clean
 44   activating+degraded
 30   active+recovery_wait+degraded
 12   activating+undersized+degraded
 10   active+recovering+degraded
 10   active+undersized+degraded
 1active+clean+scrubbing+deep

You've got PGs being deep scrubbed, which puts considerable IO load on the OSDs.

September 25, 2018 1:23 PM, "by morphin"  wrote:


> What should I do now?
>


Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread Sergey Malinin
# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}

September 25, 2018 4:58 AM, "Brad Hubbard"  wrote:

> What does the output of the following command look like?
> 
> $ rados list-inconsistent-obj 1.92


[ceph-users] PG inconsistent, "pg repair" not working

2018-09-24 Thread Sergey Malinin
Hello,
During normal operation our cluster suddenly threw an error, and since then we 
have had 1 inconsistent PG, and one of the clients sharing the cephfs mount has 
started to occasionally log "ceph: Failed to find inode X".
"ceph pg repair" deep-scrubs the PG and fails with the same error in the log.
Can anyone advise how to fix this?
log entry:
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 
soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick suitable object info
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : scrub 
1.92 1:496296a8:::1000f44d0f4.0018:head on disk size (3751936) does not 
match object info size (0) adjusted for ondisk to (0)
2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 
scrub 3 errors

# ceph -v
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.92 is active+clean+inconsistent, acting [4,9]
# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}
# ceph pg 1.92 query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 520,
"up": [
4,
9
],
"acting": [
4,
9
],
"acting_recovery_backfill": [
"4",
"9"
],
"info": {
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "520'2456340",
"log_tail": "520'2453330",
"last_user_version": 7914566,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 63,
"epoch_pool_created": 63,
"last_epoch_started": 520,
"last_interval_started": 519,
"last_epoch_clean": 520,
"last_interval_clean": 519,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 519,
"same_interval_since": 519,
"same_primary_since": 514,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268"
},
"stats": {
"version": "520'2456340",
"reported_seq": "6115579",
"reported_epoch": "520",
"state": "active+clean+inconsistent",
"last_fresh": "2018-09-25 03:02:34.338256",
"last_change": "2018-09-25 02:17:35.631476",
"last_active": "2018-09-25 03:02:34.338256",
"last_peered": "2018-09-25 03:02:34.338256",
"last_clean": "2018-09-25 03:02:34.338256",
"last_became_active": "2018-09-24 15:25:30.238044",
"last_became_peered": "2018-09-24 15:25:30.238044",
"last_unstale": "2018-09-25 03:02:34.338256",
"last_undegraded": "2018-09-25 03:02:34.338256",
"last_fullsized": "2018-09-25 03:02:34.338256",
"mapping_epoch": 519,
"log_start": "520'2453330",
"ondisk_log_start": "520'2453330",
"created": 63,
"last_epoch_clean": 520,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268",
"log_size": 3010,
"ondisk_log_size": 3010,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 23138366490,
"num_objects": 479532,
"num_object_clones": 0,
"num_object_copies": 959064,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 479532,
"num_whiteouts": 0,
"num_read": 3295720,
"num_read_kb": 63508374,
"num_write": 2495519,
"num_write_kb": 81795199,
"num_scrub_errors": 3,
"num_shallow_scrub_errors": 3,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 550,
"num_bytes_recovered": 15760916,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0
},
"up": [
4,
9
],
"acting": [
4,
9
],
"blocked_by": [],
"up_primary": 4,
"acting_primary": 4,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 520,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "9",
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "515'2438936",
"log_tail": "511'2435926",
"last_user_version": 7902301,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 63,
"epoch_pool_created": 63,
"last_epoch_started": 520,
"last_interval_started": 519,
"last_epoch_clean": 520,
"last_interval_clean": 

Re: [ceph-users] CephFS small files overhead

2018-09-04 Thread Sergey Malinin
You need to re-deploy OSDs for bluestore_min_alloc_size to take effect.

> On 4.09.2018, at 18:31, andrew w goussakovski  wrote:
> 
> Hello
> 
> We are trying to use cephfs as storage for web graphics, such as
> thumbnails and so on.
> Is there any way to reduce the overhead on storage? On a test cluster we have
> 1 fs, 2 pools (meta and data) with replica size = 2
> 
> objects: 1.02 M objects, 1.1 GiB
> usage:   144 GiB used, 27 GiB / 172 GiB avail
> 
> So we have (144/2)/1.1*100%=6500% overhead.
> 
> ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
> (stable)
> osd storage - bluestore (changing bluestore_min_alloc_size makes no
> visible effect)
> 


Re: [ceph-users] CephFS Quota and ACL support

2018-08-27 Thread Sergey Malinin
It is supported in mainline kernel from elrepo. 
http://elrepo.org/tiki/tiki-index.php 

> On 27.08.2018, at 10:51, Oliver Freyermuth  
> wrote:
> 
> Dear Cephalopodians,
> 
> sorry if this is the wrong place to ask - but does somebody know if the 
> recently added quota support in the kernel client,
> and the ACL support, are going to be backported to RHEL 7 / CentOS 7 kernels? 
> Or can someone redirect me to the correct place to ask? 
> We don't have a RHEL subscription, but are using CentOS. 
> 
> These features are critical for us, so right now we use the Fuse client. My 
> hope is CentOS 8 will use a recent enough kernel
> to get those features automatically, though. 
> 
> Cheers and thanks,
>   Oliver
> 


Re: [ceph-users] CephFS configuration for millions of small files

2018-07-30 Thread Sergey Malinin
Changing the default 64k HDD min alloc size to 8k saved me 8 terabytes of disk 
space on CephFS with 150 million small files. You will need to redeploy OSDs 
for the change to take effect.
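
For example, a minimal sketch (the option only applies to OSDs created after it
is set; device path and OSD id are placeholders):

# in ceph.conf on the OSD host, before re-creating the OSD:
#   [osd]
#   bluestore_min_alloc_size_hdd = 8192
# then, after draining and destroying the OSD as usual:
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 5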


> On 30.07.2018, at 22:37, Anton Aleksandrov  wrote:
> 
> Yes, that is what I see in my test in regard to space. Can min alloc size be 
> changed? 



Re: [ceph-users] Mimic 13.2.1 release date

2018-07-23 Thread Sergey Malinin
Looks like we're not getting it soon.
http://tracker.ceph.com/issues/24981 


> On 23.07.2018, at 13:45, Wido den Hollander  wrote:
> 
> Any news on this yet? 13.2.1 would be very welcome! :-)
> 
> Wido
> 
> On 07/09/2018 05:11 PM, Wido den Hollander wrote:
>> Hi,
>> 
>> Is there a release date for Mimic 13.2.1 yet?
>> 
>> There are a few issues which currently make deploying with Mimic 13.2.0
>> a bit difficult, for example:
>> 
>> - https://tracker.ceph.com/issues/24423
>> - https://github.com/ceph/ceph/pull/22393
>> 
>> Especially the first one makes it difficult.
>> 
>> 13.2.1 would be very welcome with these fixes in there.
>> 
>> Is there a ETA for this version yet?
>> 
>> Wido


Re: [ceph-users] unfound blocks IO or gives IO error?

2018-06-22 Thread Sergey Malinin
From http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ 
 :

"Now 1 knows that these object exist, but there is no live ceph-osd who has a 
copy. In this case, IO to those objects will block, and the cluster will hope 
that the failed node comes back soon; this is assumed to be preferable to 
returning an IO error to the user."

> On 22.06.2018, at 16:16, Dan van der Ster  wrote:
> 
> Hi all,
> 
> Quick question: does an IO with an unfound object result in an IO
> error or should the IO block?
> 
> During a jewel to luminous upgrade some PGs passed through a state
> with unfound objects for a few seconds. And this seems to match the
> times when we had a few IO errors on RBD attached volumes.
> 
> Wondering what is the correct behaviour here...
> 
> Cheers, Dan


Re: [ceph-users] Filestore -> Bluestore

2018-06-12 Thread Sergey Malinin
You should pass underlying device instead of DM volume to ceph-volume.
On Jun 12, 2018, 15:41 +0300, Alfredo Deza , wrote:
> On Tue, Jun 12, 2018 at 7:04 AM, Vadim Bulst  
> wrote:
> > I cannot release this lock! This is an expansion shelf connected with two
> > cables to the controller. If there is no multipath management, the os would
> > see every disk at least twice. Ceph has to deal with it somehow. I guess I'm
> > not the only one who has a setup like this.
> >
>
> Do you have an LV on top of that dm? We don't support multipath devices:
>
> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#multipath-support
>
> > Best,
> >
> > Vadim
> >
> >
> >
> >
> > On 12.06.2018 12:55, Alfredo Deza wrote:
> > >
> > > On Tue, Jun 12, 2018 at 6:47 AM, Vadim Bulst 
> > > wrote:
> > > >
> > > > Hi Alfredo,
> > > >
> > > > thanks for your help. Just to make this clear, /dev/dm-0 is the name of 
> > > > my
> > > > multipath disk:
> > > >
> > > > root@polstor01:/home/urzadmin# ls -la /dev/disk/by-id/ | grep dm-0
> > > > lrwxrwxrwx 1 root root 10 Jun 12 07:50 dm-name-35000c500866f8947 ->
> > > > ../../dm-0
> > > > lrwxrwxrwx 1 root root 10 Jun 12 07:50 dm-uuid-mpath-35000c500866f8947
> > > > ->
> > > > ../../dm-0
> > > > lrwxrwxrwx 1 root root 10 Jun 12 07:50 scsi-35000c500866f8947 ->
> > > > ../../dm-0
> > > > lrwxrwxrwx 1 root root 10 Jun 12 07:50 wwn-0x5000c500866f8947 ->
> > > > ../../dm-0
> > > >
> > > > If I run pvdisplay this device is not listed.
> > >
> > > Either way, you should not use dm devices directly. If this is a
> > > multipath disk, then you must use that other name instead of /dev/dm-*
> > >
> > > I am not sure what kind of setup you have, but that mapper must
> > > release its lock so that you can zap. We ensure that works with LVM, I
> > > am not sure
> > > how to do that in your environment.
> > >
> > > For example, with dmcrypt you get into similar issues, that is why we
> > > check crypsetup, so that we can make dmcrypt release that device
> > > before zapping.
> > > >
> > > > Cheers,
> > > >
> > > > Vadim
> > > >
> > > >
> > > >
> > > > On 12.06.2018 12:40, Alfredo Deza wrote:
> > > > >
> > > > > On Tue, Jun 12, 2018 at 4:37 AM, Vadim Bulst
> > > > > 
> > > > > wrote:
> > > > > >
> > > > > > no change:
> > > > > >
> > > > > >
> > > > > > root@polstor01:/home/urzadmin# ceph-volume lvm zap --destroy 
> > > > > > /dev/dm-0
> > > > > > --> Zapping: /dev/dm-0
> > > > >
> > > > > This is the problem right here. Your script is using the dm device
> > > > > that belongs to an LV.
> > > > >
> > > > > What you want to do here is destroy/zap the LV. Not the dm device that
> > > > > belongs to the LV.
> > > > >
> > > > > To make this clear in the future, I've created:
> > > > > http://tracker.ceph.com/issues/24504
> > > > >
> > > > >
> > > > >
> > > > > > Running command: /sbin/cryptsetup status /dev/mapper/
> > > > > > stdout: /dev/mapper/ is inactive.
> > > > > > --> Skipping --destroy because no associated physical volumes are 
> > > > > > found
> > > > > > for
> > > > > > /dev/dm-0
> > > > > > Running command: wipefs --all /dev/dm-0
> > > > > > stderr: wipefs: error: /dev/dm-0: probing initialization failed:
> > > > > > Device
> > > > > > or
> > > > > > resource busy
> > > > > > --> RuntimeError: command returned non-zero exit status: 1
> > > > > >
> > > > > >
> > > > > > On 12.06.2018 09:03, Linh Vu wrote:
> > > > > >
> > > > > > ceph-volume lvm zap --destroy $DEVICE
> > > > > >
> > > > > > 
> > > > > > From: ceph-users  on behalf of 
> > > > > > Vadim
> > > > > > Bulst 
> > > > > > Sent: Tuesday, 12 June 2018 4:46:44 PM
> &

Re: [ceph-users] Filestore -> Bluestore

2018-06-11 Thread Sergey Malinin
The “Device or resource busy” error arises when no “--destroy” option is passed to 
ceph-volume.
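
For example (device path is a placeholder):

ceph-volume lvm zap /dev/sdX --destroy   # --destroy also removes the LVM/partition metadata on the device
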
On Jun 11, 2018, 22:44 +0300, Vadim Bulst , wrote:
> Dear Cephers,
>
> I'm trying to migrate our OSDs to Bluestore using this little script:
>
> #!/bin/bash
> HOSTNAME=$(hostname -s)
> OSDS=`ceph osd metadata | jq -c '[.[] | select(.osd_objectstore |
> contains("filestore")) ]' | jq '[.[] | select(.hostname |
> contains("'${HOSTNAME}'")) ]' | jq '.[].id'`
> IFS=' ' read -a OSDARRAY <<<$OSDS
> for OSD in "${OSDARRAY[@]}"; do
>   DEV=/dev/`ceph osd metadata | jq -c '.[] | select(.id=='${OSD}') |
> .backend_filestore_dev_node' | sed 's/"//g'`
>   echo "=== Migrating OSD nr ${OSD} on device ${DEV} ==="
>   ceph osd out ${OSD}
>     while ! ceph osd safe-to-destroy ${OSD} ; do echo "waiting for full
> evacuation"; sleep 60 ; done
>   systemctl stop ceph-osd@${OSD}
>   umount /var/lib/ceph/osd/ceph-${OSD}
>   /usr/sbin/ceph-volume lvm zap ${DEV}
>   ceph osd destroy ${OSD} --yes-i-really-mean-it
>   /usr/sbin/ceph-volume lvm create --bluestore --data ${DEV}
> --osd-id ${OSD}
> done
>
> Unfortunately - under normal circumstances this works flawlessly. In our
> case we have expansion shelfs connected as multipath devices to our nodes.
>
> /usr/sbin/ceph-volume lvm zap ${DEV}  is breaking with an error:
>
> OSD(s) 1 are safe to destroy without reducing data durability.
> --> Zapping: /dev/dm-0
> Running command: /sbin/cryptsetup status /dev/mapper/
>  stdout: /dev/mapper/ is inactive.
> Running command: wipefs --all /dev/dm-0
>  stderr: wipefs: error: /dev/dm-0: probing initialization failed:
> Device or resource busy
> -->  RuntimeError: command returned non-zero exit status: 1
> destroyed osd.1
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> osd tree -f json
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> -i - osd new 74f6ff02-d027-4fc6-9b93-3a96d753
> 5c8f 1
> --> Was unable to complete a new OSD, will rollback changes
> --> OSD will be destroyed, keeping the ID because it was provided with
> --osd-id
> Running command: ceph osd destroy osd.1 --yes-i-really-mean-it
>  stderr: destroyed osd.1
>
> -->  RuntimeError: Cannot use device (/dev/dm-0). A vg/lv path or an
> existing device is needed
>
>
> Does anybody know how to solve this problem?
>
> Cheers,
>
> Vadim
>
> --
> Vadim Bulst
>
> Universität Leipzig / URZ
> 04109 Leipzig, Augustusplatz 10
>
> phone: +49-341-97-33380
> mail: vadim.bu...@uni-leipzig.de
>


Re: [ceph-users] Reinstall everything

2018-06-10 Thread Sergey Malinin
Not sure if ceph-deploy has similar functionality, but executing ‘ceph-volume 
lvm zap  --destroy’ on the target machine would have removed the LVM mapping.
On Jun 10, 2018, 14:41 +0300, Max Cuttins , wrote:
> I solved it by myself.
> I'm writing my findings here to save some working hours for others.
> It sounds strange that nobody knew this.
> The issue is that data is purged but the LVM partitions are left in place.
> This means that you need to remove them manually.
> I just reinstalled the whole OS and on the data disks there are still LVM 
> partitions named "ceph-*". These partitions are ACTIVE by default.
> To get rid of the old data:
> #find disks
> > lsblk
> Look in the result for all the "ceph-*" volume groups and remove them:
> > vgchange -a n ceph-XX
> > vgremove ceph-XXX
> Do it for all disks.
> Now you can run ceph-deploy osd create correctly without being prompted that 
> the disk is in use.
>
>
>
> Il 06/06/2018 19:41, Max Cuttins ha scritto:
> > Hi everybody,
> >
> > I would like to start from zero.
> > However last time I run the command to purge everything I got an issue.
> >
> > I had a complete cleaned up system as expected, but disk was still OSD and 
> > the new installation refused to overwrite disk in use.
> > The only way to make it work was manually format the disks with fdisk and 
> > zap again with ceph later.
> >
> > Is there something I shoulded do before purge everything in order to do not 
> > have similar issue?
> >
> > Thanks,
> > Max


Re: [ceph-users] rbd map hangs

2018-06-07 Thread Sergey Malinin
http://elrepo.org/tiki/kernel-ml  provides 
4.17

> On 7.06.2018, at 19:13, Tracy Reed  wrote:
> 
> It's what's shipping with CentOS/RHEL 7 and probably what the vast
> majority of people are using aside from perhaps the Ubuntu LTS people.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mimic: failed to load OSD map for epoch X, got 0 bytes

2018-06-04 Thread Sergey Malinin
Hello,

Freshly created OSD won't start after upgrading to mimic:


2018-06-04 17:00:23.135 7f48cbecb240  0 osd.3 0 done with init, starting boot 
process
2018-06-04 17:00:23.135 7f48cbecb240  1 osd.3 0 start_boot
2018-06-04 17:00:23.135 7f48cbecb240 10 osd.3 0 start_boot - have maps 0..0
2018-06-04 17:00:23.139 7f48bc625700 10 osd.3 0 OSD::ms_get_authorizer type=mgr
2018-06-04 17:00:23.139 7f48b07fa700 10 osd.3 0 ms_handle_connect con 
0x562b92aa2a00
2018-06-04 17:00:23.139 7f48a6bc0700 10 osd.3 0 _preboot _preboot mon has 
osdmaps 17056..17606
2018-06-04 17:00:23.139 7f48a6bc0700 20 osd.3 0 update_osd_stat osd_stat(1.0 
GiB used, 3.6 TiB avail, 3.6 TiB total, peers [] op hist [])
2018-06-04 17:00:23.139 7f48a6bc0700  5 osd.3 0 heartbeat: osd_stat(1.0 GiB 
used, 3.6 TiB avail, 3.6 TiB total, peers [] op hist [])
2018-06-04 17:00:23.139 7f48a6bc0700 -1 osd.3 0 waiting for initial osdmap
2018-06-04 17:00:23.139 7f48b07fa700 20 osd.3 0 OSD::ms_dispatch: 
osd_map(17056..17056 src has 17056..17606 +gap_removed_snaps) v4
2018-06-04 17:00:23.139 7f48b07fa700 10 osd.3 0 do_waiters -- start
2018-06-04 17:00:23.139 7f48b07fa700 10 osd.3 0 do_waiters -- finish
2018-06-04 17:00:23.139 7f48b07fa700 20 osd.3 0 _dispatch 0x562b9276de40 
osd_map(17056..17056 src has 17056..17606 +gap_removed_snaps) v4
2018-06-04 17:00:23.139 7f48b07fa700  3 osd.3 0 handle_osd_map epochs 
[17056,17056], i have 0, src has [17056,17606]
2018-06-04 17:00:23.139 7f48b07fa700 10 osd.3 0 handle_osd_map message skips 
epochs 1..17055
2018-06-04 17:00:23.139 7f48b07fa700 10 osd.3 0 handle_osd_map  got full map 
for epoch 17056
2018-06-04 17:00:23.139 7f48b07fa700 20 osd.3 0 got_full_map 17056, nothing 
requested
2018-06-04 17:00:23.139 7f48b07fa700 20 osd.3 0 get_map 17055 - loading and 
decoding 0x562b92aea480
2018-06-04 17:00:23.139 7f48b07fa700 -1 osd.3 0 failed to load OSD map for 
epoch 17055, got 0 bytes
2018-06-04 17:00:23.147 7f48b07fa700 -1 /build/ceph-13.2.0/src/osd/OSD.h: In 
function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f48b07fa700 time 
2018-06-04 17:00:23.144480
/build/ceph-13.2.0/src/osd/OSD.h: 828: FAILED assert(ret)

ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) 
[0x7f48c32f35e2]
2: (()+0x26b7a7) [0x7f48c32f37a7]
3: (OSDService::get_map(unsigned int)+0x4a) [0x562b90410e9a]
4: (OSD::handle_osd_map(MOSDMap*)+0xfb1) [0x562b903b7dc1]
5: (OSD::_dispatch(Message*)+0xa1) [0x562b903c0a21]
6: (OSD::ms_dispatch(Message*)+0x56) [0x562b903c0d76]
7: (DispatchQueue::entry()+0xb92) [0x7f48c336c452]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f48c340a6cd]
9: (()+0x76db) [0x7f48c19ee6db]
10: (clone()+0x3f) [0x7f48c09b288f]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


Re: [ceph-users] v12.2.5 Luminous released

2018-05-01 Thread Sergey Malinin
Useless due to http://tracker.ceph.com/issues/22102 



> On 24.04.2018, at 23:29, Abhishek  wrote:
> 
> We're glad to announce the fifth bugfix release of Luminous v12.2.x long term 
> stable



Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
Forgot to mention that in my setup the issue was gone when I reverted back to 
a single MDS and switched dirfrag off.


On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:

> Then I tried to reduce the number of MDS, from 4 to 1, 



Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
The default for mds_log_events_per_segment is 1024; in my setup I ended up with 
8192.
I calculated that value as IOPS / log segments * 5 seconds (AFAIK the MDS 
performs journal maintenance once every 5 seconds by default).
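
For example (the MDS name is a placeholder; persist the value in ceph.conf under
[mds] so it survives restarts):

ceph tell mds.<name> injectargs '--mds_log_events_per_segment 8192'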


On Monday, March 19, 2018 at 15:20, Nicolas Huillard wrote:

> I can't find any doc about that mds_log_events_per_segment setting,
> specially on how to choose a good value.
> Can you elaborate on "original value multiplied several times" ?




Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
I experienced the same issue and was able to reduce metadata writes by raising 
mds_log_events_per_segment to several times its original value.

From: ceph-users  on behalf of Nicolas 
Huillard 
Sent: Monday, March 19, 2018 12:01:09 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Huge amount of cephfs metadata writes while only reading 
data (rsync from storage, to single disk)

Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the week-end to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test : I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots a tiny files (1-3kB) on
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations : I initially saw write performance on the destination
disk from a few 100kBps (during exploration of branches with tiny file)
to a few 10MBps (while copying large files), essentially seeing the
file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link stated to saturate at 200Mbps, while
destination disk writes is down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at 30MBps rate on
the metadata pool, which correlates to the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata (I'm mounting
on the clients with "noatime"), while I'm just reading data from it...
What could I tune to reduce that write-load-while-reading-only ?

--
Nicolas Huillard


Re: [ceph-users] Cephfs MDS slow requests

2018-03-15 Thread Sergey Malinin
On Friday, March 16, 2018 at 00:07, Deepak Naidu wrote:
> CephFS is not great for small files (in KBs) but works great with large file 
> sizes (MBs or GBs). So using it for a filer (NFS/SMB) use case needs administration 
> attention.
>  

Got to disagree with you there. CephFS (Luminous) performs perfectly in our 
setup: a single MDS with an 8 GB memory cache and dirfrag switched off, a pure SSD 
metadata pool, and 4 clients randomly accessing 90 million files mostly under 30kB 
in size.
Reading large directories (500k+ dentries) and unlinking thousands of files at 
once is fast as hell.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous v12.2.3 released

2018-02-21 Thread Sergey Malinin
Sadly, have to keep going with http://tracker.ceph.com/issues/22510






On Wednesday, February 21, 2018 at 22:50, Abhishek Lekshmanan wrote:

> We're happy to announce the third bugfix release of Luminous v12.2.x 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr Python error with prometheus plugin

2018-02-17 Thread Sergey Malinin
All I got with the script replacement is the following:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in 
respond
response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, 
in __call__
self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in 
__call__
return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 409, in metrics
if global_instance().have_mon_connection():
AttributeError: 'Module' object has no attribute 'have_mon_connection'


I'm not familiar with Python, so any advice is much appreciated.
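
For reference, a hedged sketch of the replacement step described in the quoted
message below (the module path is the one from the traceback above, the commit hash
is the one Konstantin links to, and restarting the mgr afterwards is assumed to be
acceptable):

wget -O /usr/lib/ceph/mgr/prometheus/module.py \
  https://raw.githubusercontent.com/ceph/ceph/d431de74def1b8889ad568ab99436362833d063e/src/pybind/mgr/prometheus/module.py
systemctl restart ceph-mgr.target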


On Friday, February 16, 2018 at 12:10, Konstantin Shalygin wrote:

> > i just try to get the prometheus plugin up and runing
> 
> 
> 
> Use module from master.
> 
> From this commit should work with 12.2.2, just wget it and replace 
> stock module.
> 
> https://github.com/ceph/ceph/blob/d431de74def1b8889ad568ab99436362833d063e/src/pybind/mgr/prometheus/module.py
> 
> 
> 
> 
> k
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High apply latency

2018-01-31 Thread Sergey Malinin
Deep scrub is I/O-expensive. If deep scrub is unnecessary, you can disable it 
per pool with "ceph osd pool set <pool> nodeep-scrub true". 


On Thursday, February 1, 2018 at 00:10, Jakub Jaszewski wrote:

> 3 active+clean+scrubbing+deep 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] BlueStore "allocate failed, wtf" error

2018-01-29 Thread Sergey Malinin
Hello, 
Can anyone help me interpret the below error, which one of our OSDs has been 
throwing occasionally since last night?
Thanks.

-
Jan 29 03:00:53 osd-host ceph-osd[10964]: 2018-01-29 03:00:53.509185 
7fe4ae431700 -1 bluestore(/var/lib/ceph/osd/ceph-9) _balance_bluefs_freespace 
allocate failed on 0x8000 min_alloc_size 0x4000
Jan 29 03:00:57 osd-host ceph-osd[10964]: 
/build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 'int 
BlueStore::_balance_bluefs_freespace(PExtentVector*)' thread 7fe4ae431700 time 
2018-01-29 03:00:57.736207
Jan 29 03:00:57 osd-host ceph-osd[10964]: 
/build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 4939: FAILED assert(0 == 
"allocate failed, wtf")
Jan 29 03:00:57 osd-host ceph-osd[10964]:  ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Jan 29 03:00:57 osd-host ceph-osd[10964]:  1: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x102) [0x55e88d2ae892]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  2: 
(BlueStore::_balance_bluefs_freespace(std::vector 
>*)+0x1b21) [0x55e88d1405c1]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  3: 
(BlueStore::_kv_sync_thread()+0x1ac0) [0x55e88d143040]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  4: 
(BlueStore::KVSyncThread::entry()+0xd) [0x55e88d186f8d]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  5: (()+0x76da) [0x7fe4bec766da]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  6: (clone()+0x5f) [0x7fe4bdce8d7f]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jan 29 03:00:57 osd-host ceph-osd[10964]: 2018-01-29 03:00:57.741058 
7fe4ae431700 -1 /build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 
'int BlueStore::_balance_bluefs_freespace(PExtentVector*)' thread 7fe4ae431700 
time 2018-01-29 03:00:57.736207
Jan 29 03:00:57 osd-host ceph-osd[10964]: 
/build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 4939: FAILED assert(0 == 
"allocate failed, wtf")
Jan 29 03:00:57 osd-host ceph-osd[10964]:  ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Jan 29 03:00:57 osd-host ceph-osd[10964]:  1: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x102) [0x55e88d2ae892]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  2: 
(BlueStore::_balance_bluefs_freespace(std::vector 
>*)+0x1b21) [0x55e88d1405c1]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  3: 
(BlueStore::_kv_sync_thread()+0x1ac0) [0x55e88d143040]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  4: 
(BlueStore::KVSyncThread::entry()+0xd) [0x55e88d186f8d]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  5: (()+0x76da) [0x7fe4bec766da]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  6: (clone()+0x5f) [0x7fe4bdce8d7f]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jan 29 03:00:57 osd-host ceph-osd[10964]:  0> 2018-01-29 03:00:57.741058 
7fe4ae431700 -1 /build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 
'int BlueStore::_balance_bluefs_freespace(PExtentVector*)' thread 7fe4ae431700 
time 2018-01-29 03:00:57.736207
Jan 29 03:00:57 osd-host ceph-osd[10964]: 
/build/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 4939: FAILED assert(0 == 
"allocate failed, wtf")
Jan 29 03:00:57 osd-host ceph-osd[10964]:  ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Jan 29 03:00:57 osd-host ceph-osd[10964]:  1: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x102) [0x55e88d2ae892]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  2: 
(BlueStore::_balance_bluefs_freespace(std::vector 
>*)+0x1b21) [0x55e88d1405c1]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  3: 
(BlueStore::_kv_sync_thread()+0x1ac0) [0x55e88d143040]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  4: 
(BlueStore::KVSyncThread::entry()+0xd) [0x55e88d186f8d]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  5: (()+0x76da) [0x7fe4bec766da]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  6: (clone()+0x5f) [0x7fe4bdce8d7f]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jan 29 03:00:57 osd-host ceph-osd[10964]: *** Caught signal (Aborted) **
Jan 29 03:00:57 osd-host ceph-osd[10964]:  in thread 7fe4ae431700 
thread_name:bstore_kv_sync
Jan 29 03:00:57 osd-host ceph-osd[10964]:  ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Jan 29 03:00:57 osd-host ceph-osd[10964]:  1: (()+0xa65824) [0x55e88d26b824]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  2: (()+0x11670) [0x7fe4bec80670]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  3: (gsignal()+0x9f) [0x7fe4bdc1577f]
Jan 29 03:00:57 osd-host ceph-osd[10964]:  4: 

Re: [ceph-users] How to speed up backfill

2018-01-10 Thread Sergey Malinin
It is also worth looking at the osd_recovery_sleep option.
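
A hedged example of adjusting it on the fly (the value is illustrative: 0 removes
the per-op sleep entirely, larger values throttle recovery/backfill harder):

ceph tell 'osd.*' injectargs '--osd_recovery_sleep=0'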


From: ceph-users  on behalf of Josef Zelenka 

Sent: Thursday, January 11, 2018 12:07:45 AM
To: shadow_lin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to speed up backfill



On 10/01/18 21:53, Josef Zelenka wrote:

Hi, i had the same issue a few days back, i tried playing around with these two:

ceph tell 'osd.*' injectargs '--osd-max-backfills '
ceph tell 'osd.*' injectargs '--osd-recovery-max-active  '
 and it helped greatly (increased our recovery speed 20x), but be careful not to 
overload your systems.


On 10/01/18 17:50, shadow_lin wrote:
Hi all,
I am playing with the backfill settings to try to find out how to control the 
speed of backfill.

So far I have only found that "osd max backfills" affects the backfill speed. But 
once all the PGs that need backfilling have started, I can't find any way to 
speed up the backfills further.

Especially when it comes to the last PG to recover, the speed is only a few 
MB/s (when multiple PGs are being backfilled the speed could be more than 
600MB/s in my test).

I am a little confused about the backfill and recovery settings. Though 
backfilling is a kind of recovery, it seems the recovery settings only concern 
replaying PG logs to recover a PG.

Would changing "osd recovery max active" or other recovery settings have any 
effect on backfilling?

I did try "osd recovery op priority" and "osd recovery max active" with no 
luck.

Any advice would be greatly appreciated. Thanks

2018-01-11

lin.yunfan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-08 Thread Sergey Malinin
You cannot force an MDS to quit the "replay" state, for the obvious reason of keeping data 
consistent. You might raise mds_beacon_grace to a somewhat reasonable value 
that would allow the MDS to replay the journal without being marked laggy and 
eventually blacklisted.
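
A hedged sketch of raising it on the monitors while the MDS replays (600 is an
arbitrary example; the default grace is 15 seconds):

ceph tell mon.* injectargs '--mds_beacon_grace=600'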


From: ceph-users  on behalf of Alessandro De 
Salvo 
Sent: Monday, January 8, 2018 7:40:59 PM
To: Lincoln Bryant; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

Thanks Lincoln,

indeed, as I said the cluster is recovering, so there are pending ops:


 pgs: 21.034% pgs not active
  1692310/24980804 objects degraded (6.774%)
  5612149/24980804 objects misplaced (22.466%)
  458 active+clean
  329 active+remapped+backfill_wait
  159 activating+remapped
  100 active+undersized+degraded+remapped+backfill_wait
  58  activating+undersized+degraded+remapped
  27  activating
  22  active+undersized+degraded+remapped+backfilling
  6   active+remapped+backfilling
  1   active+recovery_wait+degraded


If it's just a matter of waiting for the system to complete the recovery
it's fine, I'll deal with that, but I was wondering if there is a
more subtle problem here.

OK, I'll wait for the recovery to complete and see what happens, thanks.

Cheers,


 Alessandro


Il 08/01/18 17:36, Lincoln Bryant ha scritto:
> Hi Alessandro,
>
> What is the state of your PGs? Inactive PGs have blocked CephFS
> recovery on our cluster before. I'd try to clear any blocked ops and
> see if the MDSes recover.
>
> --Lincoln
>
> On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:
>> Hi,
>>
>> I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded.
>>
>> I have 2 active mds instances and 1 standby. All the active
>> instances
>> are now in replay state and show the same error in the logs:
>>
>>
>>  mds1 
>>
>> 2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),
>> process
>> (unknown), pid 164
>> starting mds.mds1 at -
>> 2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore
>> empty
>> --pid-file
>> 2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map
>> standby
>> 2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map
>> i
>> am now mds.1.20635
>> 2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635
>> handle_mds_map
>> state change up:boot --> up:replay
>> 2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
>> 2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635  recovery set
>> is 0
>> 2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635  waiting for
>> osdmap 21467 (which blacklists prior instance)
>>
>>
>>  mds2 
>>
>> 2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),
>> process
>> (unknown), pid 295
>> starting mds.mds2 at -
>> 2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore
>> empty
>> --pid-file
>> 2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map
>> standby
>> 2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map
>> i
>> am now mds.0.20637
>> 2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637
>> handle_mds_map
>> state change up:boot --> up:replay
>> 2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
>> 2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637  recovery set
>> is 1
>> 2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637  waiting for
>> osdmap 21467 (which blacklists prior instance)
>> 2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating
>> system
>> inode with ino:0x100
>> 2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating
>> system
>> inode with ino:0x1
>>
>> The cluster is recovering as we are changing some of the osds, and
>> there
>> are a few slow/stuck requests, but I'm not sure if this is the cause,
>> as
>> there is apparently no data loss (until now).
>>
>> How can I force the MDSes to quit the replay state?
>>
>> Thanks for any help,
>>
>>
>>   Alessandro
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS cache size limits

2018-01-08 Thread Sergey Malinin
In my experience 1GB cache could hold roughly 400k inodes.
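
By that estimate, a hedged sizing sketch (numbers are illustrative only; as noted
in the quoted thread below, the accounting is off by a constant factor, so leave
headroom):

# in ceph.conf on the MDS hosts -- ~4 GB of cache, roughly 1.6M inodes by the
# 400k-inodes-per-GB estimate above:
[mds]
    mds_cache_memory_limit = 4294967296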

_
From: Marc Roos 
Sent: Monday, January 8, 2018 23:02
Subject: Re: [ceph-users] MDS cache size limits
To: pdonnell , stefan 
Cc: ceph-users 



I guess the mds cache holds files, attributes etc but how many files
will the default "mds_cache_memory_limit": "1073741824" hold?



-Original Message-
From: Stefan Kooman [mailto:ste...@bit.nl]
Sent: vrijdag 5 januari 2018 12:54
To: Patrick Donnelly
Cc: Ceph Users
Subject: Re: [ceph-users] MDS cache size limits

Quoting Patrick Donnelly (pdonn...@redhat.com):
>
> It's expected but not desired: http://tracker.ceph.com/issues/21402
>
> The memory usage tracking is off by a constant factor. I'd suggest
> just lowering the limit so it's about where it should be for your
> system.

Thanks for the info. Yeah, we did exactly that (observe and adjust the
setting accordingly). Is this something worth mentioning in the
documentation? Especially when this "factor" is a constant? Over time
(with issue 21402 being worked on) things will change. Ceph operators
will want to make use of as much cache as possible without
overcommitting (the MDS won't notice until there is no more memory left,
restarts, and loses all its cache :/).

Gr. Stefan

--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] "ceph -s" shows no osds

2018-01-05 Thread Sergey Malinin
1.5.39 can be installed from “luminous” repo:
http://docs.ceph.com/docs/master/install/get-packages/
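
On Ubuntu, a hedged sketch of pulling ceph-deploy from that repo (commands follow
the get-packages doc linked above; the codename is taken from lsb_release):

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo "deb https://download.ceph.com/debian-luminous/ $(lsb_release -sc) main" | \
  sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph-deploy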


From: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>
Sent: Friday, January 5, 2018 9:16:48 AM
To: Sergey Malinin; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] "ceph -s" shows no osds

Thanks a lot Sergey. I searched for how to upgrade ceph-deploy and found that 
"pip install" is the most reasonable approach; a normal software repo install (i.e. 
sudo apt install ceph-deploy) always installs version 1.5.32.
Do you agree with this?

Regards,
Atatür
From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Thursday, January 4, 2018 3:46 PM
To: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] "ceph -s" shows no osds

Mgr installation was introduced in 1.5.38, you need to upgrade ceph-deploy.


From: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Thursday, January 4, 2018 2:01:57 PM
To: Sergey Malinin; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] "ceph -s" shows no osds

Hi,


ceph-deploy --version
1.5.32

Thank you,
Atatür
From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Thursday, January 4, 2018 12:51 PM
To: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] "ceph -s" shows no osds

What is your “ceph-deploy --version”?


From: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Thursday, January 4, 2018 9:14:39 AM
To: Sergey Malinin; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] "ceph -s" shows no osds

Hello Sergey,

I issued the mgr create command and it fails with

ceph-deploy  mgr create mon01
usage: ceph-deploy [-h] [-v | -q] [--version] [--username USERNAME]
   [--overwrite-conf] [--cluster NAME] [--ceph-conf CEPH_CONF]
   COMMAND ...
ceph-deploy: error: argument COMMAND: invalid choice: 'mgr' (choose from 'new', 
'install', 'rgw', 'mon', 'mds', 'gatherkeys', 'disk', 'osd', 'admin', 'repo', 
'config', 'uninstall', 'purge', 'purgedata', 'calamari', 'forgetkeys', 'pkg')

Any ideas?

Thank you..

From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Wednesday, January 3, 2018 5:56 PM
To: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] "ceph -s" shows no osds

What version are you using? Luminous needs mgr daemons running.


From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Wednesday, January 3, 2018 5:15:30 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] "ceph -s" shows no osds


Hello,

I am trying to set up a Ceph cluster on Ubuntu 16.04. I’ve set up 1 monitor/OSD host (hostname 
mon01) and 2 OSD hosts (osd01 and osd02). At one stage, I issued

   ceph-deploy  osd create mon01:sdb1 osd01:sdb1 osd02:sdb1

and ran successfully.  But when I issued below from the admin host:

ssh mon01 sudo ceph -s

cluster 9c7303db-56ab-4ddf-9fb8-1882754a4411
 health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
 monmap e1: 1 mons at {mon01=192.168.122.158:6789/0}
election epoch 4, quorum 0 mon01
 osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
  pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
  64 creating


There are no OSDs in the cluster. Can you please help?

Regards,
Atatur


Re: [ceph-users] data cleaup/disposal process

2018-01-04 Thread Sergey Malinin
http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
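
If the data really must be overwritten before deletion, a minimal sketch (pool and
image names are placeholders, the image must not be in use anywhere, and the device
path printed by "rbd map" is assumed to be /dev/rbd0):

rbd map rbd/myimage
dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct   # rewrites every object with zeros
rbd unmap /dev/rbd0
rbd rm rbd/myimage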


From: ceph-users  on behalf of M Ranga Swami 
Reddy 
Sent: Thursday, January 4, 2018 3:55:27 PM
To: ceph-users; ceph-devel
Subject: [ceph-users] data cleaup/disposal process

Hello,
In Ceph, is there a way to clean up data before deleting an image?

I mean wiping the data with '0' before deleting the image.

Please let me know if you have any suggestions here.

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph -s" shows no osds

2018-01-04 Thread Sergey Malinin
Mgr installation was introduced in 1.5.38, you need to upgrade ceph-deploy.
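
A hedged sketch of the minimal steps (assumes pip is available; mon01 is the
hostname used in the quoted output below):

sudo pip install --upgrade ceph-deploy   # any release >= 1.5.38 understands "mgr"
ceph-deploy mgr create mon01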


From: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>
Sent: Thursday, January 4, 2018 2:01:57 PM
To: Sergey Malinin; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] "ceph -s" shows no osds

Hi,


ceph-deploy --version
1.5.32

Thank you,
Atatür
From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Thursday, January 4, 2018 12:51 PM
To: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] "ceph -s" shows no osds

What is your “ceph-deploy --version”?


From: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Thursday, January 4, 2018 9:14:39 AM
To: Sergey Malinin; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] "ceph -s" shows no osds

Hello Sergey,

I issued the mgr create command and it fails with

ceph-deploy  mgr create mon01
usage: ceph-deploy [-h] [-v | -q] [--version] [--username USERNAME]
   [--overwrite-conf] [--cluster NAME] [--ceph-conf CEPH_CONF]
   COMMAND ...
ceph-deploy: error: argument COMMAND: invalid choice: 'mgr' (choose from 'new', 
'install', 'rgw', 'mon', 'mds', 'gatherkeys', 'disk', 'osd', 'admin', 'repo', 
'config', 'uninstall', 'purge', 'purgedata', 'calamari', 'forgetkeys', 'pkg')

Any ideas?

Thank you..

From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Wednesday, January 3, 2018 5:56 PM
To: Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] "ceph -s" shows no osds

What version are you using? Luminous needs mgr daemons running.


From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Wednesday, January 3, 2018 5:15:30 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] "ceph -s" shows no osds


Hello,

I am trying to set up a Ceph cluster on Ubuntu 16.04. I’ve set up 1 monitor/OSD host (hostname 
mon01) and 2 OSD hosts (osd01 and osd02). At one stage, I issued

   ceph-deploy  osd create mon01:sdb1 osd01:sdb1 osd02:sdb1

and ran successfully.  But when I issued below from the admin host:

ssh mon01 sudo ceph -s

cluster 9c7303db-56ab-4ddf-9fb8-1882754a4411
 health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
 monmap e1: 1 mons at {mon01=192.168.122.158:6789/0}
election epoch 4, quorum 0 mon01
 osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
  pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
  64 creating


There are no OSDs in the cluster. Can you please help?

Regards,
Atatur





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph -s" shows no osds

2018-01-04 Thread Sergey Malinin
What is your “ceph-deploy --version”?


From: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>
Sent: Thursday, January 4, 2018 9:14:39 AM
To: Sergey Malinin; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] "ceph -s" shows no osds

Hello Sergey,

I issued the mgr create command and it fails with

ceph-deploy  mgr create mon01
usage: ceph-deploy [-h] [-v | -q] [--version] [--username USERNAME]
   [--overwrite-conf] [--cluster NAME] [--ceph-conf CEPH_CONF]
   COMMAND ...
ceph-deploy: error: argument COMMAND: invalid choice: 'mgr' (choose from 'new', 
'install', 'rgw', 'mon', 'mds', 'gatherkeys', 'disk', 'osd', 'admin', 'repo', 
'config', 'uninstall', 'purge', 'purgedata', 'calamari', 'forgetkeys', 'pkg')

Any ideas?

Thank you..

From: Sergey Malinin [mailto:h...@newmail.com]
Sent: Wednesday, January 3, 2018 5:56 PM
To: Hüseyin Atatür YILDIRIM <hyildi...@havelsan.com.tr>; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] "ceph -s" shows no osds

What version are you using? Luminous needs mgr daemons running.


From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Hüseyin Atatür YILDIRIM 
<hyildi...@havelsan.com.tr<mailto:hyildi...@havelsan.com.tr>>
Sent: Wednesday, January 3, 2018 5:15:30 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] "ceph -s" shows no osds


Hello,

I am trying to set up a Ceph cluster on Ubuntu 16.04. I’ve set up 1 monitor/OSD host (hostname 
mon01) and 2 OSD hosts (osd01 and osd02). At one stage, I issued

   ceph-deploy  osd create mon01:sdb1 osd01:sdb1 osd02:sdb1

and ran successfully.  But when I issued below from the admin host:

ssh mon01 sudo ceph -s

cluster 9c7303db-56ab-4ddf-9fb8-1882754a4411
 health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
 monmap e1: 1 mons at {mon01=192.168.122.158:6789/0}
election epoch 4, quorum 0 mon01
 osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
  pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
  64 creating


There are no OSDs in the cluster. Can you please help?

Regards,
Atatur





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous - SSD partitions disssapeared

2018-01-03 Thread Sergey Malinin
To make device ownership persist over reboots, you can set up udev rules.
The article you referenced seems to have nothing to do with bluestore. When you 
zapped /dev/sda, you wiped the bluestore metadata stored on the DB partition, so the 
newly created partitions, if they were created apart from the block storage, are no 
longer relevant and that's why the OSD daemon throws the error.
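
A hedged sketch of such a rule (device names are placeholders -- matching on
partition UUIDs is safer if the device order can change across reboots):

cat > /etc/udev/rules.d/90-ceph-ssd-perms.rules <<'EOF'
KERNEL=="sda[12]", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"
EOF
udevadm control --reload && udevadm trigger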


From: Steven Vacaroaia <ste...@gmail.com>
Sent: Wednesday, January 3, 2018 7:20:12 PM
To: Sergey Malinin
Cc: ceph-users
Subject: Re: [ceph-users] ceph luminous - SSD partitions disssapeared

They were not.
After I changed it manually I was still unable to start the service.
Furthermore, a reboot screwed up permissions again:

 ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 root disk 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 root disk 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# chown ceph:ceph /dev/sda1
[root@osd01 ~]# chown ceph:ceph /dev/sda2
[root@osd01 ~]# ls -al /dev/sda*
brw-rw 1 root disk 8, 0 Jan  3 11:10 /dev/sda
brw-rw 1 ceph ceph 8, 1 Jan  3 11:10 /dev/sda1
brw-rw 1 ceph ceph 8, 2 Jan  3 11:10 /dev/sda2
[root@osd01 ~]# systemctl start ceph-osd@3
[root@osd01 ~]# systemctl status ceph-osd@3
● ceph-osd@3.service - Ceph object storage daemon osd.3
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; 
vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2018-01-03 
11:18:09 EST; 5s ago
  Process: 3823 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i 
--setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 3818 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3823 (code=exited, status=1/FAILURE)

Jan 03 11:18:09 osd01.tor.medavail.net<http://osd01.tor.medavail.net> 
systemd[1]: Unit ceph-osd@3.service entered failed state.
Jan 03 11:18:09 osd01.tor.medavail.net<http://osd01.tor.medavail.net> 
systemd[1]: ceph-osd@3.service failed.



ceph-osd[3823]: 2018-01-03 11:18:08.515687 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3/block.db) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void bluesto
ceph-osd[3823]: 2018-01-03 11:18:08.515710 7fa55aec8d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db check block 
device(/var/lib/ceph/osd/ceph-3/block.db) label returned: (22) Invalid argument



This is very odd as the server was working fine

What is the proper procedure for replacing a failed SSD drive used by BlueStore?



On 3 January 2018 at 10:23, Sergey Malinin 
<h...@newmail.com<mailto:h...@newmail.com>> wrote:
Are the actual devices (not only the udev links) owned by user “ceph”?


From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Steven Vacaroaia <ste...@gmail.com<mailto:ste...@gmail.com>>
Sent: Wednesday, January 3, 2018 6:19:45 PM
To: ceph-users
Subject: [ceph-users] ceph luminous - SSD partitions disssapeared

Hi,

After a reboot, all the partitions created on the SSD drive disappeared.
They were used by the bluestore DB and WAL, so the OSDs are down.

The following error messages are in /var/log/messages:


Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992218 7f4b52b9ed00 -1 
bluestore(/var/lib/ceph/osd/ceph-6) _open_db /var/lib/ceph/osd/ceph-6/block.db 
link target doesn't exist
Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.993231 7f7ad37b1d00 -1 
bluestore(/var/lib/ceph/osd/ceph-5) _open_db /var/lib/ceph/osd/ceph-5/block.db 
link target doesn't exist

Then I decided to take this opportunity, "assume" a dead SSD, and thus recreate 
the partitions.

I zapped /dev/sda and then
I used this 
http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/  to 
recreate partition for ceph-3
Unfortunately it is now "complaining" about permissions, but they seem fine:

Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992120 7f74003d1d00 -1 
bdev(0x562336677800 /var/lib/ceph/osd/ceph-3/block.db) open open got: (13) 
Permission denied
Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992131 7f74003d1d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db add block 
device(/var/lib/ceph/osd/ceph-3/block.db) returned: (13) Permission denied

 ls -al /var/lib/ceph/osd/ceph-3/
total 60
drwxr-xr-x  2 ceph ceph 310 Jan  2 16:39 .
drwxr-x---. 7 ceph ceph 131 Jan  2 16:39 ..
-rw-r--r--  1 root root 183 Jan  2 16:39 activate.monmap
-rw-r--r--  1 ceph ceph   3 Jan  2 16:39 active
lrwxrwxrwx  1 ceph ceph  58 Jan  2 16:32 block -> 
/dev/disk/by-partuuid/13560618-5942-4c7e-922a-1fafddb4a4d2
lrwxrwxrwx  1 ceph ceph  58 Jan  2 16:32 block.db -> 
/dev/disk/by-partuuid/5f610ecb-cb78-44d3-b503-016840d33ff6
-rw-r--r--  1 ceph ceph  37 Jan  2 16:32 block.db_uuid
-rw-r--r--  1 ceph ceph  37 Jan  2 16:32 block_uuid
l

Re: [ceph-users] ceph luminous - SSD partitions disssapeared

2018-01-03 Thread Sergey Malinin
Are the actual devices (not only the udev links) owned by user “ceph”?


From: ceph-users  on behalf of Steven 
Vacaroaia 
Sent: Wednesday, January 3, 2018 6:19:45 PM
To: ceph-users
Subject: [ceph-users] ceph luminous - SSD partitions disssapeared

Hi,

After a reboot, all the partitions created on the SSD drive disappeared.
They were used by the bluestore DB and WAL, so the OSDs are down.

The following error messages are in /var/log/messages:


Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992218 7f4b52b9ed00 -1 
bluestore(/var/lib/ceph/osd/ceph-6) _open_db /var/lib/ceph/osd/ceph-6/block.db 
link target doesn't exist
Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.993231 7f7ad37b1d00 -1 
bluestore(/var/lib/ceph/osd/ceph-5) _open_db /var/lib/ceph/osd/ceph-5/block.db 
link target doesn't exist

Then I decided to take this opportunity, "assume" a dead SSD, and thus recreate 
the partitions.

I zapped /dev/sda and then
I used this 
http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/  to 
recreate partition for ceph-3
Unfortunately it is now "complaining" about permissions, but they seem fine:

Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992120 7f74003d1d00 -1 
bdev(0x562336677800 /var/lib/ceph/osd/ceph-3/block.db) open open got: (13) 
Permission denied
Jan  3 09:54:12 osd01 ceph-osd: 2018-01-03 09:54:12.992131 7f74003d1d00 -1 
bluestore(/var/lib/ceph/osd/ceph-3) _open_db add block 
device(/var/lib/ceph/osd/ceph-3/block.db) returned: (13) Permission denied

 ls -al /var/lib/ceph/osd/ceph-3/
total 60
drwxr-xr-x  2 ceph ceph 310 Jan  2 16:39 .
drwxr-x---. 7 ceph ceph 131 Jan  2 16:39 ..
-rw-r--r--  1 root root 183 Jan  2 16:39 activate.monmap
-rw-r--r--  1 ceph ceph   3 Jan  2 16:39 active
lrwxrwxrwx  1 ceph ceph  58 Jan  2 16:32 block -> 
/dev/disk/by-partuuid/13560618-5942-4c7e-922a-1fafddb4a4d2
lrwxrwxrwx  1 ceph ceph  58 Jan  2 16:32 block.db -> 
/dev/disk/by-partuuid/5f610ecb-cb78-44d3-b503-016840d33ff6
-rw-r--r--  1 ceph ceph  37 Jan  2 16:32 block.db_uuid
-rw-r--r--  1 ceph ceph  37 Jan  2 16:32 block_uuid
lrwxrwxrwx  1 ceph ceph  58 Jan  2 16:32 block.wal -> 
/dev/disk/by-partuuid/04d38ce7-c9e7-4648-a3f5-7b459e508109



Has anyone had to deal with a similar issue?

How do I fix the permissions?

What is the proper procedure for dealing with a "dead" SSD?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph -s" shows no osds

2018-01-03 Thread Sergey Malinin
What version are you using? Luminous needs mgr daemons running.


From: ceph-users  on behalf of Hüseyin 
Atatür YILDIRIM 
Sent: Wednesday, January 3, 2018 5:15:30 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] "ceph -s" shows no osds


Hello,

I am trying to set up a Ceph cluster on Ubuntu 16.04. I’ve set up 1 monitor/OSD host (hostname 
mon01) and 2 OSD hosts (osd01 and osd02). At one stage, I issued

   ceph-deploy  osd create mon01:sdb1 osd01:sdb1 osd02:sdb1

and ran successfully.  But when I issued below from the admin host:

ssh mon01 sudo ceph -s

cluster 9c7303db-56ab-4ddf-9fb8-1882754a4411
 health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
64 pgs stuck unclean
no osds
 monmap e1: 1 mons at {mon01=192.168.122.158:6789/0}
election epoch 4, quorum 0 mon01
 osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise,require_jewel_osds
  pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
  64 creating


There are no OSDs in the cluster. Can you please help?

Regards,
Atatur




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80 Firefly released

2014-05-11 Thread Sergey Malinin
 
 # ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'
 
 Ignore the mon.a: injectargs: failed to parse arguments: true 
 warnings, this appears to be a bug [0].
 
 

It will work this way: 
ceph tell mon.* injectargs -- --mon_osd_allow_primary_affinity=true

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-07 Thread Sergey Malinin
http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
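
For reference, a minimal sketch of exposing an RBD image through tgt (assumes tgt
was built with the rbd backing store, bs_rbd, and a Debian-style layout that reads
/etc/tgt/conf.d; the IQN and pool/image names are placeholders):

cat > /etc/tgt/conf.d/rbd-vmstore.conf <<'EOF'
<target iqn.2014-05.com.example:rbd-vmstore>
    bs-type rbd
    backing-store rbd/vmstore
</target>
EOF
service tgt reload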
 


On Wednesday, May 7, 2014 at 15:06, Andrei Mikhailovsky wrote:

 
 Vlad, is there a howto somewhere describing the steps on how to set up iSCSI 
 multipathing over ceph? It looks like a good alternative to NFS.
 
 Thanks
 
 From: Vlad Gorbunov vadi...@gmail.com (mailto:vadi...@gmail.com)
 To: Andrei Mikhailovsky and...@arhont.com (mailto:and...@arhont.com)
 Cc: ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
 Sent: Wednesday, 7 May, 2014 12:02:09 PM
 Subject: Re: [ceph-users] NFS over CEPH - best practice
 
 For XenServer or VMware it is better to use an iSCSI client against tgtd with ceph 
 support. You can install tgtd on an OSD or monitor server and use multipath for 
 failover.
 
 On Wed, May 7, 2014 at 9:47 PM, Andrei Mikhailovsky and...@arhont.com 
 (mailto:and...@arhont.com) wrote:
  Hello guys,
  
  I would like to offer NFS service to the XenServer and VMWare hypervisors 
  for storing vm images. I am currently running ceph rbd with kvm, which is 
  working reasonably well.
  
  What would be the best way of running NFS services over CEPH, so that the 
  XenServer and VMWare's vm disk images are stored in ceph storage over NFS?
  
  Many thanks
  
  Andrei 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

