Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-11-20 Thread Yan, Zheng
You can run the 13.2.1 mds on another machine. Kill all client sessions and wait until the purge queue is empty; then it's safe to run the 13.2.2 mds. Run the command "cephfs-journal-tool --rank=cephfs_name:rank --journal=purge_queue header get". The purge queue is empty when write_pos == expire_pos. On Wed, Nov
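A minimal sketch of that check, assuming a filesystem named "cephfs" and rank 0 (substitute your own names):
```
# Check whether the purge queue has fully drained before starting a 13.2.2 MDS.
cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue header get
# In the printed header, the queue is empty once write_pos == expire_pos.
```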

[ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-11-20 Thread Chris Martin
I am also having this problem. Zheng (or anyone else), any idea how to perform this downgrade on a node that is also a monitor and an OSD node? dpkg complains of a dependency conflict when I try to install ceph-mds_13.2.1-1xenial_amd64.deb: ``` dpkg: dependency problems prevent configuration of

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Alfredo Daniel Rezinovsky
On 08/10/18 11:47, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 9:46 PM Alfredo Daniel Rezinovsky wrote: On 08/10/18 10:20, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 9:07 PM Alfredo Daniel Rezinovsky wrote: On 08/10/18 09:45, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 9:46 PM Alfredo Daniel Rezinovsky wrote: > > > > On 08/10/18 10:20, Yan, Zheng wrote: > > On Mon, Oct 8, 2018 at 9:07 PM Alfredo Daniel Rezinovsky > > wrote: > >> > >> > >> On 08/10/18 09:45, Yan, Zheng wrote: > >>> On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel Rezinovsky

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Alfredo Daniel Rezinovsky
On 08/10/18 10:20, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 9:07 PM Alfredo Daniel Rezinovsky wrote: On 08/10/18 09:45, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel Rezinovsky wrote: On 08/10/18 07:06, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Alfredo Daniel Rezinovsky
On 08/10/18 10:32, Sergey Malinin wrote: On 8.10.2018, at 16:07, Alfredo Daniel Rezinovsky wrote: So I can stop cephfs-data-scan, run the import, downgrade, and then reset the purge queue? I suggest that you backup metadata pool so that in case of

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin
> On 8.10.2018, at 16:07, Alfredo Daniel Rezinovsky > wrote: > > So I can stop cephfs-data-scan, run the import, downgrade, and then reset > the purge queue? I suggest that you back up the metadata pool so that in case of failure you can continue with data scan from where you stopped. I've
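A hedged sketch of one way to take that metadata-pool backup, assuming the pool is named cephfs_metadata (adjust to your cluster):
```
# Export the CephFS metadata pool to a local file before any destructive step.
rados -p cephfs_metadata export /backup/cephfs_metadata.export
# It can later be restored with "rados -p <pool> import <file>" if needed.
```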

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 9:07 PM Alfredo Daniel Rezinovsky wrote: > > > > On 08/10/18 09:45, Yan, Zheng wrote: > > On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel Rezinovsky > > wrote: > >> On 08/10/18 07:06, Yan, Zheng wrote: > >>> On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin wrote: > >

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Alfredo Daniel Rezinovsky
On 08/10/18 09:45, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel Rezinovsky wrote: On 08/10/18 07:06, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin wrote: On 8.10.2018, at 12:37, Yan, Zheng wrote: On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin wrote: > > > > > On 8.10.2018, at 12:37, Yan, Zheng wrote: > > > > On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin wrote: > >> > >> What additional steps need to be taken in order to (try to) regain access > >> to the fs providing that I backed up

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin
> On 8.10.2018, at 12:37, Yan, Zheng wrote: > > On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin wrote: >> >> What additional steps need to be taken in order to (try to) regain access to >> the fs providing that I backed up metadata pool, created alternate metadata >> pool and ran

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Yan, Zheng
On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin wrote: > > What additional steps need to be taken in order to (try to) regain access to > the fs providing that I backed up metadata pool, created alternate metadata > pool and ran scan_extents, scan_links, scan_inodes, and somewhat recursive >

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-08 Thread Sergey Malinin
What additional steps need to be taken in order to (try to) regain access to the fs, provided that I backed up the metadata pool, created an alternate metadata pool, and ran scan_extents, scan_links, scan_inodes, and a somewhat recursive scrub? After that I only mounted the fs read-only to backup the
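For reference, the basic order of those cephfs-data-scan passes (a sketch based on the CephFS disaster-recovery docs; "cephfs_data" is a placeholder data pool name, and the alternate-pool options are omitted):
```
# Rebuild metadata from the data pool contents, in the documented order.
cephfs-data-scan scan_extents cephfs_data   # pass 1: recover file sizes/layouts
cephfs-data-scan scan_inodes cephfs_data    # pass 2: recreate inode backtraces
cephfs-data-scan scan_links                 # pass 3: fix linkage and dirfrag stats
```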

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-07 Thread Yan, Zheng
Sorry, this is caused by a wrong backport. Downgrading the mds to 13.2.1 and marking the mds repaired can resolve this. Yan, Zheng On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin wrote: > > Update: > I discovered http://tracker.ceph.com/issues/24236 and > https://github.com/ceph/ceph/pull/22146 > Make sure
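A minimal sketch of the second half of that advice, assuming rank 0 of the only filesystem is the damaged one:
```
# After downgrading the MDS binaries to 13.2.1, clear the damaged flag on the rank.
ceph mds repaired 0
```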

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-07 Thread Sergey Malinin
I was able to start the MDS and mount the fs with broken ownership/permissions and 8k out of millions of files in lost+found. > On 7.10.2018, at 02:04, Sergey Malinin wrote: > > I'm at scan_links now, will post an update once it has finished. > Have you reset the journal after fs recovery as

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-06 Thread Sergey Malinin
I'm at scan_links now, will post an update once it has finished. Have you reset the journal after fs recovery as suggested in the doc? quote: If the damaged filesystem contains dirty journal data, it may be recovered next with: cephfs-journal-tool --rank=:0 event recover_dentries list
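The doc section being quoted pairs that dentry recovery with a journal reset afterwards; a sketch of both steps (the filesystem name is truncated in the quote and shown as <fs_name> here):
```
# Recover dentries from the damaged journal, then reset it.
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list
cephfs-journal-tool --rank=<fs_name>:0 journal reset
```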

Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-05 Thread Sergey Malinin
Update: I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146 Make sure that it is not relevant in your case before proceeding to operations that modify on-disk data. > On

[ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-09-26 Thread Sergey Malinin
Hello, I followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2. After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are damaged. Resetting the purge_queue does not seem to work well, as the journal still appears to be damaged. Can anybody help? mds log: -789>
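Presumably the reset that was attempted looked something like this sketch (filesystem name is a placeholder; as the rest of the thread shows, a downgrade turned out to be needed instead):
```
# Reset the purge-queue journal for rank 0 of the named filesystem.
cephfs-journal-tool --rank=<fs_name>:0 --journal=purge_queue journal reset
```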

Re: [ceph-users] MDS damaged

2018-07-15 Thread Nicolas Huillard
On Sunday, 15 July 2018 at 11:01 -0500, Adam Tygart wrote: > Check out the message titled "IMPORTANT: broken luminous 12.2.6 > release in repo, do not upgrade" > > It sounds like 12.2.7 should come *soon* to fix this transparently. Thanks. I didn't notice this one. I should monitor more

Re: [ceph-users] MDS damaged

2018-07-15 Thread Adam Tygart
Check out the message titled "IMPORTANT: broken luminous 12.2.6 release in repo, do not upgrade" It sounds like 12.2.7 should come *soon* to fix this transparently. -- Adam On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard wrote: > Hi all, > > I have the same problem here: > * during the

Re: [ceph-users] MDS damaged

2018-07-15 Thread Nicolas Huillard
Hi all, I have the same problem here: * during the upgrade from 12.2.5 to 12.2.6 * I restarted all the OSD servers in turn, which did not trigger anything bad * a few minutes after upgrading the OSDs/MONs/MDSs/MGRs (all on the same set of servers) and unsetting noout, I upgraded the clients,

Re: [ceph-users] MDS damaged

2018-07-13 Thread Alessandro De Salvo
Hi Dan, you're right, I was following the mimic instructions (which indeed worked on my mimic testbed), but luminous is different and I missed the additional step. Works now, thanks! Alessandro On 13/07/18 17:51, Dan van der Ster wrote: On Fri, Jul 13, 2018 at 4:07 PM

Re: [ceph-users] MDS damaged

2018-07-13 Thread Dan van der Ster
On Fri, Jul 13, 2018 at 4:07 PM Alessandro De Salvo wrote: > However, I cannot reduce the number of mdses anymore, I used to do > that with e.g.: > > ceph fs set cephfs max_mds 1 > > Trying this with 12.2.6 apparently has no effect; I am left with 2 > active mdses. Is this another bug? Are
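The step missed on luminous (acknowledged in the reply above) is most likely that lowering max_mds does not stop extra ranks by itself there; the surplus rank still has to be deactivated explicitly. A sketch, assuming the filesystem is named cephfs:
```
# On luminous, shrink the MDS cluster and then deactivate the now-surplus rank 1.
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:1
```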

Re: [ceph-users] MDS damaged

2018-07-13 Thread Alessandro De Salvo
Thanks all, 100..inode, mds_snaptable and 1..inode were not corrupted, so I left them as they were. I have re-injected all the bad objects, for all mdses (2 per filesystem) and all filesystems I had (2), and after setting the mdses as repaired my filesystems are back!

Re: [ceph-users] MDS damaged

2018-07-13 Thread Yan, Zheng
On Thu, Jul 12, 2018 at 11:39 PM Alessandro De Salvo wrote: > > Some progress, and more pain... > > I was able to recover the 200. using the ceph-objectstore-tool for > one of the OSDs (all identical copies) but trying to re-inject it just with > rados put was giving no error while the

Re: [ceph-users] MDS damaged

2018-07-13 Thread Adam Tygart
Bluestore. On Fri, Jul 13, 2018, 05:56 Dan van der Ster wrote: > Hi Adam, > > Are your osds bluestore or filestore? > > -- dan > > > On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart wrote: > > > > I've hit this today with an upgrade to 12.2.6 on my backup cluster. > > Unfortunately there were

Re: [ceph-users] MDS damaged

2018-07-13 Thread Dan van der Ster
Hi Adam, Are your osds bluestore or filestore? -- dan On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart wrote: > > I've hit this today with an upgrade to 12.2.6 on my backup cluster. > Unfortunately there were issues with the logs (in that the files > weren't writable) until after the issue struck.

Re: [ceph-users] MDS damaged

2018-07-12 Thread Adam Tygart
I've hit this today with an upgrade to 12.2.6 on my backup cluster. Unfortunately there were issues with the logs (in that the files weren't writable) until after the issue struck. 2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log [ERR] : 5.255 full-object read crc 0x4e97b4e !=

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo
Some progress, and more pain... I was able to recover the 200. using the ceph-objectstore-tool for one of the OSDs (all identical copies) but trying to re-inject it just with rados put was giving no error while the get was still giving the same I/O error. So the solution was to rm the
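A hedged sketch of that extract, remove, and re-inject sequence (the object name is truncated in the archive and shown as <object> here; OSD 23 and pg 10.14 come from the "ceph osd map" output elsewhere in this thread):
```
# Extract the object from a (stopped) OSD's store.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 <object> get-bytes recovered.bin
# Remove the unreadable copy, then re-inject the recovered bytes.
rados -p cephfs_metadata rm <object>
rados -p cephfs_metadata put <object> recovered.bin
```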

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo
Unfortunately yes, all the OSDs were restarted a few times, but no change. Thanks, Alessandro On 12/07/18 15:55, Paul Emmerich wrote: This might seem like a stupid suggestion, but: have you tried to restart the OSDs? I've also encountered some random CRC errors that only showed

Re: [ceph-users] MDS damaged

2018-07-12 Thread Paul Emmerich
This might seem like a stupid suggestion, but: have you tried to restart the OSDs? I've also encountered some random CRC errors that only showed up when trying to read an object, but not on scrubbing, that magically disappeared after restarting the OSD. However, in my case it was clearly related

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo
On 12/07/18 11:20, Alessandro De Salvo wrote: On 12/07/18 10:58, Dan van der Ster wrote: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo wrote: OK, I found where the object is: ceph osd map cephfs_metadata

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo
On 12/07/18 10:58, Dan van der Ster wrote: On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo wrote: OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object

Re: [ceph-users] MDS damaged

2018-07-12 Thread Dan van der Ster
On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum wrote: > > On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo > wrote: >> >> OK, I found where the object is: >> >> >> ceph osd map cephfs_metadata 200. >> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg >>

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo
> On 11 Jul 2018, at 23:25, Gregory Farnum > wrote: > >> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo >> wrote: >> OK, I found where the object is: >> >> >> ceph osd map cephfs_metadata 200. >> osdmap e632418 pool 'cephfs_metadata' (10) object

Re: [ceph-users] MDS damaged

2018-07-11 Thread Gregory Farnum
On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo < alessandro.desa...@roma1.infn.it> wrote: > OK, I found where the object is: > > > ceph osd map cephfs_metadata 200. > osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg > 10.844f3494 (10.14) -> up ([23,35,18], p23)

Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo
OK, I found where the object is: ceph osd map cephfs_metadata 200. osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23) So, looking at the osds 23, 35 and 18 logs in fact I see: osd.23:

Re: [ceph-users] MDS damaged

2018-07-11 Thread John Spray
On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo wrote: > > Hi John, > > in fact I get an I/O error by hand too: > > > rados get -p cephfs_metadata 200. 200. > error getting cephfs_metadata/200.: (5) Input/output error Next step would be to go look for corresponding

Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo
Hi John, in fact I get an I/O error by hand too: rados get -p cephfs_metadata 200. 200. error getting cephfs_metadata/200.: (5) Input/output error Can this be recovered somehow? Thanks, Alessandro On 11/07/18 18:33, John Spray wrote: On Wed, Jul 11,

Re: [ceph-users] MDS damaged

2018-07-11 Thread John Spray
On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo wrote: > > Hi, > > after the upgrade to luminous 12.2.6 today, all our MDSes have been > marked as damaged. Trying to restart the instances only result in > standby MDSes. We currently have 2 filesystems active and 2 MDSes each. > > I found the

Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo
Hi Gregory, thanks for the reply. I have the dump of the metadata pool, but I'm not sure what to check there. Is that what you mean? The cluster was operational until today at noon, when a full restart of the daemons was issued, like many other times in the past. I was trying to issue the

Re: [ceph-users] MDS damaged

2018-07-11 Thread Gregory Farnum
Have you checked the actual journal objects as the "journal export" suggested? Did you identify any actual source of the damage before issuing the "repaired" command? What is the history of the filesystems on this cluster? On Wed, Jul 11, 2018 at 8:10 AM Alessandro De Salvo <
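For reference, checking and exporting the journal typically looks like the following sketch (rank and filesystem name are placeholders; these exact commands are not given in this message):
```
# Validate the on-disk journal and keep a copy of it before further repairs.
cephfs-journal-tool --rank=<fs_name>:0 journal inspect
cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin
```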

[ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo
Hi, after the upgrade to luminous 12.2.6 today, all our MDSes have been marked as damaged. Trying to restart the instances only result in standby MDSes. We currently have 2 filesystems active and 2 MDSes each. I found the following error messages in the mon: mds.0 :6800/2412911269

Re: [ceph-users] MDS damaged

2017-10-26 Thread Daniel Davidson
Thanks John. It has been up for a few hours now, and I am slowly adding more workload to it over time, just so I can see what is going on better. I was wondering, since this object is used to delete data, if there was a chance that deleting data from the system could cause it to be used and

Re: [ceph-users] MDS damaged

2017-10-26 Thread John Spray
On Thu, Oct 26, 2017 at 12:40 PM, Daniel Davidson wrote: > And at the risk of bombing the mailing list, I can also see that the > stray7_head omapkey is not being recreated: > rados -p igbhome_data listomapkeys 100. > stray0_head > stray1_head > stray2_head >

Re: [ceph-users] MDS damaged

2017-10-26 Thread Daniel Davidson
And at the risk of bombing the mailing list, I can also see that the stray7_head omapkey is not being recreated: rados -p igbhome_data listomapkeys 100. stray0_head stray1_head stray2_head stray3_head stray4_head stray5_head stray6_head stray8_head stray9_head On 10/26/2017 05:08 AM,

Re: [ceph-users] MDS damaged

2017-10-26 Thread Daniel Davidson
I increased the logging of the mds to try and get some more information.  I think the relevant lines are: 2017-10-26 05:03:17.661683 7f1c598a6700  0 mds.0.cache.dir(607) _fetched missing object for [dir 607 ~mds0/stray7/ [2,head] auth v=108918871 cv=0/0 ap=1+0+0 state=1610645632 f(v1

Re: [ceph-users] MDS damaged

2017-10-26 Thread Ronny Aasen
if you were following this page: http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/ then there are normally hours of troubleshooting in the following paragraph, before finally admitting defeat and marking the object as lost: "It is possible that there are other

Re: [ceph-users] MDS damaged

2017-10-25 Thread danield
Hi Ronny, From the documentation, I thought this was the proper way to resolve the issue. Dan > On 24. okt. 2017 19:14, Daniel Davidson wrote: >> Our ceph system is having a problem. >> >> A few days ago we had a pg that was marked as inconsistent, and today I >> fixed it with a: >> >> #ceph

Re: [ceph-users] MDS damaged

2017-10-25 Thread Daniel Davidson
The system is down again, saying it is missing the same stray7. 2017-10-25 11:24:29.736774 mds.0 [WRN] failed to reconnect caps for missing inodes: 2017-10-25 11:24:29.736779 mds.0 [WRN]  ino 100147160e6 2017-10-25 11:24:29.753665 mds.0 [ERR] dir 607 object missing on disk; some files

Re: [ceph-users] MDS damaged

2017-10-25 Thread Daniel Davidson
Thanks for the information. I did: # ceph daemon mds.ceph-0 scrub_path / repair recursive Saw in the logs it finished # ceph daemon mds.ceph-0 flush journal Saw in the logs it finished #ceph mds fail 0 #ceph mds repaired 0 And it went back to missing stray7 again.  I added that back like we
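The same sequence laid out step by step, exactly as described above (the daemon name mds.ceph-0 comes from the original message):
```
# Online scrub with repair, flush the journal, then restart the rank and clear the damaged flag.
ceph daemon mds.ceph-0 scrub_path / repair recursive
ceph daemon mds.ceph-0 flush journal
ceph mds fail 0
ceph mds repaired 0
```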

Re: [ceph-users] MDS damaged

2017-10-25 Thread John Spray
Commands that start with "ceph daemon" take mds.<name> rather than a rank (notes on terminology here: http://docs.ceph.com/docs/master/cephfs/standby/). The name is how you would refer to the daemon from systemd; it's often set to the hostname where the daemon is running by default. John On Wed, Oct 25,

Re: [ceph-users] MDS damaged

2017-10-25 Thread Daniel Davidson
I do have a problem with running the commands you mentioned to repair the mds: # ceph daemon mds.0 scrub_path admin_socket: exception getting command descriptions: [Errno 2] No such file or directory admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Re: [ceph-users] MDS damaged

2017-10-25 Thread Daniel Davidson
John, thank you so much. After doing the initial rados command you mentioned, it is back up and running. It did complain about a bunch of files (which frankly are not important) having duplicate inodes, but I will run those repair and scrub commands you mentioned and get it back clean again.

Re: [ceph-users] MDS damaged

2017-10-25 Thread Ronny Aasen
On 24. okt. 2017 19:14, Daniel Davidson wrote: Our ceph system is having a problem. A few days ago we had a pg that was marked as inconsistent, and today I fixed it with a: #ceph pg repair 1.37c then a file was stuck as missing so I did a: #ceph pg 1.37c mark_unfound_lost delete pg has 1

Re: [ceph-users] MDS damaged

2017-10-24 Thread Daniel Davidson
This finally finished: 2017-10-24 22:50:11.766519 7f775e539bc0  1 scavenge_dentries: frag 607. is corrupt, overwriting Events by type:   OPEN: 5640344   SESSION: 10   SUBTREEMAP: 8070   UPDATE: 1384964 Errors: 0 I truncated: #cephfs-journal-tool journal reset old journal was

Re: [ceph-users] MDS damaged

2017-10-24 Thread Daniel Davidson
Out of desperation, I started with the disaster recovery guide: http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/ After exporting the journal, I started doing: cephfs-journal-tool event recover_dentries summary And that was about 7 hours ago, and it is still running.  I am getting a
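The two steps from that guide, as run here (a sketch; on jewel with a single filesystem the default rank is used):
```
# Back up the MDS journal, then salvage dentries from it into the metadata store.
cephfs-journal-tool journal export backup.bin
cephfs-journal-tool event recover_dentries summary
```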