[ceph-users] RBD corruption when removing tier cache

2017-11-30 Thread Jan Pekař - Imatic
Hi all, today I tested adding an SSD cache tier to a pool. Everything worked, but when I tried to remove it and ran rados -p hot-pool cache-flush-evict-all I got: rbd_data.9c000238e1f29. failed to flush /rbd_data.9c000238e1f29.: (2) No such file or directory
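
For reference, a typical removal sequence for a writeback cache tier looks roughly like the sketch below. The pool names hot-pool and cold-pool are placeholders, not taken from the thread, and the exact cache-mode name can differ between releases.

    # Stop new writes from being absorbed by the cache, then drain it
    # ("proxy" instead of "forward" on some releases).
    ceph osd tier cache-mode hot-pool forward --yes-i-really-mean-it
    rados -p hot-pool cache-flush-evict-all
    # Only after the cache pool is empty, detach it from the base pool.
    ceph osd tier remove-overlay cold-pool
    ceph osd tier remove cold-pool hot-pool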

[ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-06 Thread Jan Pekař - Imatic
Hi, I'm using Debian stretch with ceph 12.2.1-1~bpo80+1 and qemu 1:2.8+dfsg-6+deb9u3. I'm running 3 nodes with 3 monitors and 8 OSDs, all on IPv6. When I tested the cluster, I detected a strange and severe problem. On the first node I'm running QEMU guests with a librados disk connection to
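
As a point of reference, attaching an RBD image to a QEMU guest on the command line looks roughly like the sketch below; the pool name rbd, image name vm-disk and client id libvirt are placeholders, not details from this thread.

    # Check the image is reachable through librbd first.
    qemu-img info rbd:rbd/vm-disk:id=libvirt:conf=/etc/ceph/ceph.conf
    # Boot a guest with the image as a virtio disk.
    qemu-system-x86_64 -m 2048 -enable-kvm \
        -drive format=raw,if=virtio,file=rbd:rbd/vm-disk:id=libvirt:conf=/etc/ceph/ceph.conf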

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic
was deadlocked, the worst case that I would expect would be your guest OS complaining about hung kernel tasks related to disk IO (since the disk wouldn't be responding). On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote:

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic
attached inside QEMU/KVM virtuals. JP On 7.11.2017 10:57, Piotr Dałek wrote: On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote: Hi, I'm using Debian stretch with ceph 12.2.1-1~bpo80+1 and qemu 1:2.8+dfsg-6+deb9u3. I'm running 3 nodes with 3 monitors and 8 OSDs, all on IPv6. When I

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-08 Thread Jan Pekař - Imatic
was deadlocked, the worst case that I would expect would be your guest OS complaining about hung kernel tasks related to disk IO (since the disk wouldn't be responding). On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote:

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic
den Hollander wrote: On 7 November 2017 at 10:14, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Additional info - it is not librbd related, I mapped the disk through rbd map and it was the same - virtuals were stuck/frozen. It happened exactly when in my log appeared Why aren't you
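
For comparison, the kernel-client path mentioned here looks roughly like this sketch (pool and image names are placeholders):

    # Map the image through the kernel rbd driver instead of librbd.
    rbd map rbd/vm-disk --id admin
    # The device (e.g. /dev/rbd0) can then be handed to the guest or mounted directly.
    rbd showmapped
    rbd unmap /dev/rbd0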

Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem

2017-11-07 Thread Jan Pekař - Imatic
st trying a different version of QEMU and/or a different host OS since loss of a disk shouldn't hang it -- only potentially the guest OS. On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: I'm calling kill -STOP to simulate
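
A rough sketch of that kind of simulation, freezing an OSD process so it stops responding without exiting (the OSD id 3 and the pgrep pattern are only illustrations):

    OSD_PID=$(pgrep -f 'ceph-osd .*--id 3')
    kill -STOP "$OSD_PID"   # the OSD stops answering heartbeats and client IO
    # ...observe how the guests and librbd clients behave...
    kill -CONT "$OSD_PID"   # let the OSD continue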

[ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic
Hi all, hope that somebody can help me. I have a home ceph installation. After a power failure (it can happen in a datacenter as well) my ceph cluster booted in an inconsistent state. I was backfilling data onto one new disk during the power failure. The first time it booted without some OSDs, but I fixed that. Now I
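
Not specific to this report, but the usual first-look commands after an unclean shutdown would be something like:

    ceph -s                       # overall cluster and PG state
    ceph health detail            # which PGs are down, stale or inconsistent
    ceph osd tree                 # which OSDs did not come back up
    ceph pg dump_stuck unclean    # PGs stuck in peering or backfill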

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic
that pg data from OSDs? In the OSD logs I can see that backfilling is continuing etc., so either they have correct information or they are still running operations from before the power failure. With regards Jan Pekar On 11.12.2017 19:07, Jan Pekař - Imatic wrote: Hi all, hope that somebody can help me. I have

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic
Dec 11, 2017 at 1:08 PM Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Hi all, hope that somebody can help me. I have a home ceph installation. After a power failure (it can happen in a datacenter as well) my ceph cluster booted in an inconsistent state.

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic
at setting up an mgr daemon. On Mon, Dec 11, 2017, 2:07 PM Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Hi, thank you for the response. I started the MDS manually and accessed CephFS; I'm not running mgr yet, it is not necessary. I just responded
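
For reference, starting the mgr (and mds) daemons under systemd on a Luminous host is roughly the following; the instance id node1 is a placeholder. Note that without a running mgr, Luminous stops updating PG statistics in ceph -s.

    systemctl start ceph-mgr@node1
    systemctl start ceph-mds@node1
    ceph -s    # "mgr: node1(active)" should appear and PG stats should update again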

[ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread Jan Pekař - Imatic
Hi, I ran into an overloaded cluster (deep-scrub running) for a few seconds and the rbd-nbd client timed out, and the device became unavailable. block nbd0: Connection timed out block nbd0: shutting down sockets block nbd0: Connection timed out print_req_error: I/O error, dev nbd0, sector 2131833856
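
One mitigation sketch, assuming a reasonably recent rbd-nbd (check rbd-nbd --help for the exact option on your build); the pool and image names are placeholders:

    # Map with a longer NBD request timeout so short cluster stalls don't
    # invalidate the device.
    rbd-nbd map --timeout 120 rbd/backup-image
    # ...
    rbd-nbd unmap /dev/nbd0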

Re: [ceph-users] rbd-nbd timeout and crash

2017-12-06 Thread Jan Pekař - Imatic
Hi, On 6.12.2017 15:24, Jason Dillaman wrote: On Wed, Dec 6, 2017 at 3:46 AM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Hi, I ran into an overloaded cluster (deep-scrub running) for a few seconds and the rbd-nbd client timed out, and the device became unavailable. block nbd0: Connection timed out

Re: [ceph-users] RBD corruption when removing tier cache

2017-12-02 Thread Jan Pekař - Imatic
ways to flush all objects (like turning off VMs, or setting a short evict time or a small target size) and removing the overlay after that. With regards Jan Pekar On 1.12.2017 03:43, Jan Pekař - Imatic wrote: Hi all, today I tested adding an SSD cache tier to a pool. Everything worked, but when I tried to remove it and ran rados
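
A sketch of forcing a cache tier to drain before removing the overlay (pool name is a placeholder, values are only illustrative):

    ceph osd pool set hot-pool cache_min_flush_age 0   # flush dirty objects immediately
    ceph osd pool set hot-pool cache_min_evict_age 0   # evict clean objects immediately
    ceph osd pool set hot-pool target_max_objects 1    # push the tiering agent to evict
    rados -p hot-pool cache-flush-evict-all
    rados -p hot-pool ls | wc -l                       # should reach 0 before remove-overlay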

[ceph-users] Problem with OSD down and problematic rbd object

2018-01-05 Thread Jan Pekař - Imatic
Hi all, yesterday I got an OSD down with the error: 2018-01-04 06:47:25.304513 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects 2018-01-04 06:47:25.312861 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 3 errors, 2 fixed 2018-01-04
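
A sketch of inspecting the reported PG before (and instead of blindly) repairing it; the pg id 6.20 is taken from the log above:

    rados list-inconsistent-obj 6.20 --format=json-pretty   # which objects/shards differ
    ceph pg 6.20 query                                      # peering and recovery state
    ceph pg repair 6.20                                     # only once the error is understood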

Re: [ceph-users] rbd-nbd timeout and crash

2018-01-04 Thread Jan Pekař - Imatic
regards Jan Pekar On 6.12.2017 23:58, David Turner wrote: Do you have the FS mounted with a trimming ability? What are your mount options? On Wed, Dec 6, 2017 at 5:30 PM Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Hi, On 6.12.2017 15:24
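
A small sketch of checking the mount options and trimming in question (device and mountpoint are placeholders):

    findmnt -o TARGET,SOURCE,OPTIONS /mnt/backup   # is it mounted with "discard"?
    fstrim -v /mnt/backup                          # or trim periodically instead of online discard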

Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2018-02-26 Thread Jan Pekař - Imatic
I think I hit the same issue. I have corrupted data on CephFS and I don't remember the same issue before Luminous (I did the same tests before). It is on my test 1-node cluster with less memory than recommended (so the server is swapping), but it shouldn't lose data (it never did before). So

Re: [ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

2018-03-07 Thread Jan Pekař - Imatic
On 6.3.2018 22:28, Gregory Farnum wrote: On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: Hi all, I have a few problems on my cluster that are maybe linked together and have now caused an OSD to go down during pg repair.

Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2018-03-03 Thread Jan Pekař - Imatic
and let you know. With regards Jan Pekar On 28.2.2018 15:14, David C wrote: On 27 Feb 2018 06:46, "Jan Pekař - Imatic" <jan.pe...@imatic.cz> wrote: I think I hit the same issue. I have corrupted data on CephFS and I don't remember

Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2018-03-03 Thread Jan Pekař - Imatic
On 3.3.2018 11:12, Yan, Zheng wrote: On Tue, Feb 27, 2018 at 2:29 PM, Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote: I think I hit the same issue. I have corrupted data on CephFS and I don't remember the same issue before Luminous (I did the same tests before). It is on my test 1-node cluster

[ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

2018-03-03 Thread Jan Pekař - Imatic
Hi all, I have a few problems on my cluster that are maybe linked together and have now caused an OSD to go down during pg repair. First, a few notes about my cluster: 4 nodes, 15 OSDs, installed on Luminous (no upgrade). Replicated pools, with 1 pool (pool 6) cached by SSD disks. I don't detect any hardware

Re: [ceph-users] Unfound object on erasure when recovering

2018-10-04 Thread Jan Pekař - Imatic
problem appeared before trying to re-balance my cluster and was invisible to me. But it never happened before, and scrub and deep-scrub are running regularly. I don't know where to continue with debugging this problem. JP On 3.10.2018 08:47, Jan Pekař - Imatic wrote: Hi all, I'm playing with my testing cluster

[ceph-users] Unfound object on erasure when recovering

2018-10-03 Thread Jan Pekař - Imatic
Hi all, I'm playing with my testing cluster with ceph 12.2.8 installed. It has happened to me for the second time that I have 1 unfound object on an erasure-coded pool. I have an erasure profile with a 3+1 configuration. The first time, I was adding an additional disk. During cluster rebalance I noticed one unfound
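
For context, a sketch of how an unfound object is usually located and, as a last resort, resolved; the pg id 6.12 is a placeholder, and on erasure-coded pools only the delete option applies (revert is not available there):

    ceph health detail                       # lists the PGs with unfound objects
    ceph pg 6.12 query                       # which OSDs were probed or might still hold it
    # Only if the data is truly unrecoverable:
    ceph pg 6.12 mark_unfound_lost delete    # "revert" on replicated pools, "delete" on EC pools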

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-06-07 Thread Jan Pekař - Imatic
and was not successful, or the monitors did not react correctly in this situation and didn't complete the key exchange with the OSDs. After the system disk replacement on the problematic mon, the verify_authorizer problem was not in the log anymore. With regards Jan Pekar On 01/05/2019 13.58, Jan Pekař - Imatic wrote: Today the problem reappeared

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-05-01 Thread Jan Pekař - Imatic
Today the problem reappeared. Restarting the mon helps, but it is not solving the issue. Is there any way to debug that? Can I dump these keys from the MON, the OSDs or other components? Can I debug the key exchange? Thank you On 27/04/2019 10.56, Jan Pekař - Imatic wrote: On 26/04/2019 21.50, Gregory
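
A sketch of what such debugging could look like: compare the key stored on the monitors with the key the OSD actually uses, and raise cephx logging; osd.3 and its data path are placeholders:

    ceph auth get osd.3                        # key as registered with the monitors
    cat /var/lib/ceph/osd/ceph-3/keyring       # key the OSD presents
    # On the OSD host, turn up auth logging through the admin socket:
    ceph daemon osd.3 config set debug_auth 20/20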

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-27 Thread Jan Pekař - Imatic
On 26/04/2019 21.50, Gregory Farnum wrote: On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic wrote: Hi, yesterday my cluster reported slow requests for minutes, and after restarting the OSDs (the ones reporting slow requests) it got stuck with peering PGs. The whole cluster was not responding and IO stopped. I

[ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Jan Pekař - Imatic
Hi, yesterday my cluster reported slow requests for minutes, and after restarting the OSDs (the ones reporting slow requests) it got stuck with peering PGs. The whole cluster was not responding and IO stopped. I also noticed that the problem was with cephx - all OSDs were reporting the same (even the same number of