Re: [ceph-users] How to remove a faulty bucket?

2017-12-11 Thread Martin Emrich
Hi!

There were originally about 110k objects, but after several bucket check 
attempts they "multiplied" and we're at 330k now (I assume each attempt took the 
original objects, tried to create a new index, crashed, and left both the old 
and the new "entries", or whatever they are, behind).

Could it relate to bucket versioning being enabled? (I already suspended it, 
but nothing changed).

I use bluestore on Ceph 12.2.2, cluster health is OK. I'll trigger a deep-scrub 
on all SSD OSDs (carrying the index pool) to be sure.
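
For reference, the loop I have in mind is something like this (the OSD IDs are 
placeholders for whichever SSD OSDs back the index pool):

# rough sketch: kick off a deep scrub on the index-pool OSDs
for id in 10 11 12 13; do
    ceph osd deep-scrub "$id"
done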

Thanks,

Martin


On 11.12.17, 16:55, "Robin H. Johnson" wrote:

On Mon, Dec 11, 2017 at 09:29:11AM +, Martin Emrich wrote:
> 
> Yes indeed. Running "radosgw-admin bi list" results in an incomplete 
300MB JSON file, before it freezes.
That's a very good starting point to debug.
The bucket index is stored inside the omap area of a raw RADOS object
(on a FileStore OSD that lives in LevelDB). I wonder if you have
corruption or something else awry.
How many objects were in this bucket? The number from 'bucket stats' is
a good starting point.
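
Something along these lines will show it (the bucket name is a placeholder):

# rough sketch: what does the bucket index think it contains?
radosgw-admin bucket stats --bucket=mybucket | grep num_objects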

Newer versions of Jewel do report OMAP inconsistency after deep-scrub, so
that would be a help in your case too.
   


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Calamari ( what a nightmare !!! )

2017-12-11 Thread David
Hi!

I think Calamari is more or less deprecated now that Ceph Luminous is out with 
the Ceph Manager and its dashboard plugin:

http://docs.ceph.com/docs/master/mgr/dashboard/
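
Enabling it is a one-liner on Luminous; a minimal sketch (the default port 7000 
is assumed):

# rough sketch: turn on the mgr dashboard plugin
ceph mgr module enable dashboard
# then point a browser at http://<active-mgr-host>:7000/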

You could also try out:

https://www.openattic.org/ 

or if you want to start a whole new cluster without needing to know how to 
operate it ;)

https://croit.io/ 

The latter isn't open sourced yet as far as I know.

Kind Regards,

David


> On 12 Dec 2017, at 02:18, DHD.KOHA wrote:
> 
> Hello list,
> 
> Newbie here,
> 
> After managing to install ceph, with all possible ways that I could manage  
> on 4 nodes, 4 osd and 3 monitors , with ceph-deploy and latter with 
> ceph-ansible, I thought to to give a try to install CALAMARI on UBUNTU 14.04 
> ( another separate server being not a node or anything in a cluster ).
> 
> After all the mess of salt 2014.7.5 and different UBUNTU's since I am 
> installing nodes on xenial but CALAMARI on trusty while the calamari packages 
> on node come from download.ceph.com and trusty, I ended up having a server 
> that refuses to gather anything from anyplace at all.
> 
> 
> # salt '*' ceph.get_heartbeats
> c1.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> c2.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> c3.zz.prv:
>The minion function caused an exception: Traceback (most recent call last):
>  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
> _thread_return
>return_data = func(*args, **kwargs)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
> get_heartbeats
>service_data = service_status(filename)
>  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
> service_status
>fsid = json.loads(admin_socket(socket_path, ['status'], 
> 'json'))['cluster_fsid']
>KeyError: 'cluster_fsid'
> 
> which means obviously that I am doing something WRONG and I have no IDEA what 
> is it.
> 
> Given the fact that documentation on the matter is very poor to limited,
> 
> Is there anybody out-there with some clues or hints that is willing to share ?
> 
> Regards,
> 
> Harry.
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Calamari ( what a nightmare !!! )

2017-12-11 Thread DHD.KOHA

Hello list,

Newbie here,

After managing to install Ceph in every way I could manage (on 4 nodes, with 
4 OSDs and 3 monitors, first with ceph-deploy and later with ceph-ansible), 
I thought I would give installing CALAMARI a try on Ubuntu 14.04 (on a 
separate server that is not a node or anything else in the cluster).


After all the mess of salt 2014.7.5 and mixed Ubuntu releases (I am 
installing the nodes on xenial but CALAMARI on trusty, while the calamari 
packages on the nodes come from download.ceph.com for trusty), I ended up 
with a server that refuses to gather anything from anywhere at all.



# salt '*' ceph.get_heartbeats
c1.zz.prv:
The minion function caused an exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
_thread_return
return_data = func(*args, **kwargs)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
get_heartbeats
service_data = service_status(filename)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
service_status
fsid = json.loads(admin_socket(socket_path, ['status'], 
'json'))['cluster_fsid']
KeyError: 'cluster_fsid'
c2.zz.prv:
The minion function caused an exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
_thread_return
return_data = func(*args, **kwargs)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
get_heartbeats
service_data = service_status(filename)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
service_status
fsid = json.loads(admin_socket(socket_path, ['status'], 
'json'))['cluster_fsid']
KeyError: 'cluster_fsid'
c3.zz.prv:
The minion function caused an exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1020, in 
_thread_return
return_data = func(*args, **kwargs)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 467, in 
get_heartbeats
service_data = service_status(filename)
  File "/var/cache/salt/minion/extmods/modules/ceph.py", line 526, in 
service_status
fsid = json.loads(admin_socket(socket_path, ['status'], 
'json'))['cluster_fsid']
KeyError: 'cluster_fsid'

which obviously means that I am doing something WRONG, and I have no IDEA 
what it is.
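
In case it helps, the salt module appears to be reading the daemons' admin 
sockets, so checking one by hand on a minion should show whether 'cluster_fsid' 
is really missing from the 'status' output; a rough sketch (the socket path and 
OSD id are placeholders for whatever exists on that host):

# rough sketch: query an admin socket directly on one of the minions
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status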


Given that documentation on the matter ranges from poor to limited,

is there anybody out there with some clues or hints they are willing to 
share?


Regards,

Harry.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommendations for I/O (blk-mq) scheduler for HDDs and SSDs?

2017-12-11 Thread German Anders
Hi Patrick,

Some thoughts about blk-mq:

*(virtio-blk)*

   - it's activated by default for the virtio-blk driver on kernels >= 3.13

   - *The blk-mq feature is currently implemented, and enabled by default,
   in the following drivers: virtio-blk, mtip32xx, nvme, and rbd*.
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

   - can be checked with "cat /sys/block/vda/queue/scheduler", it appears
   as none

   - https://serverfault.com/questions/693348/what-does-it-mean-when-linux-has-no-i-o-scheduler

*hosts local disk (scsi-mq)*

   - with "sda" disks (SCSI), whether rotational or SSD, it's NOT activated
   by default on Ubuntu (cat /sys/block/sda/queue/scheduler)

   - Canonical deactivated it in >= 3.18:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1397061

   - SUSE says it does not suit rotational SCSI disks, but it's OK for SSDs:
   https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.io.html#cha.tuning.io.scsimq


   - Red Hat says: "*The scsi-mq feature is provided as a Technology Preview
   in Red Hat Enterprise Linux 7.2. To enable scsi-mq, specify
   scsi_mod.use_blk_mq=y on the kernel command line. The default value is n
   (disabled).*"
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

   - how to change it: edit /etc/default/grub and add scsi_mod.use_blk_mq=1 to
   GRUB_CMDLINE_LINUX, then run update-grub and reboot (see the sketch below)
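
A rough sketch of that change (Ubuntu-style paths assumed; verify after the
reboot):

# rough sketch: enable scsi-mq via the kernel command line
# edit /etc/default/grub so it contains:
#   GRUB_CMDLINE_LINUX="scsi_mod.use_blk_mq=1"
sudo update-grub && sudo reboot
# after the reboot, confirm it took effect:
cat /sys/module/scsi_mod/parameters/use_blk_mq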

*ceph (rbd)*

   - it's activated by default: *The blk-mq feature is currently
   implemented, and enabled by default, in the following drivers: virtio-blk,
   mtip32xx, nvme, and rbd*.
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

*multipath (device mapper; dm / dm-mpath)*

   - how to change it: dm_mod.use_blk_mq=y

   - deactivated by default, how to verify: *To determine whether DM
   multipath is using blk-mq on a system, cat the file
   /sys/block/dm-X/dm/use_blk_mq, where dm-X is replaced by the DM multipath
   device of interest. This file is read-only and reflects what the global
   value in /sys/module/dm_mod/parameters/use_blk_mq was at the time the
   request-based DM multipath device was created*. (
   https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage )

   - I thought it would not make any sense, since iSCSI is by definition
   (being network-bound) much slower than the SSD/NVMe devices blk-mq was
   designed for, but: *It may be beneficial to set dm_mod.use_blk_mq=y if the
   underlying SCSI devices are also using blk-mq, as doing so reduces locking
   overhead at the DM layer*. (Red Hat)


observations

   - WARNING, low performance: https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

   - request-based device mapper targets planned for 4.1

   - as of >= 4.12, Linux comes with BFQ, a scheduler based on blk-mq

We tried several schedulers in our environment but we didn't notice an
improvement significant enough to justify a global change across the whole
environment. But the best thing is to change/test/document and repeat again
and again :)
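
For anyone who wants to repeat the test, switching at runtime is enough; a
rough sketch (the device name is a placeholder):

# rough sketch: inspect and switch the active scheduler for one device
cat /sys/block/sda/queue/scheduler            # the scheduler in [brackets] is active
echo bfq | sudo tee /sys/block/sda/queue/scheduler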

Hope it helps

Best,



*German*

2017-12-11 18:17 GMT-03:00 Patrick Fruh :

> Hi,
>
>
>
> after reading a lot about I/O schedulers and performance gains with
> blk-mq, I switched to a custom 4.14.5 kernel with  CONFIG_SCSI_MQ_DEFAULT
> enabled to have blk-mq for all devices on my cluster.
>
>
>
> This allows me to use the following schedulers for HDDs and SSDs:
>
> mq-deadline, kyber, bfq, none
>
>
>
> I’ve currently set the HDD scheduler to bfq and the SSD scheduler to none,
> however I’m still not sure if this is the best solution performance-wise.
>
> Does anyone have more experience with this and can maybe give me a
> recommendation? I’m not even sure if blk-mq is a good idea for ceph, since
> I haven’t really found anything on the topic.
>
>
>
> Best,
>
> Patrick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list

[ceph-users] Recommendations for I/O (blk-mq) scheduler for HDDs and SSDs?

2017-12-11 Thread Patrick Fruh
Hi,

after reading a lot about I/O schedulers and performance gains with blk-mq, I 
switched to a custom 4.14.5 kernel with  CONFIG_SCSI_MQ_DEFAULT enabled to have 
blk-mq for all devices on my cluster.

This allows me to use the following schedulers for HDDs and SSDs:
mq-deadline, kyber, bfq, none

I've currently set the HDD scheduler to bfq and the SSD scheduler to none, 
however I'm still not sure if this is the best solution performance-wise.
Does anyone have more experience with this and can maybe give me a 
recommendation? I'm not even sure if blk-mq is a good idea for ceph, since I 
haven't really found anything on the topic.
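
For reference, a small boot-time loop (run as root, e.g. from a oneshot unit or
rc.local) is one way to pin those choices; this is only a sketch of the idea:

# rough sketch: bfq for rotational devices, none for SSDs
for dev in /sys/block/sd*; do
    if [ "$(cat $dev/queue/rotational)" = "1" ]; then
        echo bfq > $dev/queue/scheduler
    else
        echo none > $dev/queue/scheduler
    fi
done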

Best,
Patrick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Willem Jan Withagen

On 11/12/2017 15:13, Tobias Prousa wrote:

Hi there,

I'm running a Ceph cluster for some libvirt VMs and a CephFS providing 
/home to ~20 desktop machines. There are 4 hosts running 4 MONs, 4 MGRs, 
3 MDSs (1 active, 2 standby) and 28 OSDs in total. This cluster has been up 
and running since the days of Bobtail (yes, including CephFS).


You might consider shutting down 1 MON, since MONs should run in odd 
numbers, and for your cluster 3 is more than sufficient.


For the reasons why, read either the Ceph docs or search this mailing list.

It probably doesn't help with your current problem, but it could help you 
prevent a split-brain situation in the future.
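
Shrinking from 4 to 3 is straightforward; a rough sketch, where "mon4" stands 
for whichever monitor you retire:

# rough sketch: retire one monitor (the name is a placeholder)
systemctl stop ceph-mon@mon4
ceph mon remove mon4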


--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic
You were right. I thought that the mgr was just an optional component. I fixed 
the mgr startup and after that my

ceph -s
responded with the correct state and
ceph pg dump
returned correct output immediately.

Thank you for the help, everything looks good. For the other issues I noticed, 
I will create a new thread if necessary.


With regards
Jan Pekar

On 11.12.2017 20:10, David Turner wrote:
If you're running Luminous 12.x.x then the mgr daemon is responsible for 
the output of most commands that query the cluster. If you're having 
problems with commands not returning when you're querying the cluster, 
look at setting up an mgr daemon.



On Mon, Dec 11, 2017, 2:07 PM Jan Pekař - Imatic wrote:


Hi,

thank you for the response. I started the mds manually and accessed cephfs; I'm
not running a mgr yet, it is not necessary.
I just responded to the mailing list. It looks like the dump from ceph is
incorrect and the cluster is "working somehow". So the problem is something
other than my mgr or mds not running.

With regards
Jan Pekar

On 11.12.2017 19:42, David Turner wrote:
 > It honestly just looks like your MDS and MGR daemons are not
configured
 > to start automatically.  Try starting them manually and then if that
 > fixes the things, go through and enable them to start automatically.
 > Assuming you use systemctl the commands to check and fix this
would be
 > something like these.  The first one will show you all of the things
 > started with ceph.target.
 >
 > sudo systemctl list-dependencies ceph.target
 > sudo systemctl enable ceph-mgr@servername
 > sudo systemctl enable ceph-mds@servername
 >
 > On Mon, Dec 11, 2017 at 1:08 PM Jan Pekař - Imatic

 > >> wrote:
 >
 >     Hi all,
 >
 >     hope that somebody can help me. I have home ceph installation.
 >     After power failure (it can happen in datacenter also) my
ceph booted in
 >     non-consistent state.
 >
 >     I was backfilling data on one new disk during power failure.
First time
 >     it booted without some OSDs, but I fixed that. Now I have all
my OSD's
 >     running, but cluster state looks like this after some time. :
 >
 >
 >         cluster:
 >           id:     2d9bf17f-3d50-4a59-8359-abc8328fe801
 >           health: HEALTH_WARN
 >                   1 filesystem is degraded
 >                   1 filesystem has a failed mds daemon
 >                   noout,nodeep-scrub flag(s) set
 >                   no active mgr
 >                   317162/12520262 objects misplaced (2.533%)
 >                   Reduced data availability: 52 pgs inactive, 29
pgs down, 1
 >     pg peering, 1 pg stale
 >                   Degraded data redundancy: 2099528/12520262 objects
 >     degraded
 >     (16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized
 >                   1/3 mons down, quorum imatic-mce-2,imatic-mce
 >
 >         services:
 >           mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of
quorum:
 >     obyvak
 >           mgr: no daemons active
 >           mds: cephfs-0/1/1 up , 1 failed
 >           osd: 8 osds: 8 up, 8 in; 61 remapped pgs
 >                flags noout,nodeep-scrub
 >
 >         data:
 >           pools:   8 pools, 896 pgs
 >           objects: 4446k objects, 9119 GB
 >           usage:   9698 GB used, 2290 GB / 11988 GB avail
 >           pgs:     2.455% pgs unknown
 >                    3.348% pgs not active
 >                    2099528/12520262 objects degraded (16.769%)
 >                    317162/12520262 objects misplaced (2.533%)
 >                    371 stale+active+clean
 >                    183 active+undersized+degraded
 >                    154 stale+active+undersized+degraded
 >                    85  active+clean
 >                    22  unknown
 >                    19  stale+down
 >                    14
 >     stale+active+undersized+degraded+remapped+backfill_wait
 >                    13 
active+undersized+degraded+remapped+backfill_wait

 >                    10  down
 >                    6   active+clean+remapped
 >                    6   stale+active+clean+remapped
 >                    5   stale+active+remapped+backfill_wait
 >                    2   active+remapped+backfill_wait
 >                    2 
  stale+active+undersized+degraded+remapped+backfilling

 >                    1   active+undersized+degraded+remapped
 >                    1 
  active+undersized+degraded+remapped+backfilling

 >                    1   stale+peering
 >                    1   stale+active+clean+scrubbing
 >
  

[ceph-users] High Load and High Apply Latency

2017-12-11 Thread John Petrini
Hi List,

I've got a 5 OSD node cluster running hammer. All of the OSD servers are
identical but one has about 3-4x higher load than the others and the OSD's
in this node are reporting high apply latency.

The cause of the load appears to be the OSD processes. About half of the
OSD processes are using between 100-185% CPU, keeping the proc
pegged at around 85% utilization overall. In comparison, other servers in the
cluster are sitting around 30% CPU utilization and are reporting ~1.5ms of
apply latency.

A few days ago I restarted the OSD processes and the problem went away but
now three days later it has returned. I don't see anything in the logs and
there's no iowait on the disks.

Anyone have any ideas on how I can troubleshoot this further?
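
For context, the apply-latency numbers above are the per-OSD figures reported
by something like the following (osd.12 is just a placeholder for one of the
busy daemons):

# rough sketch: per-OSD latency, plus a closer look at one busy daemon
ceph osd perf
ceph daemon osd.12 perf dump | less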

Thank You,

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread David Turner
If you're running Luminous 12.x.x then the mgr daemon is responsible for
the output of most commands that query the cluster. If you're having
problems with commands not returning when you're querying the cluster, look
at setting up an mgr daemon.
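
A minimal sketch of bringing one up by hand (the name "node1" is a placeholder
for the host you run it on):

# rough sketch: create a keyring for the new mgr and start it
mkdir -p /var/lib/ceph/mgr/ceph-node1
ceph auth get-or-create mgr.node1 mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o /var/lib/ceph/mgr/ceph-node1/keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-node1   # if the daemons run as the ceph user
systemctl enable --now ceph-mgr@node1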

On Mon, Dec 11, 2017, 2:07 PM Jan Pekař - Imatic 
wrote:

> Hi,
>
> thank you for response. I started mds manually and accessed cephfs, I'm
> not running mgr yet, it is not necessary.
> I just responded to mailing list. It looks, that dump from ceph is
> incorrect and cluster is "working somehow". So problem is different,
> that my mgr or mds is not running.
>
> With regards
> Jan Pekar
>
> On 11.12.2017 19:42, David Turner wrote:
> > It honestly just looks like your MDS and MGR daemons are not configured
> > to start automatically.  Try starting them manually and then if that
> > fixes the things, go through and enable them to start automatically.
> > Assuming you use systemctl the commands to check and fix this would be
> > something like these.  The first one will show you all of the things
> > started with ceph.target.
> >
> > sudo systemctl list-dependencies ceph.target
> > sudo systemctl enable ceph-mgr@servername
> > sudo systemctl enable ceph-mds@servername
> >
> > On Mon, Dec 11, 2017 at 1:08 PM Jan Pekař - Imatic  > > wrote:
> >
> > Hi all,
> >
> > hope that somebody can help me. I have home ceph installation.
> > After power failure (it can happen in datacenter also) my ceph
> booted in
> > non-consistent state.
> >
> > I was backfilling data on one new disk during power failure. First
> time
> > it booted without some OSDs, but I fixed that. Now I have all my
> OSD's
> > running, but cluster state looks like this after some time. :
> >
> >
> > cluster:
> >   id: 2d9bf17f-3d50-4a59-8359-abc8328fe801
> >   health: HEALTH_WARN
> >   1 filesystem is degraded
> >   1 filesystem has a failed mds daemon
> >   noout,nodeep-scrub flag(s) set
> >   no active mgr
> >   317162/12520262 objects misplaced (2.533%)
> >   Reduced data availability: 52 pgs inactive, 29 pgs
> down, 1
> > pg peering, 1 pg stale
> >   Degraded data redundancy: 2099528/12520262 objects
> > degraded
> > (16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized
> >   1/3 mons down, quorum imatic-mce-2,imatic-mce
> >
> > services:
> >   mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of quorum:
> > obyvak
> >   mgr: no daemons active
> >   mds: cephfs-0/1/1 up , 1 failed
> >   osd: 8 osds: 8 up, 8 in; 61 remapped pgs
> >flags noout,nodeep-scrub
> >
> > data:
> >   pools:   8 pools, 896 pgs
> >   objects: 4446k objects, 9119 GB
> >   usage:   9698 GB used, 2290 GB / 11988 GB avail
> >   pgs: 2.455% pgs unknown
> >3.348% pgs not active
> >2099528/12520262 objects degraded (16.769%)
> >317162/12520262 objects misplaced (2.533%)
> >371 stale+active+clean
> >183 active+undersized+degraded
> >154 stale+active+undersized+degraded
> >85  active+clean
> >22  unknown
> >19  stale+down
> >14
> > stale+active+undersized+degraded+remapped+backfill_wait
> >13  active+undersized+degraded+remapped+backfill_wait
> >10  down
> >6   active+clean+remapped
> >6   stale+active+clean+remapped
> >5   stale+active+remapped+backfill_wait
> >2   active+remapped+backfill_wait
> >2
>  stale+active+undersized+degraded+remapped+backfilling
> >1   active+undersized+degraded+remapped
> >1   active+undersized+degraded+remapped+backfilling
> >1   stale+peering
> >1   stale+active+clean+scrubbing
> >
> > There are all OSD's up and running. Before that I completed
> > ceph osd out
> > on one of my disk and removed that disk from cluster because I don't
> > want to use it anymore. It triggered crush reweight and started to
> > rebuild my date. I thinkg that should not put my data in danger even
> I
> > saw that some of my PG's were undersized (why?) - but it is not now
> the
> > think.
> >
> > When I try to do
> > ceph pg dump
> > I have no response.
> >
> > But ceph osd dump show weird number of osd's on temporary PG's like
> > number 2147483647 . I thing that there is some
> > problem in some mon or
> > other database and peering process cannot complete.
> >
> 

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic

Hi,

thank you for the response. I started the mds manually and accessed cephfs; I'm 
not running a mgr yet, it is not necessary.
I just responded to the mailing list. It looks like the dump from ceph is 
incorrect and the cluster is "working somehow". So the problem is something 
other than my mgr or mds not running.


With regards
Jan Pekar

On 11.12.2017 19:42, David Turner wrote:
It honestly just looks like your MDS and MGR daemons are not configured 
to start automatically.  Try starting them manually and then if that 
fixes the things, go through and enable them to start automatically.  
Assuming you use systemctl the commands to check and fix this would be 
something like these.  The first one will show you all of the things 
started with ceph.target.


sudo systemctl list-dependencies ceph.target
sudo systemctl enable ceph-mgr@servername
sudo systemctl enable ceph-mds@servername

On Mon, Dec 11, 2017 at 1:08 PM Jan Pekař - Imatic wrote:


Hi all,

hope that somebody can help me. I have a home Ceph installation.
After a power failure (it can happen in a datacenter as well) my Ceph booted
into an inconsistent state.

I was backfilling data onto one new disk during the power failure. The first time
it booted without some OSDs, but I fixed that. Now I have all my OSDs
running, but the cluster state looks like this after some time:


    cluster:
      id:     2d9bf17f-3d50-4a59-8359-abc8328fe801
      health: HEALTH_WARN
              1 filesystem is degraded
              1 filesystem has a failed mds daemon
              noout,nodeep-scrub flag(s) set
              no active mgr
              317162/12520262 objects misplaced (2.533%)
              Reduced data availability: 52 pgs inactive, 29 pgs down, 1
pg peering, 1 pg stale
              Degraded data redundancy: 2099528/12520262 objects
degraded
(16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized
              1/3 mons down, quorum imatic-mce-2,imatic-mce

    services:
      mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of quorum:
obyvak
      mgr: no daemons active
      mds: cephfs-0/1/1 up , 1 failed
      osd: 8 osds: 8 up, 8 in; 61 remapped pgs
           flags noout,nodeep-scrub

    data:
      pools:   8 pools, 896 pgs
      objects: 4446k objects, 9119 GB
      usage:   9698 GB used, 2290 GB / 11988 GB avail
      pgs:     2.455% pgs unknown
               3.348% pgs not active
               2099528/12520262 objects degraded (16.769%)
               317162/12520262 objects misplaced (2.533%)
               371 stale+active+clean
               183 active+undersized+degraded
               154 stale+active+undersized+degraded
               85  active+clean
               22  unknown
               19  stale+down
               14 
stale+active+undersized+degraded+remapped+backfill_wait

               13  active+undersized+degraded+remapped+backfill_wait
               10  down
               6   active+clean+remapped
               6   stale+active+clean+remapped
               5   stale+active+remapped+backfill_wait
               2   active+remapped+backfill_wait
               2   stale+active+undersized+degraded+remapped+backfilling
               1   active+undersized+degraded+remapped
               1   active+undersized+degraded+remapped+backfilling
               1   stale+peering
               1   stale+active+clean+scrubbing

All OSDs are up and running. Before that I completed
ceph osd out
on one of my disks and removed that disk from the cluster because I don't
want to use it anymore. That triggered a CRUSH reweight and started
rebuilding my data. I think that should not have put my data in danger, even
though I saw that some of my PGs were undersized (why?) - but that is not the
issue now.

When I try to do
ceph pg dump
I have no response.

But ceph osd dump shows a weird OSD number on temporary PGs, namely
2147483647. I think that there is some problem in some mon or
other database and the peering process cannot complete.

What can I do next? I trusted that cluster so much, and I have some data
I want to get back. Thank you very much for the help.

My ceph osd dump looks like this:


epoch 29442
fsid 2d9bf17f-3d50-4a59-8359-abc8328fe801
created 2014-12-10 23:00:49.140787
modified 2017-12-11 18:54:01.134091
flags noout,nodeep-scrub,sortbitwise,recovery_deletes
crush_version 14
full_ratio 0.97
backfillfull_ratio 0.91
nearfull_ratio 0.9
require_min_compat_client firefly
min_compat_client firefly
require_osd_release luminous
pool 0 'data' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic

After some research it looks like what is mainly broken is the output of

ceph -s

and

ceph pg dump
(no response from that command)

But I can access data on cephfs (at least the data I have tried so far).

So the question is: why is that status stuck, and how do I fix it? Is there 
some mon database to reset so that the PG data is refreshed from the OSDs?


In the OSD logs I can see that backfilling is continuing etc., so they have 
correct information, or they are resuming operations from before the 
power failure.


With regards
Jan Pekar

On 11.12.2017 19:07, Jan Pekař - Imatic wrote:

Hi all,

hope that somebody can help me. I have a home Ceph installation.
After a power failure (it can happen in a datacenter as well) my Ceph booted 
into an inconsistent state.


I was backfilling data onto one new disk during the power failure. The first time 
it booted without some OSDs, but I fixed that. Now I have all my OSDs 
running, but the cluster state looks like this after some time:



   cluster:
     id: 2d9bf17f-3d50-4a59-8359-abc8328fe801
     health: HEALTH_WARN
     1 filesystem is degraded
     1 filesystem has a failed mds daemon
     noout,nodeep-scrub flag(s) set
     no active mgr
     317162/12520262 objects misplaced (2.533%)
     Reduced data availability: 52 pgs inactive, 29 pgs down, 1 
pg peering, 1 pg stale
     Degraded data redundancy: 2099528/12520262 objects degraded 
(16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized

     1/3 mons down, quorum imatic-mce-2,imatic-mce

   services:
     mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of quorum: obyvak
     mgr: no daemons active
     mds: cephfs-0/1/1 up , 1 failed
     osd: 8 osds: 8 up, 8 in; 61 remapped pgs
  flags noout,nodeep-scrub

   data:
     pools:   8 pools, 896 pgs
     objects: 4446k objects, 9119 GB
     usage:   9698 GB used, 2290 GB / 11988 GB avail
     pgs: 2.455% pgs unknown
  3.348% pgs not active
  2099528/12520262 objects degraded (16.769%)
  317162/12520262 objects misplaced (2.533%)
  371 stale+active+clean
  183 active+undersized+degraded
  154 stale+active+undersized+degraded
  85  active+clean
  22  unknown
  19  stale+down
  14  stale+active+undersized+degraded+remapped+backfill_wait
  13  active+undersized+degraded+remapped+backfill_wait
  10  down
  6   active+clean+remapped
  6   stale+active+clean+remapped
  5   stale+active+remapped+backfill_wait
  2   active+remapped+backfill_wait
  2   stale+active+undersized+degraded+remapped+backfilling
  1   active+undersized+degraded+remapped
  1   active+undersized+degraded+remapped+backfilling
  1   stale+peering
  1   stale+active+clean+scrubbing

All OSDs are up and running. Before that I completed
ceph osd out
on one of my disks and removed that disk from the cluster because I don't 
want to use it anymore. That triggered a CRUSH reweight and started 
rebuilding my data. I think that should not have put my data in danger, even 
though I saw that some of my PGs were undersized (why?) - but that is not the 
issue now.


When I try to do
ceph pg dump
I have no response.

But ceph osd dump shows a weird OSD number on temporary PGs, namely 
2147483647. I think that there is some problem in some mon or 
other database and the peering process cannot complete.


What can I do next? I trusted that cluster so much, and I have some data 
I want to get back. Thank you very much for the help.


My ceph osd dump looks like this:


epoch 29442
fsid 2d9bf17f-3d50-4a59-8359-abc8328fe801
created 2014-12-10 23:00:49.140787
modified 2017-12-11 18:54:01.134091
flags noout,nodeep-scrub,sortbitwise,recovery_deletes
crush_version 14
full_ratio 0.97
backfillfull_ratio 0.91
nearfull_ratio 0.9
require_min_compat_client firefly
min_compat_client firefly
require_osd_release luminous
pool 0 'data' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool 
crash_replay_interval 45 min_read_recency_for_promote 1 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 1 'metadata' replicated size 3 min_size 1 crush_rule 1 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool 
min_read_recency_for_promote 1 min_write_recency_for_promote 1 
stripe_width 0 application cephfs
pool 2 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 28088 flags hashpspool 
min_read_recency_for_promote 1 min_write_recency_for_promote 1 
stripe_width 0 application rbd

     removed_snaps [1~5]
pool 3 'nonreplicated' replicated size 1 min_size 1 crush_rule 2 
object_hash rjenkins pg_num 192 pgp_num 192 last_change 27537 flags 
hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 
1 

Re: [ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread David Turner
It honestly just looks like your MDS and MGR daemons are not configured to
start automatically.  Try starting them manually, and if that fixes
things, go through and enable them to start automatically.  Assuming you
use systemctl the commands to check and fix this would be something like
these.  The first one will show you all of the things started with
ceph.target.

sudo systemctl list-dependencies ceph.target
sudo systemctl enable ceph-mgr@servername
sudo systemctl enable ceph-mds@servername

On Mon, Dec 11, 2017 at 1:08 PM Jan Pekař - Imatic 
wrote:

> Hi all,
>
> hope that somebody can help me. I have home ceph installation.
> After power failure (it can happen in datacenter also) my ceph booted in
> non-consistent state.
>
> I was backfilling data on one new disk during power failure. First time
> it booted without some OSDs, but I fixed that. Now I have all my OSD's
> running, but cluster state looks like this after some time. :
>
>
>cluster:
>  id: 2d9bf17f-3d50-4a59-8359-abc8328fe801
>  health: HEALTH_WARN
>  1 filesystem is degraded
>  1 filesystem has a failed mds daemon
>  noout,nodeep-scrub flag(s) set
>  no active mgr
>  317162/12520262 objects misplaced (2.533%)
>  Reduced data availability: 52 pgs inactive, 29 pgs down, 1
> pg peering, 1 pg stale
>  Degraded data redundancy: 2099528/12520262 objects degraded
> (16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized
>  1/3 mons down, quorum imatic-mce-2,imatic-mce
>
>services:
>  mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of quorum: obyvak
>  mgr: no daemons active
>  mds: cephfs-0/1/1 up , 1 failed
>  osd: 8 osds: 8 up, 8 in; 61 remapped pgs
>   flags noout,nodeep-scrub
>
>data:
>  pools:   8 pools, 896 pgs
>  objects: 4446k objects, 9119 GB
>  usage:   9698 GB used, 2290 GB / 11988 GB avail
>  pgs: 2.455% pgs unknown
>   3.348% pgs not active
>   2099528/12520262 objects degraded (16.769%)
>   317162/12520262 objects misplaced (2.533%)
>   371 stale+active+clean
>   183 active+undersized+degraded
>   154 stale+active+undersized+degraded
>   85  active+clean
>   22  unknown
>   19  stale+down
>   14  stale+active+undersized+degraded+remapped+backfill_wait
>   13  active+undersized+degraded+remapped+backfill_wait
>   10  down
>   6   active+clean+remapped
>   6   stale+active+clean+remapped
>   5   stale+active+remapped+backfill_wait
>   2   active+remapped+backfill_wait
>   2   stale+active+undersized+degraded+remapped+backfilling
>   1   active+undersized+degraded+remapped
>   1   active+undersized+degraded+remapped+backfilling
>   1   stale+peering
>   1   stale+active+clean+scrubbing
>
> There are all OSD's up and running. Before that I completed
> ceph osd out
> on one of my disk and removed that disk from cluster because I don't
> want to use it anymore. It triggered crush reweight and started to
> rebuild my date. I thinkg that should not put my data in danger even I
> saw that some of my PG's were undersized (why?) - but it is not now the
> think.
>
> When I try to do
> ceph pg dump
> I have no response.
>
> But ceph osd dump show weird number of osd's on temporary PG's like
> number 2147483647 <(214)%20748-3647>. I thing that there is some problem
> in some mon or
> other database and peering process cannot complete.
>
> What can I do next? I believed that cluster so much, so I have some data
> I want back. Thank you very much. for help.
>
> My ceph osd dump looks like this:
>
>
> epoch 29442
> fsid 2d9bf17f-3d50-4a59-8359-abc8328fe801
> created 2014-12-10 23:00:49.140787
> modified 2017-12-11 18:54:01.134091
> flags noout,nodeep-scrub,sortbitwise,recovery_deletes
> crush_version 14
> full_ratio 0.97
> backfillfull_ratio 0.91
> nearfull_ratio 0.9
> require_min_compat_client firefly
> min_compat_client firefly
> require_osd_release luminous
> pool 0 'data' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool
> crash_replay_interval 45 min_read_recency_for_promote 1
> min_write_recency_for_promote 1 stripe_width 0 application cephfs
> pool 1 'metadata' replicated size 3 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool
> min_read_recency_for_promote 1 min_write_recency_for_promote 1
> stripe_width 0 application cephfs
> pool 2 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 28088 flags hashpspool
> min_read_recency_for_promote 1 min_write_recency_for_promote 1
> stripe_width 0 application rbd
>  

Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Tobias Prousa

Hi Zheng,

I think I managed to understand what you suggested I do. The highest 
inode which was erroneously reported as free was almost exactly 
identical to the highest inode in the output of "cephfs-table-tool 
all list inode". So I used take_inos as you suggested, with a max_ino 
value slightly higher than that. Now the MDS runs and I started an online MDS 
scrub.


What confused me in the beginning was that inode numbers in the list output 
are given in hex, but when using take_inos the value has to be specified in 
decimal. I had to study the source code to figure that out...
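
In other words, the gist of it was something like this (both the inode number 
and the mds name are placeholders):

# rough sketch of the steps described above
cephfs-table-tool all take_inos 1200000            # decimal, slightly above the highest used inode
ceph daemon mds.a scrub_path / recursive repair    # kick off the online forward scrub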


Tomorrow morning I'll see whether things have become stable again.

Once again, thank you very much for your support. I will report back to 
the ML when I have news.


Best Regards,
Tobi




On 12/11/2017 05:19 PM, Tobias Prousa wrote:

Hi Zheng,

I did some more tests with cephfs-table-tool. I realized that disaster 
recovery implies to possibly reset inode table completely besides 
doing a session reset using something like


cephfs-table-tool all reset inode

Would that be close to what you suggested? Is it safe to reset 
complete inode table or will that wipe my file system?


Btw. cephfs-table-tool show reset inode gives me ~400k inodes, part of 
them in section 'free', part of them in section 'projected_free'.


Thanks,
Tobi





On 12/11/2017 04:28 PM, Yan, Zheng wrote:
On Mon, Dec 11, 2017 at 11:17 PM, Tobias Prousa 
 wrote:
These are essentially the first commands I did execute, in this 
exact order.

Additionally I did a:

ceph fs reset cephfs --yes-i-really-mean-it


how many active mds were there before the upgrading.

Any hint on how to find max inode number and do I understand that I 
should
remove every free-marked inode number that is there except the 
biggest one

which has to stay?

If you are not sure, you can just try removing 1 inode numbers
from inodetale


How to remove those inodes using cephfs-table-tool?


using cephfs-table-tool take_inos 




--
---
Dipl.-Inf. (FH) Tobias Prousa
Head of Data Logger Development

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Limited liability company (GmbH)
Registered office: Olching
Commercial register: Amtsgericht München, HRB 183929
Managing directors: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] public/cluster network

2017-12-11 Thread David C
Hi Roman

Whilst you can define multiple subnets in the public network directive, the
MONs still only bind to a single IP. Your clients need to be able to route
to that IP. From what you're saying, 172.x.x.x/24 is an isolated network,
so a client on the 10.x.x.x network is not going to be able to access the
cluster.
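
In other words, for this layout the [global] section would look more like the 
sketch below, with the routed subnet as the public network and the rack-local 
one reserved for replication traffic (subnets as described in the thread):

[global]
public network  = 10.x.y.z/22
cluster network = 172.x.y.z/24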

On Mon, Dec 11, 2017 at 9:15 AM, Roman  wrote:

> Hi all,
>
> We would like to implement the following setup.
> Our cloud nodes (CNs) for virtual machines  have two 10 Gbps NICs:
> 10.x.y.z/22 (routed through the backbone) and 172.x.y.z/24 (available only
> on servers within single rack). CNs and ceph nodes are in the same rack.
> Ceph nodes have two 10 Gpbs NICs in the same networks. We are going to use
> 172.x.y.z/24 as ceph cluster network for all ceph components' traffic (i.e.
> OSD/MGR/MON/MDS). But apart from that we are thinking about to use the same
> cluster network for CNs interactions with ceph nodes (since it is expected
> the network within single switch in rack to be much faster then the routed
> via backbone one).
> So 172.x.y.z/24 is for the following: pure ceph traffic, CNs <=> ceph
> nodes; 10.x.y.z/22 is for the rest type of ceph clients like VMs with
> mounted cephfs shares (since VMs  doesn't have access to 172.x.y.z/24 net).
> So I wonder if it's possible to implement something like the following:
> always use 172.x.y.z/24 if it is availabe on both source and destination
> otherwise use 10.x.y.z/22.
> We have just tried to specify the following in ceph.conf:
> cluster network = 172.x.y.z/24
> public network = 172.x.y.z/24, 10.x.y.z/22
>
> But it doesn't seem to work.
> There is an entry in Redhat Knowledgebase portal [1] called "Ceph Multiple
> public networks" but there is not solution provided yet.
>
> [1] https://access.redhat.com/solutions/1463363
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden omap growth on some OSDs

2017-12-11 Thread Gregory Farnum
Hmm, this does all sound odd. Have you tried just restarting the primary
OSD yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more
familiar with the recovery processes we're seeing here.
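
If you want to try that first, a rough sketch (the PG id and OSD id are
placeholders):

# rough sketch: find the primary for the affected PG and bounce it
ceph pg map 1.2f                  # the first entry in "acting" is the primary
systemctl restart ceph-osd@79     # run on the host that owns that OSD
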
-Greg

On Fri, Dec 8, 2017 at 7:30 AM  wrote:

>
> 
> From: Gregory Farnum [gfar...@redhat.com]
> Sent: 07 December 2017 21:57
> To: Vasilakakos, George (STFC,RAL,SC)
> Cc: drakonst...@gmail.com; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Sudden omap growth on some OSDs
>
>
>
> On Thu, Dec 7, 2017 at 4:41 AM > wrote:
>
> 
> From: Gregory Farnum [gfar...@redhat.com]
> Sent: 06 December 2017 22:50
> To: David Turner
> Cc: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Sudden omap growth on some OSDs
>
> On Wed, Dec 6, 2017 at 2:35 PM David Turner >> wrote:
> I have no proof or anything other than a hunch, but OSDs don't trim omaps
> unless all PGs are healthy.  If this PG is actually not healthy, but the
> cluster doesn't realize it while these 11 involved OSDs do realize that the
> PG is unhealthy... You would see this exact problem.  The OSDs think a PG
> is unhealthy so they aren't trimming their omaps while the cluster doesn't
> seem to be aware of it and everything else is trimming their omaps properly.
>
> I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
> stored in leveldb, but they have different trimming rules.
>
>
> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
>
> On Wed, Dec 6, 2017 at 1:59 PM  >> wrote:
> Hi ceph-users,
>
> We have a Ceph cluster (running Kraken) that is exhibiting some odd
> behaviour.
> A couple weeks ago, the LevelDBs on some our OSDs started growing large
> (now at around 20G size).
>
> The one thing they have in common is the 11 disks with inflating LevelDBs
> are all in the set for one PG in one of our pools (EC 8+3). This pool
> started to see use around the time the LevelDBs started inflating.
> Compactions are running and they do go down in size a bit but the overall
> trend is one of rapid growth. The other 2000+ OSDs in the cluster have
> LevelDBs between 650M and 1.2G.
> This PG has nothing to separate it from the others in its pool, within 5%
> of average number of objects per PG, no hot-spotting in terms of load, no
> weird states reported by ceph status.
>
> The one odd thing about it is the pg query output mentions it is
> active+clean, but it has a recovery state, which it enters every morning
> between 9 and 10am, where it mentions a "might_have_unfound" situation and
> having probed all other set members. A deep scrub of the PG didn't turn up
> anything.
>
> You need to be more specific here. What do you mean it "enters into" the
> recovery state every morning?
>
> Here's what PG query showed me yesterday:
> "recovery_state": [
> {
> "name": "Started\/Primary\/Active",
> "enter_time": "2017-12-05 09:48:57.730385",
> "might_have_unfound": [
> {
> "osd": "79(1)",
> "status": "already probed"
> },
> {
> "osd": "337(9)",
> "status": "already probed"
> },... it goes on to list all peers of this OSD in that PG.
>
> IIRC that's just a normal thing when there's any kind of recovery
> happening — it builds up a set during peering of OSDs that might have data,
> in case it discovers stuff missing.
>
> OK. But this is the only PG mentioning "might_have_unfound" across the two
> most used pools in our cluster and it's the only one that has all of its
> omap dirs at sizes more than 15 times the average for the cluster.
>
>
>
> How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
> What are you using the cluster for?
>
> 2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two
> more 1024 PG EC pools (also same profile). Then a set of RGW auxiliary
> pools with 3-way replication.
> I'm not 100% sure but I think all of our OSDs should have a few PGs from
> one of the EC pools. Our rules don't make a distinction so it's
> probabilistic. We're using the cluster as an object store, minor RGW use
> and custom gateways using libradosstriper.
> It's also worth pointing out that an OSD in that PG was taken out of the
> cluster earlier today and pg query shows the following weirdness:
> The primary thinks it's 

Re: [ceph-users] The way to minimize osd memory usage?

2017-12-11 Thread Subhachandra Chandra
I ran an experiment with 1GB memory per OSD using Bluestore. 12.2.2 made a
big difference.

In addition, you should have a look at your max object size. It looks like
you will see a jump in memory usage if a particular OSD happens to be the
primary for a number of objects being written in parallel. In our case
reducing the number of clients reduced memory requirements. Reducing max
object size should also reduce memory requirements on the OSD daemon.
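
To see where the memory actually goes on a given OSD, the mempool dump is
handy; a rough sketch (osd.0 is a placeholder):

# rough sketch: break down an OSD daemon's memory usage by pool (12.2.x)
ceph daemon osd.0 dump_mempools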

Subhachandra



On Sun, Dec 10, 2017 at 1:01 PM,  wrote:

> Send ceph-users mailing list submissions to
> ceph-users@lists.ceph.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> or, via email, send a message with subject or body 'help' to
> ceph-users-requ...@lists.ceph.com
>
> You can reach the person managing the list at
> ceph-users-ow...@lists.ceph.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
>
> Today's Topics:
>
>1. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
>2. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
>3. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
>4. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
>5. The way to minimize osd memory usage? (shadow_lin)
>6. Re: The way to minimize osd memory usage? (Konstantin Shalygin)
>7. Re: The way to minimize osd memory usage? (shadow_lin)
>8. Random checksum errors (bluestore on Luminous) (Martin Preuss)
>9. Re: The way to minimize osd memory usage? (David Turner)
>   10. what's the maximum number of OSDs per OSD server? (Igor Mendelev)
>   11. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
>   12. Re: what's the maximum number of OSDs per OSD server?
>   (Igor Mendelev)
>   13. Re: RBD+LVM -> iSCSI -> VMWare (Heðin Ejdesgaard Møller)
>   14. Re: Random checksum errors (bluestore on Luminous) (Martin Preuss)
>   15. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
>
>
> --
>
> Message: 1
> Date: Sun, 10 Dec 2017 00:26:39 +
> From: Donny Davis 
> To: Brady Deetz 
> Cc: Aaron Glenn , ceph-users
> 
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
>  mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Just curious but why not just use a hypervisor with rbd support? Are there
> VMware specific features you are reliant on?
>
> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz  wrote:
>
> > I'm testing using RBD as VMWare datastores. I'm currently testing with
> > krbd+LVM on a tgt target hosted on a hypervisor.
> >
> > My Ceph cluster is HDD backed.
> >
> > In order to help with write latency, I added an SSD drive to my
> hypervisor
> > and made it a writeback cache for the rbd via LVM. So far I've managed to
> > smooth out my 4k write latency and have some pleasing results.
> >
> > Architecturally, my current plan is to deploy an iSCSI gateway on each
> > hypervisor hosting that hypervisor's own datastore.
> >
> > Does anybody have any experience with this kind of configuration,
> > especially with regard to LVM writeback caching combined with RBD?
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> -- next part --
> An HTML attachment was scrubbed...
> URL:  attachments/20171210/4f055103/attachment-0001.html>
>
> --
>
> Message: 2
> Date: Sat, 9 Dec 2017 18:56:53 -0600
> From: Brady Deetz 
> To: Donny Davis 
> Cc: Aaron Glenn , ceph-users
> 
> Subject: Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare
> Message-ID:
> 

[ceph-users] Cluster stuck in failed state after power failure - please help

2017-12-11 Thread Jan Pekař - Imatic

Hi all,

hope that somebody can help me. I have a home Ceph installation.
After a power failure (it can happen in a datacenter as well) my Ceph booted 
into an inconsistent state.


I was backfilling data onto one new disk during the power failure. The first time 
it booted without some OSDs, but I fixed that. Now I have all my OSDs 
running, but the cluster state looks like this after some time:



  cluster:
id: 2d9bf17f-3d50-4a59-8359-abc8328fe801
health: HEALTH_WARN
1 filesystem is degraded
1 filesystem has a failed mds daemon
noout,nodeep-scrub flag(s) set
no active mgr
317162/12520262 objects misplaced (2.533%)
Reduced data availability: 52 pgs inactive, 29 pgs down, 1 
pg peering, 1 pg stale
Degraded data redundancy: 2099528/12520262 objects degraded 
(16.769%), 427 pgs unclean, 368 pgs degraded, 368 pgs undersized

1/3 mons down, quorum imatic-mce-2,imatic-mce

  services:
mon: 3 daemons, quorum imatic-mce-2,imatic-mce, out of quorum: obyvak
mgr: no daemons active
mds: cephfs-0/1/1 up , 1 failed
osd: 8 osds: 8 up, 8 in; 61 remapped pgs
 flags noout,nodeep-scrub

  data:
pools:   8 pools, 896 pgs
objects: 4446k objects, 9119 GB
usage:   9698 GB used, 2290 GB / 11988 GB avail
pgs: 2.455% pgs unknown
 3.348% pgs not active
 2099528/12520262 objects degraded (16.769%)
 317162/12520262 objects misplaced (2.533%)
 371 stale+active+clean
 183 active+undersized+degraded
 154 stale+active+undersized+degraded
 85  active+clean
 22  unknown
 19  stale+down
 14  stale+active+undersized+degraded+remapped+backfill_wait
 13  active+undersized+degraded+remapped+backfill_wait
 10  down
 6   active+clean+remapped
 6   stale+active+clean+remapped
 5   stale+active+remapped+backfill_wait
 2   active+remapped+backfill_wait
 2   stale+active+undersized+degraded+remapped+backfilling
 1   active+undersized+degraded+remapped
 1   active+undersized+degraded+remapped+backfilling
 1   stale+peering
 1   stale+active+clean+scrubbing

All OSDs are up and running. Before that I completed
ceph osd out
on one of my disks and removed that disk from the cluster because I don't 
want to use it anymore. That triggered a CRUSH reweight and started 
rebuilding my data. I think that should not have put my data in danger, even 
though I saw that some of my PGs were undersized (why?) - but that is not the 
issue now.


When I try to do
ceph pg dump
I have no response.

But ceph osd dump shows a weird OSD number on temporary PGs, namely 
2147483647. I think that there is some problem in some mon or 
other database and the peering process cannot complete.


What can I do next? I trusted that cluster so much, and I have some data 
I want to get back. Thank you very much for the help.


My ceph osd dump looks like this:


epoch 29442
fsid 2d9bf17f-3d50-4a59-8359-abc8328fe801
created 2014-12-10 23:00:49.140787
modified 2017-12-11 18:54:01.134091
flags noout,nodeep-scrub,sortbitwise,recovery_deletes
crush_version 14
full_ratio 0.97
backfillfull_ratio 0.91
nearfull_ratio 0.9
require_min_compat_client firefly
min_compat_client firefly
require_osd_release luminous
pool 0 'data' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool 
crash_replay_interval 45 min_read_recency_for_promote 1 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 1 'metadata' replicated size 3 min_size 1 crush_rule 1 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 27537 flags hashpspool 
min_read_recency_for_promote 1 min_write_recency_for_promote 1 
stripe_width 0 application cephfs
pool 2 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 28088 flags hashpspool 
min_read_recency_for_promote 1 min_write_recency_for_promote 1 
stripe_width 0 application rbd

removed_snaps [1~5]
pool 3 'nonreplicated' replicated size 1 min_size 1 crush_rule 2 
object_hash rjenkins pg_num 192 pgp_num 192 last_change 27537 flags 
hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 
1 stripe_width 0 application cephfs
pool 4 'replicated' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 192 pgp_num 192 last_change 27537 lfor 
17097/17097 flags hashpspool min_read_recency_for_promote 1 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 10 'erasure_3_1' erasure size 4 min_size 3 crush_rule 3 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 27537 lfor 9127/9127 flags 
hashpspool tiers 11 read_tier 11 write_tier 11 
min_write_recency_for_promote 1 stripe_width 4128 application cephfs
pool 11 'erasure_3_1_hot' replicated size 

Re: [ceph-users] Luminous rgw hangs after sighup

2017-12-11 Thread Casey Bodley
There have been other issues related to hangs during realm 
reconfiguration, e.g. http://tracker.ceph.com/issues/20937. We decided to 
revert the use of SIGHUP to trigger realm reconfiguration in 
https://github.com/ceph/ceph/pull/16807. I just started a backport of 
that for luminous.



On 12/11/2017 11:07 AM, Graham Allan wrote:

That's the issue I remember (#20763)!

The hang happened to me once, on this cluster, after upgrade from 
jewel to 12.2.2; then on Friday I disabled automatic bucket resharding 
due to some other problems - didn't get any logrotate-related hangs 
through the weekend. I wonder if these could be related?


Graham

On 12/11/2017 02:01 AM, Martin Emrich wrote:

Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed 
http://tracker.ceph.com/issues/20866).


It is still present in 12.2.2 (just tried it). My workaround is to 
exclude radosgw from logrotate (remove "radosgw" from 
/etc/logrotate.d/ceph) so it no longer gets SIGHUPed, to rotate the logs 
manually from time to time, and to completely restart the radosgw 
processes one after the other on my radosgw cluster.
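
To spell out that workaround: the point is to keep logrotate from SIGHUPing 
radosgw. Roughly (the exact postrotate line varies by packaging, and the unit 
name below is a placeholder):

# rough sketch: in /etc/logrotate.d/ceph, drop radosgw from the postrotate signal, e.g. change
#   killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd radosgw || true
# to
#   killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd || true
# then reopen rgw logs by restarting the gateways when convenient:
systemctl restart ceph-radosgw@rgw.$(hostname -s)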


Regards,

Martin

On 08.12.17, 18:58, "ceph-users on behalf of Graham Allan" wrote:


 I noticed this morning that all four of our rados gateways 
(luminous
 12.2.2) hung at logrotate time overnight. The last message 
logged was:
  > 2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed 
to clone shard, completion_mgr.get_next() returned ret=-125

  one of the 3 nodes recorded more detail:
 > 2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: 
Pausing frontends for realm update...
 > 2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: 
Frontends paused
 > 2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to 
clone shard, completion_mgr.get_next() returned ret=-125

 I remember seeing this happen on our test cluster a while back with
 Kraken. I can't find the tracker issue I originally found 
related to
 this, but it also sounds like it could be a reversion of bug 
#20339 or

 #20686?
  I recorded some strace output from one of the radosgw 
instances before

 restarting, if it's useful to open an issue.
  --
 Graham Allan
 Minnesota Supercomputing Institute - g...@umn.edu
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Tobias Prousa

Hi Zheng,

I did some more tests with cephfs-table-tool. I realized that disaster 
recovery implies possibly resetting the inode table completely, besides 
doing a session reset, using something like


cephfs-table-tool all reset inode

Would that be close to what you suggested? Is it safe to reset the complete 
inode table, or will that wipe my file system?


Btw. cephfs-table-tool show reset inode gives me ~400k inodes, part of 
them in section 'free', part of them in section 'projected_free'.


Thanks,
Tobi





On 12/11/2017 04:28 PM, Yan, Zheng wrote:
On Mon, Dec 11, 2017 at 11:17 PM, Tobias Prousa 
 wrote:

These are essentially the first commands I did execute, in this exact order.
Additionally I did a:

ceph fs reset cephfs --yes-i-really-mean-it


how many active mds were there before the upgrading.


Any hint on how to find max inode number and do I understand that I should
remove every free-marked inode number that is there except the biggest one
which has to stay?

If you are not sure, you can just try removing 1 inode numbers
from the inode table


How to remove those inodes using cephfs-table-tool?


using cephfs-table-tool take_inos 


--
---
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous rgw hangs after sighup

2017-12-11 Thread Graham Allan

That's the issue I remember (#20763)!

The hang happened to me once, on this cluster, after upgrade from jewel 
to 12.2.2; then on Friday I disabled automatic bucket resharding due to 
some other problems - didn't get any logrotate-related hangs through the 
weekend. I wonder if these could be related?
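
For reference, the knob involved here is presumably rgw_dynamic_resharding;
a minimal ceph.conf sketch (the section name is only an example), applied on
the rgw hosts and followed by a radosgw restart:

[client.rgw.gw1]
rgw dynamic resharding = false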


Graham

On 12/11/2017 02:01 AM, Martin Emrich wrote:

Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed 
http://tracker.ceph.com/issues/20866).

It is still present in 12.2.2 (just tried it). My workaround is to exclude radosgw from 
logrotate (remove "radosgw" from /etc/logrotate.d/ceph) from being SIGHUPed, 
and to rotate the logs manually from time to time and completely restarting the radosgw 
processes one after the other on my radosgw cluster.
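
As an illustration, the manual rotation could look roughly like this (log file
and systemd unit names are assumptions and depend on the deployment):

mv /var/log/ceph/ceph-rgw-gw1.log /var/log/ceph/ceph-rgw-gw1.log.1
systemctl restart ceph-radosgw@rgw.gw1    # one gateway at a time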

Regards,

Martin

Am 08.12.17, 18:58 schrieb "ceph-users im Auftrag von Graham Allan" 
:

 I noticed this morning that all four of our rados gateways (luminous
 12.2.2) hung at logrotate time overnight. The last message logged was:
 
 > 2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
 
 one of the 3 nodes recorded more detail:

 > 2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: Pausing 
frontends for realm update...
 > 2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: Frontends 
paused
 > 2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to clone shard, 
completion_mgr.get_next() returned ret=-125
 I remember seeing this happen on our test cluster a while back with
 Kraken. I can't find the tracker issue I originally found related to
 this, but it also sounds like it could be a reversion of bug #20339 or
 #20686?
 
 I recorded some strace output from one of the radosgw instances before

 restarting, if it's useful to open an issue.
 
 --

 Graham Allan
 Minnesota Supercomputing Institute - g...@umn.edu
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to remove a faulty bucket?

2017-12-11 Thread Robin H. Johnson
On Mon, Dec 11, 2017 at 09:29:11AM +, Martin Emrich wrote:
> Hi!
> 
> Am 09.12.17, 00:19 schrieb "Robin H. Johnson" :
> 
> If you use 'radosgw-admin bi list', you can get a listing of the raw 
> bucket
> index. I'll bet that the objects aren't being shown at the S3 layer
> because something is wrong with them. But since they are in the bi-list,
> you'll get 409 BucketNotEmpty.
> 
> Yes indeed. Running "radosgw-admin bi list" results in an incomplete 300MB 
> JSON file, before it freezes.
That's a very good starting point to debug.
The bucket index is stored inside the OMAP area of a raw RADOS object.
(in a filestore OSD it's in the LevelDB),  I wonder if you have
corruption or something else awry. 
How many objects were in this bucket? The number from 'bucket stats' is
a good starting point.

Newer versions of Jewel do report OMAP inconsistency after deep-scrub, so
that would be a help in your case too.
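
For reference, a few commands that can help narrow this down (bucket name,
index pool name and ids are placeholders; the pool name depends on the zone,
and sharded indexes add a numeric suffix to the index objects):

radosgw-admin bucket stats --bucket=mybucket      # object count and bucket id
rados -p default.rgw.buckets.index listomapkeys .dir.{bucket-id}
ceph osd map default.rgw.buckets.index .dir.{bucket-id}
ceph pg deep-scrub {pgid}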

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Asst. Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about rbd image

2017-12-11 Thread tim taler
Well, connecting an RBD to two servers would be like mapping a block
device from a storage array onto two different hosts;
that is possible and (was) done.
(It would be much more difficult, though, to connect a single physical
hard disk to two computers.)

The point is that, as mentioned above, you would need a cluster-aware
filesystem on that RBD.
Not an ext4, not a ZFS; GFS2 was mentioned above, and I tried it once
just for fun using OCFS2 and it worked.

I wouldn't do that nowadays though (especially not in production).
AFAIK "cluster aware" filesystems came a while BEFORE "clustered and
distributed" filesystems.

With OCFS2 on RBD (one mount is read/write, the second and following
mounts are read-only, right?) you have to take care of two layers,
so two layers of possible problems.

What you probably really want is a "clustered distributed
filesystem" such as cephfs or glusterfs (to name two of them):
you can have as many replicas as you like, any number of hosts can
mount and read-write onto such a filesystem, and - given the right
setup -
a bunch of hosts can even fail and the "clustered distributed"
filesystem is still available.
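
As a sketch, mounting CephFS on several clients at once needs nothing special
(monitor address, credentials and mount point are placeholders):

mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret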

hth

On Mon, Dec 11, 2017 at 3:37 PM, David Turner  wrote:
> An RBD can only be mapped to a single client host.  There is no way around
> this.  An RBD at its core is a block device.  Connecting an RBD to 2 servers
> would be like connecting a harddrive to 2 servers.
>
> On Mon, Dec 11, 2017 at 9:13 AM 13605702596 <13605702...@163.com> wrote:
>>
>> hi Jason
>> thanks for your answer.
>> there is one more question, that is:
>> can we use rbd image to share data between two clients? one wirtes data,
>> another just reads?
>>
>> thanks
>>
>>
>> At 2017-12-11 21:52:54, "Jason Dillaman"  wrote:
>> >On Mon, Dec 11, 2017 at 7:50 AM, 13605702...@163.com
>> ><13605702...@163.com> wrote:
>> >> hi
>> >>
>> >> i'm testing on rbd image. the are TWO questions that confused me.
>> >> ceph -v
>> >> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>> >> uname -r
>> >> 3.10.0-514.el7.x86_64
>> >>
>> >> (1)  does rbd image supports multiple clients to write data
>> >> simultaneously?
>> >
>> >You would need to put a clustered file system like GFS2 on top of the
>> >block device to utilize it concurrently.
>> >
>> >> if it supports, how can share data between several clients using rbd
>> >> image?
>> >> client A: write data to rbd/test
>> >> client B: rbd map, and mount it to /mnt, file can be found in /mnt dir,
>> >> but
>> >> the content is miss.
>> >>
>> >> on monitor:
>> >> rbd create rbd/test -s 1024
>> >> rbd info rbd/test
>> >> rbd image 'test':
>> >> size 1024 MB in 256 objects
>> >> order 22 (4096 kB objects)
>> >> block_name_prefix: rbd_data.121d238e1f29
>> >> format: 2
>> >> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>> >> flags:
>> >> then i disble the feature: object-map, fast-diff, deep-flatten
>> >>
>> >> on client A:
>> >> rbd map rbd/test
>> >> mkfs -t xfs /dev/rbd0
>> >> mount /dev/rbd/rbd/test /mnt/
>> >> echo 124 > /mnt/host124
>> >> cat host124
>> >> 124
>> >>
>> >> on client B:
>> >> rbd map rbd/test
>> >> mount /dev/rbd/rbd/test /mnt/
>> >> cat  host124  --> show nothing!
>> >>
>> >> echo 125 > /mnt/host125
>> >> cat /mnt/host125
>> >> 125
>> >>
>> >> then on client C:
>> >> rbd map rbd/test
>> >> mount /dev/rbd/rbd/test /mnt/
>> >> cd /mnt
>> >> cat host124 --> show nothing!
>> >> cat host125 --> show nothing!
>> >>
>> >> (2) does rbd image supports stripping? if does, howto?
>> >
>> >Not yet, but it's a work-in-progress for krbd to support "fancy"
>> >striping (librbd would support it via rbd-nbd).
>> >
>> >> on monitor, i create an image as following:
>> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock
>> >> --size
>> >> 1024 --object-size 4096 --stripe-unit 4096  --stripe-count 2
>> >> stripe unit is not a factor of the object size
>> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock
>> >> --size
>> >> 1024 --object-size 8M --stripe-unit 4M --stripe-count 2
>> >> rbd: the argument ('4M') for option '--unit' is invalid
>> >> i don't know why those cmd fails?
>> >
>> >Only Luminous and later releases support specifying the stripe unit
>> >with B/K/M suffixes.
>> >
>> >> finally, i successed with the following cmd:
>> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock
>> >> --size
>> >> 1024 --object-size 8388608 --stripe-unit 4194304  --stripe-count 2
>> >>
>> >> but whe i map it on client, it fails.
>> >> the error msg:
>> >> rbd: image test: unsupported stripe unit (got 4194304 want 8388608)
>> >>
>> >> best wishes
>> >> thanks
>> >>
>> >> 
>> >> 13605702...@163.com
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> >
>> >--
>> >Jason

Re: [ceph-users] Luminous, RGW bucket resharding

2017-12-11 Thread Sam Wouters
On 11-12-17 16:23, Orit Wasserman wrote:
> On Mon, Dec 11, 2017 at 4:58 PM, Sam Wouters  wrote:
>> Hi Orrit,
>>
>>
>> On 04-12-17 18:57, Orit Wasserman wrote:
>>> Hi Andreas,
>>>
>>> On Mon, Dec 4, 2017 at 11:26 AM, Andreas Calminder
>>>  wrote:
 Hello,
 With release 12.2.2 dynamic resharding bucket index has been disabled
 when running a multisite environment
 (http://tracker.ceph.com/issues/21725). Does this mean that resharding
 of bucket indexes shouldn't be done at all, manually, while running
 multisite as there's a risk of corruption?

>>> You will need to stop the sync on the bucket before doing the
>>> resharding and start it again after the resharding completes.
>>> It will start a full sync on the bucket (it doesn't mean we copy the
>>> objects but we go over on all of them to check if the need to be
>>> synced).
>>> We will automate this as part of the reshard admin command in the next
>>> Luminous release.
>> Does this also apply to Jewel? Stop sync and restart after resharding.
>> (I don't know if there is any way to disable sync for a specific bucket.)
>>
> In Jewel we only support offline bucket resharding, you have to stop
> both zones gateways before resharding.
> Do:
> Execute the resharding radosgw-admin command.
> Run full sync on the bucket using: radosgw-admin bucket sync init on the 
> bucket.
> Start the gateways.
>
> This should work but I have not tried it ...
> Regards,
> Orit
Is it really necessary to stop the gateways? We tend to block all
traffic to the bucket being resharded with ACLs in the
haproxy in front, to avoid downtime for unrelated buckets.
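
For illustration, such a block might look roughly like this (assuming
path-style bucket access and a recent haproxy; bucket name and status code
are placeholders):

acl resharding_bucket path_beg /mybucket
http-request deny deny_status 503 if resharding_bucket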

Would a:

- restart gws with sync thread disabled
- block traffic to bucket
- reshard
- unblock traffic
- bucket sync init
- restart gws with sync enabled

work as well?

r,
Sam

>> r,
>> Sam
 Also, as dynamic bucket resharding was/is the main motivator moving to
 Luminous (for me at least) is dynamic reshardning something that is
 planned to be fixed for multisite environments later in the Luminous
 life-cycle or will it be left disabled forever?

>>> We are planning to enable it in Luminous time.
>>>
>>> Regards,
>>> Orit
>>>
 Thanks!
 /andreas
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Tobias Prousa

Hi Zheng,


On 12/11/2017 04:28 PM, Yan, Zheng wrote:

On Mon, Dec 11, 2017 at 11:17 PM, Tobias Prousa  wrote:

These are essentially the first commands I did execute, in this exact order.
Additionally I did a:

ceph fs reset cephfs --yes-i-really-mean-it


how many active mds were there before the upgrading.
The CephFS in all the years never ever had more than a single active 
MDS. It might be that
the ceph fs reset was superfluous; killing all clients happened at 
the same time, so this might not have been the change that got it "working" 
again.




Any hint on how to find max inode number and do I understand that I should
remove every free-marked inode number that is there except the biggest one
which has to stay?

If you are not sure, you can just try removing 1 inode numbers
from the inode table
I still do not get the meaning of all that inode removal. Wouldn't 
removing inodes drop files, i.e. data loss? And do those falsely 
free-marked inodes mean that if I start writing to my cephfs (in case I 
get the MDS working stably again) it would write new data to 
inodes that are actually already in use, again resulting in data loss? 
I would like to understand what I'm doing before I do it ;)






How to remove those inodes using cephfs-table-tool?


using cephfs-table-tool take_inos 
Is there some documentation for cephfs-table-tool? And which inodes would 
I want to remove? Again, I would like to understand what's happening.


Thank you so much for taking your time. Your help is highly appreciated!

Best regards,
Tobi

--
---
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread German Anders
Yes, it includes all the available pools on the cluster:

*# ceph df*
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
53650G 42928G   10722G 19.99
POOLS:
NAMEID USED  %USED MAX AVAIL OBJECTS
volumes 13 2979G 33.73 5854G  767183
db  18  856G  4.6517563G 1657174
cephfs_data 22   880 0 5854G   6
cephfs_metadata 23  977k 0 5854G  65

*# rados lspools*
volumes
db
cephfs_data
cephfs_metadata

The good news is that after restarting the ceph-mgr, it started to work :)
but like you said, it would be nice to know how the system got into this state.
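
For anyone following along, restarting the active mgr is typically just
something like the following (the daemon id is a placeholder):

systemctl restart ceph-mgr@mon01
# or hand over to a standby with: ceph mgr fail mon01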

Thanks a lot John :)

Best,


*German*

2017-12-11 12:17 GMT-03:00 John Spray :

> On Mon, Dec 11, 2017 at 3:13 PM, German Anders 
> wrote:
> > Hi John,
> >
> > how are you? no problem :) . Unfortunately the error on the 'ceph fs
> status'
> > command is still happening:
>
> OK, can you check:
>  - does the "ceph df" output include all the pools?
>  - does restarting ceph-mgr clear the issue?
>
> We probably need to modify this code to handle stats-less pools
> anyway, but I'm curious about how the system got into this state.
>
> John
>
>
> > # ceph fs status
> > Error EINVAL: Traceback (most recent call last):
> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> > return self.handle_fs_status(cmd)
> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> handle_fs_status
> > stats = pool_stats[pool_id]
> > KeyError: (15L,)
> >
> >
> >
> > German
> > 2017-12-11 12:08 GMT-03:00 John Spray :
> >>
> >> On Mon, Dec 4, 2017 at 6:37 PM, German Anders 
> >> wrote:
> >> > Hi,
> >> >
> >> > I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2
> >> > (stable),
> >> > and i'm getting a traceback while trying to run:
> >> >
> >> > # ceph fs status
> >> >
> >> > Error EINVAL: Traceback (most recent call last):
> >> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in
> handle_command
> >> > return self.handle_fs_status(cmd)
> >> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> >> > handle_fs_status
> >> > stats = pool_stats[pool_id]
> >> > KeyError: (15L,)
> >> >
> >> >
> >> > # ceph fs ls
> >> > name: cephfs, metadata pool: cephfs_metadata, data pools:
> [cephfs_data ]
> >> >
> >> >
> >> > Any ideas?
> >>
> >> (I'm a bit late but...)
> >>
> >> Is this still happening or did it self-correct?  It could have been
> >> happening when the pool had just been created but the mgr hadn't heard
> >> about any stats from the OSDs about that pool yet (which we should
> >> fix, anyway)
> >>
> >> John
> >>
> >>
> >> >
> >> > Thanks in advance,
> >> >
> >> > Germ
> >> > an
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Yan, Zheng
On Mon, Dec 11, 2017 at 11:17 PM, Tobias Prousa  wrote:
>
> These are essentially the first commands I did execute, in this exact order.
> Additionally I did a:
>
> ceph fs reset cephfs --yes-i-really-mean-it
>

how many active mds were there before the upgrading.

>
> Any hint on how to find max inode number and do I understand that I should
> remove every free-marked inode number that is there except the biggest one
> which has to stay?

If you are not sure, you can just try removing 1 inode numbers
from the inode table

>
> How to remove those inodes using cephfs-table-tool?
>

using cephfs-table-tool take_inos 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous, RGW bucket resharding

2017-12-11 Thread Orit Wasserman
On Mon, Dec 11, 2017 at 4:58 PM, Sam Wouters  wrote:
> Hi Orrit,
>
>
> On 04-12-17 18:57, Orit Wasserman wrote:
>> Hi Andreas,
>>
>> On Mon, Dec 4, 2017 at 11:26 AM, Andreas Calminder
>>  wrote:
>>> Hello,
>>> With release 12.2.2 dynamic resharding bucket index has been disabled
>>> when running a multisite environment
>>> (http://tracker.ceph.com/issues/21725). Does this mean that resharding
>>> of bucket indexes shouldn't be done at all, manually, while running
>>> multisite as there's a risk of corruption?
>>>
>> You will need to stop the sync on the bucket before doing the
>> resharding and start it again after the resharding completes.
>> It will start a full sync on the bucket (it doesn't mean we copy the
>> objects but we go over on all of them to check if the need to be
>> synced).
>> We will automate this as part of the reshard admin command in the next
>> Luminous release.
> Does this also apply to Jewel? Stop sync and restart after resharding.
> (I don't know if there is any way to disable sync for a specific bucket.)
>

In Jewel we only support offline bucket resharding, you have to stop
both zones gateways before resharding.
Do:
Execute the resharding radosgw-admin command.
Run full sync on the bucket using: radosgw-admin bucket sync init on the bucket.
Start the gateways.
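
A rough sketch of those steps (bucket name and shard count are placeholders,
and the reshard subcommand needs a recent enough Jewel point release):

radosgw-admin bucket reshard --bucket=mybucket --num-shards=64
radosgw-admin bucket sync init --bucket=mybucket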

This should work but I have not tried it ...
Regards,
Orit

> r,
> Sam
>>> Also, as dynamic bucket resharding was/is the main motivator moving to
>>> Luminous (for me at least) is dynamic reshardning something that is
>>> planned to be fixed for multisite environments later in the Luminous
>>> life-cycle or will it be left disabled forever?
>>>
>> We are planning to enable it in Luminous time.
>>
>> Regards,
>> Orit
>>
>>> Thanks!
>>> /andreas
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread John Spray
On Mon, Dec 11, 2017 at 3:13 PM, German Anders  wrote:
> Hi John,
>
> how are you? no problem :) . Unfortunately the error on the 'ceph fs status'
> command is still happening:

OK, can you check:
 - does the "ceph df" output include all the pools?
 - does restarting ceph-mgr clear the issue?

We probably need to modify this code to handle stats-less pools
anyway, but I'm curious about how the system got into this state.

John


> # ceph fs status
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> return self.handle_fs_status(cmd)
>   File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
> stats = pool_stats[pool_id]
> KeyError: (15L,)
>
>
>
> German
> 2017-12-11 12:08 GMT-03:00 John Spray :
>>
>> On Mon, Dec 4, 2017 at 6:37 PM, German Anders 
>> wrote:
>> > Hi,
>> >
>> > I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2
>> > (stable),
>> > and i'm getting a traceback while trying to run:
>> >
>> > # ceph fs status
>> >
>> > Error EINVAL: Traceback (most recent call last):
>> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
>> > return self.handle_fs_status(cmd)
>> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
>> > handle_fs_status
>> > stats = pool_stats[pool_id]
>> > KeyError: (15L,)
>> >
>> >
>> > # ceph fs ls
>> > name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
>> >
>> >
>> > Any ideas?
>>
>> (I'm a bit late but...)
>>
>> Is this still happening or did it self-correct?  It could have been
>> happening when the pool had just been created but the mgr hadn't heard
>> about any stats from the OSDs about that pool yet (which we should
>> fix, anyway)
>>
>> John
>>
>>
>> >
>> > Thanks in advance,
>> >
>> > Germ
>> > an
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Tobias Prousa



On 12/11/2017 04:05 PM, Yan, Zheng wrote:

On Mon, Dec 11, 2017 at 10:13 PM, Tobias Prousa  wrote:

Hi there,

I'm running a CEPH cluster for some libvirt VMs and a CephFS providing /home
to ~20 desktop machines. There are 4 Hosts running 4 MONs, 4MGRs, 3MDSs (1
active, 2 standby) and 28 OSDs in total. This cluster is up and running
since the days of Bobtail (yes, including CephFS).

Now with update from 12.2.1 to 12.2.2 on last friday afternoon I restarted
MONs, MGRs, OSDs as usual. RBD is running just fine. But after trying to
restart MDSs they tried replaying journal then fell back to standby and FS
was in state "damaged". I finally got them back working after I did a good
portion of whats described here:

http://docs.ceph.com/docs/master/cephfs/disaster-recovery/

What commands did you run? you need to run following commands.

cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
These are essentially the first commands I did execute, in this exact 
order. Additionally I did a:


ceph fs reset cephfs --yes-i-really-mean-it

That was the moment when I was able to restart the MDSs for the first 
time, back on Friday, IIRC.





Now when all clients are shut down I can start MDS, will replay and become
active. I then can mount CephFS on a client and can access my files and
folders. But the more clients I bring up MDS will first report damaged
metadata (probably due to some damaged paths, I could live with that) and
then MDS will fail with assert:

/build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED
assert(inode_map.count(in->vino()) == 0)

I tried doing an online CephFS scrub like

ceph daemon mds.a scrub_path / recursive repair

This will run for couple of hours, always finding exactly 10001 damages of
type "backtrace" and reporting it would be fixing loads of erronously
free-marked inodes until MDS crashes. When I rerun that scrub after having
killed all clients and restarted MDSs things will repeat finding exactly
those 10001 damages and it will begin fixing those exactly same free-marked
inodes over again.

Find max inode number of these free-marked inodes, then use
cephfs-table-tool to remove inode numbers that are smaller than the
max number. you can remove a little more just in case.  Before doing
this, you should to stop mds and run "cephfs-table-tool all reset
session".

If everything goes right, mds will no longer trigger the assertion.
Any hint on how to find max inode number and do I understand that I 
should remove every free-marked inode number that is there except the 
biggest one which has to stay?


How to remove those inodes using cephfs-table-tool?




Btw. CephFS has about 3 million objects in metadata pool. Data pool is about
30 million objects with ~2.5TB * 3 replicas.

What I tried next is keeping MDS down and doing

cephfs-data-scan scan_extents 
cephfs-data-scan scan_inodes 
cephfs-data-scan scan_links

As this is described to take "a very long time" this is what I initially
skipped from the disaster-recovery tips. Right now I'm still on the first step with 6
workers on a single host busy doing cephfs-data-scan scan_extents. ceph -s
shows me client io of 20kB/s (!!!). If thats real scan speed this is going
to take ages.
Any way to tell how long this is going to take? Could I speed things up by
running more workers on multiple hosts simultaneously?
Should I abort it as I actually don't have the problem of lost files. Maybe
running cephfs-data-scan scan_links would better suit my issue, or does
scan_extents/scan_inodes HAVE to be run and finished first?

I have to get this cluster up and running again as soon as possible. Any
help highly appreciated. If there is anything I can help, e.g. with further
information, feel free to ask. I'll try to hang around on #ceph (nick
topro/topro_/topro__). FYI, I'm in Central Europe TimeZone (UTC+1).

Thank you so much!

Best regards,
Tobi

--
---
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
---
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Yan, Zheng
On Mon, Dec 11, 2017 at 10:13 PM, Tobias Prousa  wrote:
> Hi there,
>
> I'm running a CEPH cluster for some libvirt VMs and a CephFS providing /home
> to ~20 desktop machines. There are 4 Hosts running 4 MONs, 4MGRs, 3MDSs (1
> active, 2 standby) and 28 OSDs in total. This cluster is up and running
> since the days of Bobtail (yes, including CephFS).
>
> Now with update from 12.2.1 to 12.2.2 on last friday afternoon I restarted
> MONs, MGRs, OSDs as usual. RBD is running just fine. But after trying to
> restart MDSs they tried replaying journal then fell back to standby and FS
> was in state "damaged". I finally got them back working after I did a good
> portion of whats described here:
>
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
>
> Now when all clients are shut down I can start MDS, will replay and become
> active. I then can mount CephFS on a client and can access my files and
> folders. But the more clients I bring up MDS will first report damaged
> metadata (probably due to some damaged paths, I could live with that) and
> then MDS will fail with assert:
>
> /build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED
> assert(inode_map.count(in->vino()) == 0)
>
> I tried doing an online CephFS scrub like
>
> ceph daemon mds.a scrub_path / recursive repair
>
> This will run for couple of hours, always finding exactly 10001 damages of
> type "backtrace" and reporting it would be fixing loads of erronously
> free-marked inodes until MDS crashes. When I rerun that scrub after having
> killed all clients and restarted MDSs things will repeat finding exactly
> those 10001 damages and it will begin fixing those exactly same free-marked
> inodes over again.
>
> Btw. CephFS has about 3 million objects in metadata pool. Data pool is about
> 30 million objects with ~2.5TB * 3 replicas.
>
> What I tried next is keeping MDS down and doing
>
> cephfs-data-scan scan_extents 
> cephfs-data-scan scan_inodes 
> cephfs-data-scan scan_links
>
> As this is described to take "a very long time" this is what I initially
> skipped from disater-recovery tips. Right now I'm still on first step with 6
> workers on a single host busy doing cephfs-data-scan scan_extents. ceph -s
> shows me client io of 20kB/s (!!!). If thats real scan speed this is going
> to take ages.
> Any way to tell how long this is going to take? Could I speed things up by
> running more workers on multiple hosts simultaneously?
> Should I abort it as I actually don't have the problem of lost files. Maybe
> running cephfs-data-scan scan_links would better suit my issue, or does
> scan_extents/scan_indoes HAVE to be run and finished first?
>


you can interrupt scan_extents safely.


> I have to get this cluster up and running again as soon as possible. Any
> help highly appreciated. If there is anything I can help, e.g. with further
> information, feel free to ask. I'll try to hang around on #ceph (nick
> topro/topro_/topro__). FYI, I'm in Central Europe TimeZone (UTC+1).
>
> Thank you so much!
>
> Best regards,
> Tobi
>
> --
> ---
> Dipl.-Inf. (FH) Tobias Prousa
> Leiter Entwicklung Datenlogger
>
> CAETEC GmbH
> Industriestr. 1
> D-82140 Olching
> www.caetec.de
>
> Gesellschaft mit beschränkter Haftung
> Sitz der Gesellschaft: Olching
> Handelsregister: Amtsgericht München, HRB 183929
> Geschäftsführung: Stephan Bacher, Andreas Wocke
>
> Tel.: +49 (0)8142 / 50 13 60
> Fax.: +49 (0)8142 / 50 13 69
>
> eMail: tobias.pro...@caetec.de
> Web:   http://www.caetec.de
> 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread German Anders
Hi John,

how are you? no problem :) . Unfortunately the error on the 'ceph fs
status' command is still happening:

*# ceph fs status*
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
return self.handle_fs_status(cmd)
  File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
stats = pool_stats[pool_id]
KeyError: (15L,)



*German*
2017-12-11 12:08 GMT-03:00 John Spray :

> On Mon, Dec 4, 2017 at 6:37 PM, German Anders 
> wrote:
> > Hi,
> >
> > I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2
> (stable),
> > and i'm getting a traceback while trying to run:
> >
> > # ceph fs status
> >
> > Error EINVAL: Traceback (most recent call last):
> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> > return self.handle_fs_status(cmd)
> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> handle_fs_status
> > stats = pool_stats[pool_id]
> > KeyError: (15L,)
> >
> >
> > # ceph fs ls
> > name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
> >
> >
> > Any ideas?
>
> (I'm a bit late but...)
>
> Is this still happening or did it self-correct?  It could have been
> happening when the pool had just been created but the mgr hadn't heard
> about any stats from the OSDs about that pool yet (which we should
> fix, anyway)
>
> John
>
>
> >
> > Thanks in advance,
> >
> > Germ
> > an
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread John Spray
On Mon, Dec 4, 2017 at 6:37 PM, German Anders  wrote:
> Hi,
>
> I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2 (stable),
> and i'm getting a traceback while trying to run:
>
> # ceph fs status
>
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> return self.handle_fs_status(cmd)
>   File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
> stats = pool_stats[pool_id]
> KeyError: (15L,)
>
>
> # ceph fs ls
> name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
>
>
> Any ideas?

(I'm a bit late but...)

Is this still happening or did it self-correct?  It could have been
happening when the pool had just been created but the mgr hadn't heard
about any stats from the OSDs about that pool yet (which we should
fix, anyway)

John


>
> Thanks in advance,
>
> Germ
> an
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Yan, Zheng
On Mon, Dec 11, 2017 at 10:13 PM, Tobias Prousa  wrote:
> Hi there,
>
> I'm running a CEPH cluster for some libvirt VMs and a CephFS providing /home
> to ~20 desktop machines. There are 4 Hosts running 4 MONs, 4MGRs, 3MDSs (1
> active, 2 standby) and 28 OSDs in total. This cluster is up and running
> since the days of Bobtail (yes, including CephFS).
>
> Now with update from 12.2.1 to 12.2.2 on last friday afternoon I restarted
> MONs, MGRs, OSDs as usual. RBD is running just fine. But after trying to
> restart MDSs they tried replaying journal then fell back to standby and FS
> was in state "damaged". I finally got them back working after I did a good
> portion of whats described here:
>
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery/

What commands did you run? you need to run following commands.

cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session


>
> Now when all clients are shut down I can start MDS, will replay and become
> active. I then can mount CephFS on a client and can access my files and
> folders. But the more clients I bring up MDS will first report damaged
> metadata (probably due to some damaged paths, I could live with that) and
> then MDS will fail with assert:
>
> /build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED
> assert(inode_map.count(in->vino()) == 0)
>
> I tried doing an online CephFS scrub like
>
> ceph daemon mds.a scrub_path / recursive repair
>
> This will run for couple of hours, always finding exactly 10001 damages of
> type "backtrace" and reporting it would be fixing loads of erronously
> free-marked inodes until MDS crashes. When I rerun that scrub after having
> killed all clients and restarted MDSs things will repeat finding exactly
> those 10001 damages and it will begin fixing those exactly same free-marked
> inodes over again.

Find max inode number of these free-marked inodes, then use
cephfs-table-tool to remove inode numbers that are smaller than the
max number. you can remove a little more just in case.  Before doing
this, you should to stop mds and run "cephfs-table-tool all reset
session".

If everything goes right, mds will no longer trigger the assertion.
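
A sketch of that sequence (the inode number is a placeholder for the max you
determined, and the MDS unit name depends on the deployment):

systemctl stop ceph-mds@a          # stop all MDS daemons
cephfs-table-tool all reset session
cephfs-table-tool all take_inos {max-ino}
systemctl start ceph-mds@a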


>
> Btw. CephFS has about 3 million objects in metadata pool. Data pool is about
> 30 million objects with ~2.5TB * 3 replicas.
>
> What I tried next is keeping MDS down and doing
>
> cephfs-data-scan scan_extents 
> cephfs-data-scan scan_inodes 
> cephfs-data-scan scan_links
>
> As this is described to take "a very long time" this is what I initially
> skipped from disater-recovery tips. Right now I'm still on first step with 6
> workers on a single host busy doing cephfs-data-scan scan_extents. ceph -s
> shows me client io of 20kB/s (!!!). If thats real scan speed this is going
> to take ages.
> Any way to tell how long this is going to take? Could I speed things up by
> running more workers on multiple hosts simultaneously?
> Should I abort it as I actually don't have the problem of lost files. Maybe
> running cephfs-data-scan scan_links would better suit my issue, or does
> scan_extents/scan_indoes HAVE to be run and finished first?
>
> I have to get this cluster up and running again as soon as possible. Any
> help highly appreciated. If there is anything I can help, e.g. with further
> information, feel free to ask. I'll try to hang around on #ceph (nick
> topro/topro_/topro__). FYI, I'm in Central Europe TimeZone (UTC+1).
>
> Thank you so much!
>
> Best regards,
> Tobi
>
> --
> ---
> Dipl.-Inf. (FH) Tobias Prousa
> Leiter Entwicklung Datenlogger
>
> CAETEC GmbH
> Industriestr. 1
> D-82140 Olching
> www.caetec.de
>
> Gesellschaft mit beschränkter Haftung
> Sitz der Gesellschaft: Olching
> Handelsregister: Amtsgericht München, HRB 183929
> Geschäftsführung: Stephan Bacher, Andreas Wocke
>
> Tel.: +49 (0)8142 / 50 13 60
> Fax.: +49 (0)8142 / 50 13 69
>
> eMail: tobias.pro...@caetec.de
> Web:   http://www.caetec.de
> 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous, RGW bucket resharding

2017-12-11 Thread Sam Wouters
Hi Orrit,


On 04-12-17 18:57, Orit Wasserman wrote:
> Hi Andreas,
>
> On Mon, Dec 4, 2017 at 11:26 AM, Andreas Calminder
>  wrote:
>> Hello,
>> With release 12.2.2 dynamic resharding bucket index has been disabled
>> when running a multisite environment
>> (http://tracker.ceph.com/issues/21725). Does this mean that resharding
>> of bucket indexes shouldn't be done at all, manually, while running
>> multisite as there's a risk of corruption?
>>
> You will need to stop the sync on the bucket before doing the
> resharding and start it again after the resharding completes.
> It will start a full sync on the bucket (it doesn't mean we copy the
> objects but we go over on all of them to check if the need to be
> synced).
> We will automate this as part of the reshard admin command in the next
> Luminous release.
Does this also apply to Jewel? Stop sync and restart after resharding.
(I don't know if there is any way to disable sync for a specific bucket.)

r,
Sam
>> Also, as dynamic bucket resharding was/is the main motivator moving to
>> Luminous (for me at least) is dynamic reshardning something that is
>> planned to be fixed for multisite environments later in the Luminous
>> life-cycle or will it be left disabled forever?
>>
> We are planning to enable it in Luminous time.
>
> Regards,
> Orit
>
>> Thanks!
>> /andreas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck down+peering after host failure.

2017-12-11 Thread Denes Dolhay

Hi,

I found another possible cause for your problem:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure


I hope that I helped,
Denes.


On 12/11/2017 03:43 PM, Denes Dolhay wrote:


Hi Aaron!


There is a previous post about safely shutting down and restarting a 
cluster:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017378.html


To the problems at hand:

What size were you using?

Ceph can only obey the failure domain if it knows exactly which osd is 
on which host. So are you sure that there were no errors in the "ceph 
osd tree"? Maybe an osd physically in the offline node was logically 
placed to another host in the tree?



I think you should:

-Query a downed pg, determine its acting group "ceph pg map {pg-num}" 
and compare it to the tree


-Try to make the offline host boot up, or if it was damaged, add 
its OSDs to another host



If this did not help, then please include a ceph health, ceph osd map, 
ceph pg map {faulty pg-num}, ceph pg {faulty pg-num} query



I hope that I helped,
Denes.


On 12/11/2017 03:02 PM, Aaron Bassett wrote:

Morning All,
I have a large-ish (16 node, 1100 osds) cluster I recently had to move from one DC to another. 
Before shutting everything down, I set noout, norecover, and nobackfill, thinking this would 
help everything stand back up again. Upon installation at the new DC, one of the nodes refused 
to boot. With my crush rule having the failure domain as host, I did not think this would be a 
problem. However, once I turned off noout, norecover, and nobackfill, everything else came up 
and settled in, I still have 1545 pgs stuck down+peering. On other pgs, recovery and 
backfilling are proceeding as expected, but these pgs appear to be permanently stuck. When 
querying the down+peering pgs, they all mention pgs from the down node in 
""down_osds_we_would_probe". I'm not sure why it *needs* to query these since 
it should have two other copies on other nodes? I'm not sure if bringing everything up with 
noout or norecover on confused things. Looking for advice...

Aaron
CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck down+peering after host failure.

2017-12-11 Thread Denes Dolhay

Hi Aaron!


There is a previous post about safely shutting down and restarting a 
cluster:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017378.html


To the problems at hand:

What size were you using?

Ceph can only obey the failure domain if it knows exactly which osd is 
on which host. So are you sure that there were no errors in the "ceph 
osd tree"? Maybe an osd physically in the offline node was logically 
placed to another host in the tree?



I think you should:

-Query a downed pg, determine its acting group "ceph pg map {pg-num}" 
and compare it to the tree


-Try to make the offline host boot up, or if it was damaged, add 
its OSDs to another host



If this did not help, then please include a ceph health, ceph osd map, 
ceph pg map {faulty pg-num}, ceph pg {faulty pg-num} query
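
For example (the pg id 1.2f is just a placeholder):

ceph health detail
ceph osd tree
ceph pg map 1.2f
ceph pg 1.2f query
ceph pg dump_stuck inactive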



I hope that I helped,
Denes.


On 12/11/2017 03:02 PM, Aaron Bassett wrote:

Morning All,
I have a large-ish (16 node, 1100 osds) cluster I recently had to move from one DC to another. 
Before shutting everything down, I set noout, norecover, and nobackfill, thinking this would 
help everything stand back up again. Upon installation at the new DC, one of the nodes refused 
to boot. With my crush rule having the failure domain as host, I did not think this would be a 
problem. However, once I turned off noout, norecover, and nobackfill, everything else came up 
and settled in, I still have 1545 pgs stuck down+peering. On other pgs, recovery and 
backfilling are proceeding as expected, but these pgs appear to be permanently stuck. When 
querying the down+peering pgs, they all mention pgs from the down node in 
""down_osds_we_would_probe". I'm not sure why it *needs* to query these since 
it should have two other copies on other nodes? I'm not sure if bringing everything up with 
noout or norecover on confused things. Looking for advice...

Aaron
CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about rbd image

2017-12-11 Thread David Turner
An RBD can only be mapped to a single client host.  There is no way around
this.  An RBD at its core is a block device.  Connecting an RBD to 2
servers would be like connecting a harddrive to 2 servers.

On Mon, Dec 11, 2017 at 9:13 AM 13605702596 <13605702...@163.com> wrote:

> hi Jason
> thanks for your answer.
> there is one more question, that is:
> can we use rbd image to share data between two clients? one wirtes data, 
> another just reads?
>
> thanks
>
>
> At 2017-12-11 21:52:54, "Jason Dillaman"  wrote:
> >On Mon, Dec 11, 2017 at 7:50 AM, 13605702...@163.com
> ><13605702...@163.com> wrote:
> >> hi
> >>
> >> i'm testing on rbd image. the are TWO questions that confused me.
> >> ceph -v
> >> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> >> uname -r
> >> 3.10.0-514.el7.x86_64
> >>
> >> (1)  does rbd image supports multiple clients to write data simultaneously?
> >
> >You would need to put a clustered file system like GFS2 on top of the
> >block device to utilize it concurrently.
> >
> >> if it supports, how can share data between several clients using rbd image?
> >> client A: write data to rbd/test
> >> client B: rbd map, and mount it to /mnt, file can be found in /mnt dir, but
> >> the content is miss.
> >>
> >> on monitor:
> >> rbd create rbd/test -s 1024
> >> rbd info rbd/test
> >> rbd image 'test':
> >> size 1024 MB in 256 objects
> >> order 22 (4096 kB objects)
> >> block_name_prefix: rbd_data.121d238e1f29
> >> format: 2
> >> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> >> flags:
> >> then i disble the feature: object-map, fast-diff, deep-flatten
> >>
> >> on client A:
> >> rbd map rbd/test
> >> mkfs -t xfs /dev/rbd0
> >> mount /dev/rbd/rbd/test /mnt/
> >> echo 124 > /mnt/host124
> >> cat host124
> >> 124
> >>
> >> on client B:
> >> rbd map rbd/test
> >> mount /dev/rbd/rbd/test /mnt/
> >> cat  host124  --> show nothing!
> >>
> >> echo 125 > /mnt/host125
> >> cat /mnt/host125
> >> 125
> >>
> >> then on client C:
> >> rbd map rbd/test
> >> mount /dev/rbd/rbd/test /mnt/
> >> cd /mnt
> >> cat host124 --> show nothing!
> >> cat host125 --> show nothing!
> >>
> >> (2) does rbd image supports stripping? if does, howto?
> >
> >Not yet, but it's a work-in-progress for krbd to support "fancy"
> >striping (librbd would support it via rbd-nbd).
> >
> >> on monitor, i create an image as following:
> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> >> 1024 --object-size 4096 --stripe-unit 4096  --stripe-count 2
> >> stripe unit is not a factor of the object size
> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> >> 1024 --object-size 8M --stripe-unit 4M --stripe-count 2
> >> rbd: the argument ('4M') for option '--unit' is invalid
> >> i don't know why those cmd fails?
> >
> >Only Luminous and later releases support specifying the stripe unit
> >with B/K/M suffixes.
> >
> >> finally, i successed with the following cmd:
> >> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> >> 1024 --object-size 8388608 --stripe-unit 4194304  --stripe-count 2
> >>
> >> but whe i map it on client, it fails.
> >> the error msg:
> >> rbd: image test: unsupported stripe unit (got 4194304 want 8388608)
> >>
> >> best wishes
> >> thanks
> >>
> >> 
> >> 13605702...@163.com
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> >
> >--
> >Jason
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2017-12-11 Thread Denes Dolhay

Hi,


The ceph mds keeps all the capabilities for the files; however, the 
clients modify the rados data pool objects directly (they do not do 
the content modification through the mds).


IMHO, IF the file (really) gets corrupted because of a client write (not 
some corruption from the mds / osd), then it can only happen if:


-The first client does not request write cap (lock) for that object 
before write


-OR the MDS does not store that write cap

-OR the MDS does not return the cap to the second client or refuse the 
write for the second concurrent client


-OR the second client does not (request a write cap / check existing 
caps / obey a write-denial result from the mds)


-OR any of the clients writes incorrect data based on an obsolete object 
cache caused by missing / faulty cache eviction (is this even possible?)


*Please correct me if I am wrong in any of the above!!*


If I were in your shoes, I would first test the locking of the cephfs by 
writing two test scripts:


-One would constantly append to a file (like an SMTP server does to a mailbox)

-The other would modify / add / delete parts of this file (like an IMAP 
server does)


And wait for corruption to occur
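
A minimal sketch of such a test, assuming both clients use the same CephFS
mount and the file path is only a placeholder:

# client A: keep appending
while true; do echo "line $(date +%s%N)" >> /mnt/cephfs/locktest.txt; done

# client B: keep rewriting parts of the same file
while true; do sed -i '1d' /mnt/cephfs/locktest.txt; sleep 1; done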


One other thing: it would be interesting to see what the corruption 
really looks like, for example partially overwritten lines.


It would also be interesting to know what part of the file the 
corruption is in: beginning? end? at what percentage?


And whether there was a mailbox compaction around the corruption.


Kind regards,

Denes.



On 12/11/2017 10:33 AM, Florent B wrote:

On 08/12/2017 14:59, Ronny Aasen wrote:

On 08. des. 2017 14:49, Florent B wrote:

On 08/12/2017 14:29, Yan, Zheng wrote:

On Fri, Dec 8, 2017 at 6:51 PM, Florent B  wrote:

I don't know, I didn't touch that setting. Which one is recommended?



If multiple dovecot instances are running at the same time and they
all modify the same files, you need to set fuse_disable_pagecache to
true.

Ok, but in my configuration, each mail user is mapped to a single
server.
So files are accessed only by a single server at a time.


how about mail delivery ? if you use dovecot deliver a delivery can
occur (and rewrite dovecot index/cache) at the same time as a user
accesses imap and writes to dovecot index/cache.


OK, why not, but I never had a problem like this with a previous version of
Ceph. I will try fuse_disable_pagecache...
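
For reference, the option mentioned above is a client-side setting; a ceph.conf
sketch (the exact section depends on the setup):

[client]
fuse_disable_pagecache = true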
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about rbd image

2017-12-11 Thread 13605702596
hi Jason,
thanks for your answer. There is one more question, that is: can we use an 
rbd image to share data between two clients? One writes data, another just 
reads?

thanks


At 2017-12-11 21:52:54, "Jason Dillaman"  wrote:
>On Mon, Dec 11, 2017 at 7:50 AM, 13605702...@163.com
><13605702...@163.com> wrote:
>> hi
>>
>> i'm testing on rbd image. the are TWO questions that confused me.
>> ceph -v
>> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>> uname -r
>> 3.10.0-514.el7.x86_64
>>
>> (1)  does rbd image supports multiple clients to write data simultaneously?
>
>You would need to put a clustered file system like GFS2 on top of the
>block device to utilize it concurrently.
>
>> if it supports, how can share data between several clients using rbd image?
>> client A: write data to rbd/test
>> client B: rbd map, and mount it to /mnt, file can be found in /mnt dir, but
>> the content is miss.
>>
>> on monitor:
>> rbd create rbd/test -s 1024
>> rbd info rbd/test
>> rbd image 'test':
>> size 1024 MB in 256 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.121d238e1f29
>> format: 2
>> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>> flags:
>> then i disble the feature: object-map, fast-diff, deep-flatten
>>
>> on client A:
>> rbd map rbd/test
>> mkfs -t xfs /dev/rbd0
>> mount /dev/rbd/rbd/test /mnt/
>> echo 124 > /mnt/host124
>> cat host124
>> 124
>>
>> on client B:
>> rbd map rbd/test
>> mount /dev/rbd/rbd/test /mnt/
>> cat  host124  --> show nothing!
>>
>> echo 125 > /mnt/host125
>> cat /mnt/host125
>> 125
>>
>> then on client C:
>> rbd map rbd/test
>> mount /dev/rbd/rbd/test /mnt/
>> cd /mnt
>> cat host124 --> show nothing!
>> cat host125 --> show nothing!
>>
>> (2) does rbd image supports stripping? if does, howto?
>
>Not yet, but it's a work-in-progress for krbd to support "fancy"
>striping (librbd would support it via rbd-nbd).
>
>> on monitor, i create an image as following:
>> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
>> 1024 --object-size 4096 --stripe-unit 4096  --stripe-count 2
>> stripe unit is not a factor of the object size
>> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
>> 1024 --object-size 8M --stripe-unit 4M --stripe-count 2
>> rbd: the argument ('4M') for option '--unit' is invalid
>> i don't know why those cmd fails?
>
>Only Luminous and later releases support specifying the stripe unit
>with B/K/M suffixes.
>
>> finally, i successed with the following cmd:
>> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
>> 1024 --object-size 8388608 --stripe-unit 4194304  --stripe-count 2
>>
>> but whe i map it on client, it fails.
>> the error msg:
>> rbd: image test: unsupported stripe unit (got 4194304 want 8388608)
>>
>> best wishes
>> thanks
>>
>> 
>> 13605702...@163.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
>-- 
>Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrade from 12.2.1 to 12.2.2 broke my CephFs

2017-12-11 Thread Tobias Prousa

Hi there,

I'm running a Ceph cluster for some libvirt VMs and a CephFS providing 
/home to ~20 desktop machines. There are 4 hosts running 4 MONs, 4 MGRs, 
3 MDSs (1 active, 2 standby) and 28 OSDs in total. This cluster has been up 
and running since the days of Bobtail (yes, including CephFS).


Now with the update from 12.2.1 to 12.2.2 last Friday afternoon, I 
restarted the MONs, MGRs and OSDs as usual. RBD is running just fine. But 
after trying to restart the MDSs, they tried replaying the journal, then fell 
back to standby and the FS was in state "damaged". I finally got them back 
working after I did a good portion of what's described here:


http://docs.ceph.com/docs/master/cephfs/disaster-recovery/

Now, when all clients are shut down, I can start an MDS; it will replay and 
become active. I can then mount CephFS on a client and access my 
files and folders. But the more clients I bring up, the MDS will first report 
damaged metadata (probably due to some damaged paths, I could live with 
that) and then fail with an assert:


/build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED 
assert(inode_map.count(in->vino()) == 0)


I tried doing an online CephFS scrub like

ceph daemon mds.a scrub_path / recursive repair

This will run for a couple of hours, always finding exactly 10001 damages 
of type "backtrace" and reporting that it is fixing loads of erroneously 
free-marked inodes, until the MDS crashes. When I rerun that scrub after 
having killed all clients and restarted the MDSs, it repeats finding 
exactly those 10001 damages and begins fixing exactly the same 
free-marked inodes all over again.


Btw, CephFS has about 3 million objects in the metadata pool. The data pool 
is about 30 million objects with ~2.5TB * 3 replicas.


What I tried next is keeping MDS down and doing

cephfs-data-scan scan_extents 
cephfs-data-scan scan_inodes 
cephfs-data-scan scan_links

As this is described as taking "a very long time", this is what I initially 
skipped from the disaster-recovery tips. Right now I'm still on the first 
step, with 6 workers on a single host busy doing cephfs-data-scan 
scan_extents. ceph -s shows me client io of 20kB/s (!!!). If that's the real 
scan speed, this is going to take ages.
Is there any way to tell how long this is going to take? Could I speed things 
up by running more workers on multiple hosts simultaneously?
Should I abort it, as I actually don't have the problem of lost files? 
Maybe running cephfs-data-scan scan_links would better suit my issue, or 
do scan_extents/scan_inodes HAVE to be run and finished first?
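
(For what it's worth, the disaster-recovery page shows the scans being
parallelized by giving every worker its own rank out of a total count; a
rough sketch of my understanding, assuming --worker_n is the worker's rank,
--worker_m the total worker count, and a data pool named "cephfs_data" --
the workers can apparently be spread across hosts, as long as each rank
0..N-1 runs exactly once:)

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data
# only after all scan_extents workers have finished, repeat the same
# pattern with scan_inodes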


I have to get this cluster up and running again as soon as possible. Any 
help is highly appreciated. If there is anything I can help with, e.g. 
further information, feel free to ask. I'll try to hang around on #ceph 
(nick topro/topro_/topro__). FYI, I'm in the Central European time zone (UTC+1).


Thank you so much!

Best regards,
Tobi

--
---
Dipl.-Inf. (FH) Tobias Prousa
Head of Data Logger Development

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Limited liability company (GmbH)
Registered office: Olching
Commercial register: Amtsgericht München, HRB 183929
Management: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Stuck down+peering after host failure.

2017-12-11 Thread Aaron Bassett
Morning All,
I have a large-ish (16 node, 1100 osds) cluster I recently had to move from one 
DC to another. Before shutting everything down, I set noout, norecover, and 
nobackfill, thinking this would help everything stand back up again. Upon 
installation at the new DC, one of the nodes refused to boot. With my crush 
rule having the failure domain as host, I did not think this would be a 
problem. However, once I turned off noout, norecover, and nobackfill, and 
everything else came up and settled in, I still have 1545 pgs stuck 
down+peering. On other pgs, recovery and backfilling are proceeding as 
expected, but these pgs appear to be permanently stuck. When querying the 
down+peering pgs, they all mention OSDs from the down node in 
"down_osds_we_would_probe". I'm not sure why it *needs* to query these since 
it should have two other copies on other nodes? I'm not sure if bringing 
everything up with noout or norecover on confused things. Looking for advice...
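
(For reference, the queries I mean look roughly like this; the pg id is just
a placeholder:)

ceph pg dump_stuck inactive
ceph pg 17.1ab query | grep -A5 down_osds_we_would_probe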

Aaron
CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] questions about rbd image

2017-12-11 Thread Jason Dillaman
On Mon, Dec 11, 2017 at 7:50 AM, 13605702...@163.com
<13605702...@163.com> wrote:
> hi
>
> i'm testing on rbd image. the are TWO questions that confused me.
> ceph -v
> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> uname -r
> 3.10.0-514.el7.x86_64
>
> (1)  does rbd image supports multiple clients to write data simultaneously?

You would need to put a clustered file system like GFS2 on top of the
block device to utilize it concurrently.
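
(A very rough sketch of what that entails, assuming an already-working
corosync/dlm cluster named "rbdclu" with two nodes -- setting up that cluster
layer is the hard part and is out of scope here, and the /dev/rbd0 path is
also an assumption:)

# on each node
rbd map rbd/test
# once, from one node: lock_dlm protocol, cluster:fsname label, one journal per node
mkfs.gfs2 -p lock_dlm -t rbdclu:test -j 2 /dev/rbd0
# then on both nodes
mount -t gfs2 /dev/rbd0 /mnt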

> if it supports, how can share data between several clients using rbd image?
> client A: write data to rbd/test
> client B: rbd map, and mount it to /mnt, file can be found in /mnt dir, but
> the content is miss.
>
> on monitor:
> rbd create rbd/test -s 1024
> rbd info rbd/test
> rbd image 'test':
> size 1024 MB in 256 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.121d238e1f29
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> flags:
> then i disble the feature: object-map, fast-diff, deep-flatten
>
> on client A:
> rbd map rbd/test
> mkfs -t xfs /dev/rbd0
> mount /dev/rbd/rbd/test /mnt/
> echo 124 > /mnt/host124
> cat host124
> 124
>
> on client B:
> rbd map rbd/test
> mount /dev/rbd/rbd/test /mnt/
> cat  host124  --> show nothing!
>
> echo 125 > /mnt/host125
> cat /mnt/host125
> 125
>
> then on client C:
> rbd map rbd/test
> mount /dev/rbd/rbd/test /mnt/
> cd /mnt
> cat host124 --> show nothing!
> cat host125 --> show nothing!
>
> (2) does rbd image supports stripping? if does, howto?

Not yet, but it's a work-in-progress for krbd to support "fancy"
striping (librbd would support it via rbd-nbd).

> on monitor, i create an image as following:
> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> 1024 --object-size 4096 --stripe-unit 4096  --stripe-count 2
> stripe unit is not a factor of the object size
> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> 1024 --object-size 8M --stripe-unit 4M --stripe-count 2
> rbd: the argument ('4M') for option '--unit' is invalid
> i don't know why those cmd fails?

Only Luminous and later releases support specifying the stripe unit
with B/K/M suffixes.
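
(In other words, taking the two commands from your mail side by side -- the
byte values correspond to 8 MiB / 4 MiB:)

# Jewel: byte values only
rbd create rbd/test --image-feature layering,striping,exclusive-lock --size 1024 --object-size 8388608 --stripe-unit 4194304 --stripe-count 2
# Luminous and later: suffixes accepted
rbd create rbd/test --image-feature layering,striping,exclusive-lock --size 1024 --object-size 8M --stripe-unit 4M --stripe-count 2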

> finally, i successed with the following cmd:
> rbd create rbd/test --image-feature layering,striping,exclusive-lock --size
> 1024 --object-size 8388608 --stripe-unit 4194304  --stripe-count 2
>
> but whe i map it on client, it fails.
> the error msg:
> rbd: image test: unsupported stripe unit (got 4194304 want 8388608)
>
> best wishes
> thanks
>
> 
> 13605702...@163.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] questions about rbd image

2017-12-11 Thread 13605702...@163.com
hi

I'm testing rbd images. There are TWO questions that confuse me.
ceph -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
uname -r
3.10.0-514.el7.x86_64

(1) Does an rbd image support multiple clients writing data simultaneously? If 
it does, how can data be shared between several clients using an rbd image?
client A: write data to rbd/test
client B: rbd map, and mount it to /mnt; the file can be found in the /mnt dir, 
but the content is missing.

on monitor:
rbd create rbd/test -s 1024
rbd info rbd/test
rbd image 'test':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.121d238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags: 
then I disable the features: object-map, fast-diff, deep-flatten

on client A:
rbd map rbd/test
mkfs -t xfs /dev/rbd0
mount /dev/rbd/rbd/test /mnt/
echo 124 > /mnt/host124
cat host124 
124

on client B:
rbd map rbd/test
mount /dev/rbd/rbd/test /mnt/
cat  host124  --> show nothing!

echo 125 > /mnt/host125
cat /mnt/host125 
125

then on client C: 
rbd map rbd/test
mount /dev/rbd/rbd/test /mnt/
cd /mnt
cat host124 --> show nothing!
cat host125 --> show nothing!

(2) Does an rbd image support striping? If it does, how?

On the monitor, I create an image as follows:
rbd create rbd/test --image-feature layering,striping,exclusive-lock --size 
1024 --object-size 4096 --stripe-unit 4096  --stripe-count 2
stripe unit is not a factor of the object size
rbd create rbd/test --image-feature layering,striping,exclusive-lock --size 
1024 --object-size 8M --stripe-unit 4M --stripe-count 2
rbd: the argument ('4M') for option '--unit' is invalid
I don't know why those commands fail.

Finally, I succeeded with the following command:
rbd create rbd/test --image-feature layering,striping,exclusive-lock --size 
1024 --object-size 8388608 --stripe-unit 4194304  --stripe-count 2

But when I map it on a client, it fails.
The error message:
rbd: image test: unsupported stripe unit (got 4194304 want 8388608)

best wishes
thanks



13605702...@163.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The way to minimize osd memory usage?

2017-12-11 Thread Hans van den Bogert
There are probably multiple reasons. However, I just wanted to chime in that I set 
my cache size to 1G and I constantly see OSD memory converge to ~2.5GB. 

In [1] you can see the difference between a node with 4 OSDs on v12.2.2 on the 
left and a node with 4 OSDs on v12.2.1 on the right. I really hoped that v12.2.2 
would bring the memory usage a bit closer to the cache parameter. Almost 2.5x, 
in contrast to the 3x of 12.2.1, is still quite far off IMO.

Practically, I think it’s not quite possible to have 2 OSDs on your 2GB server, 
let alone have some leeway memory.


[1] https://pasteboard.co/GXHO5eF.png 
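
(For anyone wanting to try the same cap, this is roughly how I set it in
ceph.conf -- values are in bytes; if the generic option is left at 0, the
per-device-class variants are used instead:)

[osd]
    # ~1 GiB of bluestore cache per OSD
    bluestore_cache_size = 1073741824
    # or per device class:
    #bluestore_cache_size_hdd = 1073741824
    #bluestore_cache_size_ssd = 1073741824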

> On Dec 11, 2017, at 3:44 AM, shadow_lin  wrote:
> 
> My workload is mainly seq write(for surveillance usage).I am not sure how 
> cache would effect the write performance and why the memory usage keeps 
> increasing as more data is wrote into ceph storage.
>  
> 2017-12-11 
> lin.yunfan
> From: Peter Woodman 
> Sent: 2017-12-11 05:04
> Subject: Re: [ceph-users] The way to minimize osd memory usage?
> To: "David Turner"
> Cc: "shadow_lin","ceph-users","Konstantin
>  Shalygin"
>  
> I've had some success in this configuration by cutting the bluestore 
> cache size down to 512mb and only one OSD on an 8tb drive. Still get 
> occasional OOMs, but not terrible. Don't expect wonderful performance, 
> though. 
>  
> Two OSDs would really be pushing it. 
>  
> On Sun, Dec 10, 2017 at 10:05 AM, David Turner  wrote: 
> > The docs recommend 1GB/TB of OSDs. I saw people asking if this was still 
> > accurate for bluestore and the answer was that it is more true for 
> > bluestore 
> > than filestore. There might be a way to get this working at the cost of 
> > performance. I would look at Linux kernel memory settings as much as ceph 
> > and bluestore settings. Cache pressure is one that comes to mind that an 
> > aggressive setting might help. 
> > 
> > 
> > On Sat, Dec 9, 2017, 11:33 PM shadow_lin  wrote: 
> >> 
> >> The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) 
> >> we are running is with the memory issues fix.And we are working on to 
> >> upgrade to 12.2.2 release to see if there is any furthermore improvement. 
> >> 
> >> 2017-12-10 
> >>  
> >> lin.yunfan 
> >>  
> >> 
> >> From: Konstantin Shalygin  
> >> Sent: 2017-12-10 12:29 
> >> Subject: Re: [ceph-users] The way to minimize osd memory usage? 
> >> To: "ceph-users" 
> >> Cc: "shadow_lin" 
> >> 
> >> 
> >> > I am testing running ceph luminous(12.2.1-249-g42172a4 
> >> > (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server. 
> >> Try new 12.2.2 - this release should fix memory issues with Bluestore. 
> >> 
> >> ___ 
> >> ceph-users mailing list 
> >> ceph-users@lists.ceph.com 
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> > 
> > ___ 
> > ceph-users mailing list 
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mgr dashboard and cull Removing data for x

2017-12-11 Thread Dan Van Der Ster
Hi all,

I'm playing with the dashboard module in 12.2.2 (and it's very cool!) but I 
noticed that some OSDs do not have metadata, e.g. this page:

http://xxx:7000/osd/perf/74

Has empty metadata. I *am* able to see all the info with `ceph osd metadata 74`.

I noticed in the mgr log we have:

2017-12-11 12:22:19.072613 7fb12df95700  4 mgr cull Removing data for 74
2017-12-11 12:22:19.072629 7fb12df95700  4 mgr cull Removing data for 75
2017-12-11 12:22:19.072640 7fb12df95700  4 mgr cull Removing data for 77
2017-12-11 12:22:19.072653 7fb12df95700  4 mgr cull Removing data for 78
...

This seems like a random set of OSDs from across the cluster. I restarted 
osd.74 and it doesn't change.

Does anyone have a clue what's happening here?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to remove a faulty bucket?

2017-12-11 Thread Martin Emrich
Hi!

Am 09.12.17, 00:19 schrieb "Robin H. Johnson" :

If you use 'radosgw-admin bi list', you can get a listing of the raw bucket
index. I'll bet that the objects aren't being shown at the S3 layer
because something is wrong with them. But since they are in the bi-list,
you'll get 409 BucketNotEmpty.

Yes indeed. Running "radosgw-admin bi list" results in an incomplete 300MB JSON 
file, before it freezes.
FYI: "radosgw-admin bi get" returns "no such file or directory" immediately.


At this point, I've found two different approaches, depending how much
you want to do in rgw vs the S3 APIs.
A) S3 APIs: upload new zero-byte files that match all names from the
   bucket index. Then delete them.
B) 'radosgw-admin object unlink'. This got messy with big multipart
   items.

I'll try the first variant, maybe it at least removes some of the cruft.
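
(My rough plan for that, assuming the bi list dump carries the object names
under entry.name and that the aws cli can reach the rgw endpoint -- untested;
the bucket name, endpoint and jq path are placeholders to adjust:)

jq -r '.[].entry.name' bi-list.json | sort -u | while read -r key; do
    aws --endpoint-url http://my-rgw:7480 s3api put-object --bucket brokenbucket --key "$key"
    aws --endpoint-url http://my-rgw:7480 s3api delete-object --bucket brokenbucket --key "$key"
done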


Other things that can stop deletion of buckets that look empty:
- open/incomplete multipart uploads: run Abort Multipart Upload
  on each upload.
- bucket subresources (cors, website) [iirc this was a bug that got
  fixed].

I have not personally played with editing the bi entries in cases like
this.

There are more drastic ways to delete the entry points into a bucket as
well (but it would otherwise leave the mess around).

I am all ears, as long as it won't touch the other healthy buckets :)
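
(If it ever comes to that, my guess at what the "drastic" route looks like is
removing just the bucket entry-point/instance metadata, leaving the index and
data objects behind -- purely an assumption on my part, untested, and the
names below are placeholders:)

radosgw-admin metadata get bucket:brokenbucket          # note the bucket_id
radosgw-admin bucket unlink --bucket=brokenbucket --uid=owner
radosgw-admin metadata rm bucket:brokenbucket
radosgw-admin metadata rm bucket.instance:brokenbucket:BUCKET_ID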

Thanks,

Martin
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] public/cluster network

2017-12-11 Thread Roman

Hi all,

We would like to implement the following setup.
Our cloud nodes (CNs) for virtual machines have two 10 Gbps NICs: 
10.x.y.z/22 (routed through the backbone) and 172.x.y.z/24 (available 
only on servers within a single rack). CNs and ceph nodes are in the same 
rack. Ceph nodes have two 10 Gbps NICs in the same networks. We are 
going to use 172.x.y.z/24 as the ceph cluster network for all ceph 
components' traffic (i.e. OSD/MGR/MON/MDS). But apart from that, we are 
thinking about using the same network for CN interactions with the ceph 
nodes (since the network within a single rack switch is expected to be 
much faster than the one routed via the backbone).
So 172.x.y.z/24 is for the following: pure ceph traffic and CNs <=> ceph 
nodes; 10.x.y.z/22 is for the rest of the ceph clients, like VMs with 
mounted cephfs shares (since the VMs don't have access to the 172.x.y.z/24 net).
So I wonder if it's possible to implement something like the following: 
always use 172.x.y.z/24 if it is available on both source and destination, 
otherwise use 10.x.y.z/22.

We have just tried to specify the following in ceph.conf:
cluster network = 172.x.y.z/24
public network = 172.x.y.z/24, 10.x.y.z/22

But it doesn't seem to work.
There is an entry in the Red Hat Knowledgebase portal [1] called "Ceph 
Multiple public networks", but no solution is provided yet.


[1] https://access.redhat.com/solutions/1463363
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous rgw hangs after sighup

2017-12-11 Thread Martin Emrich
Hi!

This sounds like http://tracker.ceph.com/issues/20763 (or indeed 
http://tracker.ceph.com/issues/20866).

It is still present in 12.2.2 (just tried it). My workaround is to exclude 
radosgw from being SIGHUPed by logrotate (remove "radosgw" from 
/etc/logrotate.d/ceph), to rotate the logs manually from time to time, and to 
completely restart the radosgw processes one after the other on my radosgw cluster.
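
(Concretely, the manual part looks something like the following on each
gateway; the log file name and the systemd unit name are assumptions, adjust
them to your instance ids:)

mv /var/log/ceph/ceph-rgw-$(hostname -s).log /var/log/ceph/ceph-rgw-$(hostname -s).log.$(date +%Y%m%d)
systemctl restart ceph-radosgw@rgw.$(hostname -s).service
# move on to the next gateway only once this one serves requests again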

Regards,

Martin

Am 08.12.17, 18:58 schrieb "ceph-users im Auftrag von Graham Allan" 
:

I noticed this morning that all four of our rados gateways (luminous 
12.2.2) hung at logrotate time overnight. The last message logged was:

> 2017-12-08 03:21:01.897363 7fac46176700  0 ERROR: failed to clone shard, 
completion_mgr.get_next() returned ret=-125

one of the 3 nodes recorded more detail:
> 2017-12-08 06:51:04.452108 7f80fbfdf700  1 rgw realm reloader: Pausing 
frontends for realm update...
> 2017-12-08 06:51:04.452126 7f80fbfdf700  1 rgw realm reloader: Frontends 
paused
> 2017-12-08 06:51:04.452891 7f8202436700  0 ERROR: failed to clone shard, 
completion_mgr.get_next() returned ret=-125
I remember seeing this happen on our test cluster a while back with 
Kraken. I can't find the tracker issue I originally found related to 
this, but it also sounds like it could be a reversion of bug #20339 or 
#20686?

I recorded some strace output from one of the radosgw instances before 
restarting, if it's useful to open an issue.

-- 
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com