Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2019-09-23 Thread V A Prabha
Dear Jason
   A small update on the setup: the image sync now shows 8% and then
stays at that status... after one day I can see that the image got
replicated to the other side.
Please answer a few of my queries:
   1. Does image sync happen one image at a time, or do all images sync at
the same time?
   2. If the first image is syncing at 8% and the second image is at 0%, do
you think the OSDs cannot reach the rbd-mirror process?
   3. If rsync can transfer the images very easily from one end to the
other, what is the issue with Ceph?
   4. If the benchmarking tool shows that the maximum network bandwidth is
1 Gbps, how can we identify whether there is a bandwidth shaper? (See the
sketch below.)
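For question 4, a simple way to measure the sustained throughput actually
available between the two sites, independent of Ceph, is an iperf3 run
between one host at each site (a hedged sketch; hostnames are placeholders):

```
# on a host at Site-B (the receiving end)
iperf3 -s

# on a host at Site-A: 4 parallel streams for 30 seconds
iperf3 -c <site-b-host> -P 4 -t 30
```

If the combined streams plateau at a rate well below the 1 Gbps link speed,
a shaper or policer on the path is likely.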

Please do help us trace this issue, as our cloud project is stuck because
of this problem with Ceph.

Regards
V.A.Prabha


On August 26, 2019 at 5:23 PM V A Prabha  wrote:
> Dear Jason
> I shall explain my setup first:
> The DR centre is 300 km away from the primary site.
> Site-A - OSD 0 - 1 TB, Mon - 10.236.248.XX/24
> Site-B - OSD 0 - 1 TB, Mon - 10.236.228.XX/27 - rbd-mirror daemon running
> All ports are open and there is no firewall; connectivity exists between the sites.
>
> In my initial setup I used common L2 connectivity between both sites and got the
> same error as now.
> I have changed the configuration to L3 and still get the same result.
>
> root@meghdootctr:~# rbd mirror image status volumes/meghdoot
> meghdoot:
> global_id: 52d9e812-75fe-4a54-8e19-0897d9204af9
> state: up+syncing
> description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
> last_update: 2019-08-26 17:00:21
> Please do point out where I am making a mistake or what is wrong with my configuration.
>
> Site-A:
> [global]
> fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
> mon_initial_members = clouddr
> mon_host = 10.236.247.XX
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 10.236.247.0/24
> osd pool default size = 1
> mon_allow_pool_delete = true
> rbd default features = 125
>
> Site-B:
> [global]
> fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
> mon_initial_members = meghdootctr
> mon_host = 10.236.228.XX
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 10.236.228.64/27
> osd pool default size = 1
> mon_allow_pool_delete = true
> rbd default features = 125
>
> Regards
> V.A.Prabha
>
> On August 20, 2019 at 7:00 PM Jason Dillaman  wrote:
>
> > On Tue, Aug 20, 2019 at 9:23 AM V A Prabha < prab...@cdac.in
> >  > wrote:
> > > > I too face the same problem as mentioned by Sat
> > > All the images created at the primary site are in the state : down+
> > > unknown
> > > Hence in the secondary site the images is 0 % up + syncing all time
> > > No progress
> > > The only error log that is continuously hitting is
> > > 2019-08-20 18:04:38.556908 7f7d4cba3700 -1
> > > rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x7f7d4000f650
> > > finish: resending after timeout
> > > >
> > This sounds like your rbd-mirror daemon cannot contact all OSDs. Double
> > check
> > your network connectivity and firewall to ensure that rbd-mirror daemon can
> > connect to *both* Ceph clusters (local and remote).
> >
> > > >
> > >
> > > The setup is as follows
> > > One OSD created in the primary site with cluster name [site-a] and one
> > > OSD created in the secondary site with cluster name [site-b] both have the
> > > same ceph.conf file
> > > RBD mirror is installed at the secondary site [ which is 300kms away
> > > from the primary site]
> > > We are trying to merge this with our Cloud but the cinder volume fails
> > > syncing everytime
> > > Primary Site Output
> > > root@clouddr:/etc/ceph# rbd mirror pool status volumesnew --verbose
> > > health: WARNING
> > > images: 4 total
> > > 4 unknown
> > > boss123:
> > > global_id: 7285ed6d-46f4-4345-b597-d24911a110f8
> > > state: down+unknown
> > > description: status not found
> > > last_update:
> > > new123:
> > > global_id: e9f2dd7e-b0ac-4138-bce5-318b40e9119e
> > > state: down+unknown
> > > description: status not found
> > > last_update:
> > >
> > > root@clouddr:/etc/ceph# rbd mirror pool info volumesnew
> > > Mode: pool
> > > Peers: none
> > > root@clouddr:/etc/ceph# rbd mirror pool status volumesnew
> > > health: WARNING
> > > images: 4 total
> > > 4 unknown
> > >
> > > Secondary Site
> > > root@meghdootctr:~# rbd mirror image status volumesnew/boss123
> > > boss123:
> > > global_id: 7285ed6d-46f4-4345-b597-d24911a110f8
> > > state: up+syncing
> > > description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
> > > last_update: 2019-08-20 17:24:18
> > > Please help me to identify where do I miss something
> > >
> > > Regards
> > > V.A.Prabha

Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2019-08-26 Thread Jason Dillaman
On Mon, Aug 26, 2019 at 7:54 AM V A Prabha  wrote:
>
> Dear Jason
>   I shall explain my setup first:
>   The DR centre is 300 km away from the primary site.
>   Site-A - OSD 0 - 1 TB, Mon - 10.236.248.XX/24
>   Site-B - OSD 0 - 1 TB, Mon - 10.236.228.XX/27 - rbd-mirror daemon running
>   All ports are open and there is no firewall; connectivity exists between the sites.
>
>   In my initial setup I used common L2 connectivity between both sites and got the
> same error as now.
>   I have changed the configuration to L3 and still get the same result.
>
> root@meghdootctr:~# rbd mirror image status volumes/meghdoot
> meghdoot:
>   global_id:   52d9e812-75fe-4a54-8e19-0897d9204af9
>   state:   up+syncing
>   description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
>   last_update: 2019-08-26 17:00:21
> Please do point out where I am making a mistake or what is wrong with my configuration.

No clue what's wrong w/ your site. Best suggestion that I could offer
would be to enable "debug rbd_mirror=20" / "debug rbd=20" logging for
rbd-mirror and see where it's hanging.
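A rough sketch of one way to do that (assuming the rbd-mirror daemon reads the
local ceph.conf and is restarted afterwards; the log path is the usual default):

```
# in ceph.conf on the host running rbd-mirror, then restart the daemon
[client]
    debug rbd = 20
    debug rbd_mirror = 20
    log file = /var/log/ceph/$cluster-$name.log
```

The resulting log should show how far the image copy gets before it stalls.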

> Site-A:
> [global]
> fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
> mon_initial_members = clouddr
> mon_host = 10.236.247.XX
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 10.236.247.0/24
> osd pool default size = 1
> mon_allow_pool_delete = true
> rbd default features = 125
>
> Site-B:
> [global]
> fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
> mon_initial_members = meghdootctr
> mon_host = 10.236.228.XX
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 10.236.228.64/27
> osd pool default size = 1
> mon_allow_pool_delete = true
> rbd default features = 125
>
> Regards
> V.A.Prabha
>
> On August 20, 2019 at 7:00 PM Jason Dillaman  wrote:
>
> On Tue, Aug 20, 2019 at 9:23 AM V A Prabha < prab...@cdac.in> wrote:
>
> I too face the same problem as mentioned by Sat
>   All the images created at the primary site are in the state : down+ unknown
>   Hence in the secondary site the images is 0 % up + syncing all time No 
> progress
>   The only error log that is continuously hitting is
>   2019-08-20 18:04:38.556908 7f7d4cba3700 -1 rbd::mirror::InstanceWatcher: 
> C_NotifyInstanceRequest: 0x7f7d4000f650 finish: resending after timeout
>
>
> This sounds like your rbd-mirror daemon cannot contact all OSDs. Double check 
> your network connectivity and firewall to ensure that rbd-mirror daemon can 
> connect to *both* Ceph clusters (local and remote).
>
>
>
>
>   The setup is as follows
>One OSD created in the primary site with cluster name [site-a] and one OSD 
> created in the secondary site with cluster name [site-b] both have the same 
> ceph.conf file
>RBD mirror is installed at the secondary site [ which is 300kms away from 
> the primary site]
>We are trying to merge this with our Cloud but the cinder volume fails 
> syncing everytime
>   Primary Site Output
> root@clouddr:/etc/ceph# rbd mirror pool status volumesnew --verbose
> health: WARNING
> images: 4 total
> 4 unknown
> boss123:
>  global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
>  state:   down+unknown
>  description: status not found
>  last_update:
>  new123:
>  global_id:   e9f2dd7e-b0ac-4138-bce5-318b40e9119e
>  state:   down+unknown
>  description: status not found
>  last_update:
>
> root@clouddr:/etc/ceph# rbd mirror pool info volumesnew
> Mode: pool
> Peers: none
> root@clouddr:/etc/ceph# rbd mirror pool status volumesnew
> health: WARNING
> images: 4 total
> 4 unknown
>
> Secondary Site
> root@meghdootctr:~# rbd mirror image status volumesnew/boss123
> boss123:
>   global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
>   state:   up+syncing
>   description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
>   last_update: 2019-08-20 17:24:18
> Please help me to identify where do I miss something
>
> Regards
> V.A.Prabha
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>
>
>
>
> 

Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2019-08-26 Thread V A Prabha
Dear Jason
  I shall explain my setup first:
  The DR centre is 300 km away from the primary site.
  Site-A - OSD 0 - 1 TB, Mon - 10.236.248.XX/24
  Site-B - OSD 0 - 1 TB, Mon - 10.236.228.XX/27 - rbd-mirror daemon running
  All ports are open and there is no firewall; connectivity exists between the sites.

  In my initial setup I used common L2 connectivity between both sites and got the
same error as now.
  I have changed the configuration to L3 and still get the same result.

root@meghdootctr:~# rbd mirror image status volumes/meghdoot
meghdoot:
  global_id:   52d9e812-75fe-4a54-8e19-0897d9204af9
  state:   up+syncing
  description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
  last_update: 2019-08-26 17:00:21
Please do point out where I am making a mistake or what is wrong with my configuration.

Site-A:
[global]
fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
mon_initial_members = clouddr
mon_host = 10.236.247.XX
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.236.247.0/24
osd pool default size = 1
mon_allow_pool_delete = true
rbd default features = 125

Site-B:
[global]
fsid = 494971c1-75e7-4866-b9fb-e98cb8171473
mon_initial_members = meghdootctr
mon_host = 10.236.228.XX
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.236.228.64/27
osd pool default size = 1
mon_allow_pool_delete = true
rbd default features = 125

Regards
V.A.Prabha

On August 20, 2019 at 7:00 PM Jason Dillaman  wrote:

> On Tue, Aug 20, 2019 at 9:23 AM V A Prabha <prab...@cdac.in> wrote:
> > I too face the same problem as mentioned by Sat
> >  All the images created at the primary site are in the state : down+
> > unknown
> >  Hence in the secondary site the images is 0 % up + syncing all time
> > No progress
> >  The only error log that is continuously hitting is
> >  2019-08-20 18:04:38.556908 7f7d4cba3700 -1
> > rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x7f7d4000f650
> > finish: resending after timeout
> >  > 
>  This sounds like your rbd-mirror daemon cannot contact all OSDs. Double check
> your network connectivity and firewall to ensure that rbd-mirror daemon can
> connect to *both* Ceph clusters (local and remote).
> 
>> > 
> > 
> >  The setup is as follows
> >   One OSD created in the primary site with cluster name [site-a] and one
> > OSD created in the secondary site with cluster name [site-b] both have the
> > same ceph.conf file
> >   RBD mirror is installed at the secondary site [ which is 300kms away
> > from the primary site]
> >   We are trying to merge this with our Cloud but the cinder volume fails
> > syncing everytime
> >  Primary Site Output
> >root@clouddr:/etc/ceph# rbd mirror pool status volumesnew --verbose
> >health: WARNING
> >images: 4 total
> >4 unknown
> >boss123:
> > global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
> > state:   down+unknown
> > description: status not found
> > last_update:
> > new123:
> > global_id:   e9f2dd7e-b0ac-4138-bce5-318b40e9119e
> > state:   down+unknown
> > description: status not found
> > last_update:
> > 
> >root@clouddr:/etc/ceph# rbd mirror pool info volumesnew
> >Mode: pool
> >Peers: none
> >root@clouddr:/etc/ceph# rbd mirror pool status volumesnew
> >health: WARNING
> >images: 4 total
> >4 unknown
> > 
> >Secondary Site
> >root@meghdootctr:~# rbd mirror image status volumesnew/boss123
> >boss123:
> >  global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
> >  state:   up+syncing
> >  description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
> >  last_update: 2019-08-20 17:24:18
> >Please help me to identify where do I miss something
> > 
> >Regards
> >V.A.Prabha

Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2019-08-20 Thread Jason Dillaman
On Tue, Aug 20, 2019 at 9:23 AM V A Prabha  wrote:

> I too face the same problem as mentioned by Sat
>   All the images created at the primary site are in the state : down+
> unknown
>   Hence in the secondary site the images is 0 % up + syncing all time
> No progress
>   The only error log that is continuously hitting is
>   2019-08-20 18:04:38.556908 7f7d4cba3700 -1
> rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x7f7d4000f650
> finish: resending after timeout
>

This sounds like your rbd-mirror daemon cannot contact all OSDs. Double
check your network connectivity and firewall to ensure that rbd-mirror
daemon can connect to *both* Ceph clusters (local and remote).
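A quick, hedged way to check that from the host running rbd-mirror (cluster
names, hosts and ports below are placeholders; 6789 is the default monitor
port and OSDs use the 6800-7300 range by default):

```
# assumes both clusters' conf and keyring files are present on this host
ceph --cluster site-a -s
ceph --cluster site-b -s

# raw TCP reachability to a remote monitor and an OSD host
nc -zv <site-a-mon-ip> 6789
nc -zv <site-a-osd-ip> 6800    # repeat across the 6800-7300 OSD port range
```

If the monitor is reachable but the OSD ports are not, rbd-mirror can still
join the cluster yet stall on any image I/O, which matches the symptom above.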


>
>
>   The setup is as follows
>One OSD created in the primary site with cluster name [site-a] and one
> OSD created in the secondary site with cluster name [site-b] both have the
> same ceph.conf file
>RBD mirror is installed at the secondary site [ which is 300kms away
> from the primary site]
>We are trying to merge this with our Cloud but the cinder volume fails
> syncing everytime
> Primary Site Output
> root@clouddr:/etc/ceph# rbd mirror pool status volumesnew --verbose
> health: WARNING
> images: 4 total
> 4 unknown
> boss123:
>  global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
>  state:   down+unknown
>  description: status not found
>  last_update:
>  new123:
>  global_id:   e9f2dd7e-b0ac-4138-bce5-318b40e9119e
>  state:   down+unknown
>  description: status not found
>  last_update:
>
> root@clouddr:/etc/ceph# rbd mirror pool info volumesnew
> Mode: pool
> Peers: none
> root@clouddr:/etc/ceph# rbd mirror pool status volumesnew
> health: WARNING
> images: 4 total
> 4 unknown
>
> Secondary Site
>
> root@meghdootctr:~# rbd mirror image status volumesnew/boss123
> boss123:
>   global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
>   state:   up+syncing
>   description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
>   last_update: 2019-08-20 17:24:18
>
> Please help me to identify where do I miss something
>
> Regards
> V.A.Prabha
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2019-08-20 Thread V A Prabha
I too face the same problem as mentioned by Sat.
  All the images created at the primary site are in the state: down+unknown.
  Hence at the secondary site the images stay at 0%, up+syncing, all the time,
with no progress.
  The only error that is continuously logged is:
  2019-08-20 18:04:38.556908 7f7d4cba3700 -1 rbd::mirror::InstanceWatcher:
C_NotifyInstanceRequest: 0x7f7d4000f650 finish: resending after timeout

  The setup is as follows:
   One OSD created at the primary site with cluster name [site-a] and one OSD
created at the secondary site with cluster name [site-b]; both have the same
ceph.conf file.
   rbd-mirror is installed at the secondary site [which is 300 km away from the
primary site].
   We are trying to integrate this with our cloud, but the Cinder volume fails
to sync every time.
  Primary Site Output
root@clouddr:/etc/ceph# rbd mirror pool status volumesnew --verbose
health: WARNING
images: 4 total
4 unknown
boss123:
 global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
 state:   down+unknown
 description: status not found
 last_update:
 new123:
 global_id:   e9f2dd7e-b0ac-4138-bce5-318b40e9119e
 state:   down+unknown
 description: status not found
 last_update:

root@clouddr:/etc/ceph# rbd mirror pool info volumesnew
Mode: pool
Peers: none
root@clouddr:/etc/ceph# rbd mirror pool status volumesnew
health: WARNING
images: 4 total
4 unknown

Secondary Site

root@meghdootctr:~# rbd mirror image status volumesnew/boss123
boss123:
  global_id:   7285ed6d-46f4-4345-b597-d24911a110f8
  state:   up+syncing
  description: bootstrapping, IMAGE_COPY/COPY_OBJECT 0%
  last_update: 2019-08-20 17:24:18

Please help me identify what I am missing.

Regards
V.A.Prabha



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question to developers about iscsi

2019-08-14 Thread Fyodor Ustinov
Hi!

As I understand it, the iSCSI gateway is part of Ceph.

Documentation says:
Note
The iSCSI management functionality of Ceph Dashboard depends on the latest 
version 3 of the ceph-iscsi project. Make sure that your operating system 
provides the correct version, otherwise the dashboard won’t enable the 
management features. 

Questions:
Where can I download a ready-to-install deb package of version 3 of the
ceph-iscsi project? Or an rpm package?
Why "version 3"? Why not "14.2.2"?

Or is ceph-iscsi not part of Ceph?

WBR,
Fyodor.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question regarding client-network

2019-01-31 Thread Buchberger, Carsten
Thank you - we were expecting that, but wanted to be sure.
By the way - we are running our clusters on IPv6-BGP, to achieve massive 
scalability and load-balancing ;-)

Kind regards
Carsten Buchberger


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question regarding client-network

2019-01-30 Thread Robert Sander
On 30.01.19 08:55, Buchberger, Carsten wrote:

> So as long as there is ip-connectivity between the client, and the
> client-network ip –adressses of our ceph-cluster everything is fine ?

Yes, client traffic is routable.

Even inter-OSD traffic is routable, there are reports from people
running routing protocols inside their Ceph clusters.
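To illustrate (addresses are placeholders, not taken from the original mail):
the two networks are simply declared in ceph.conf, and clients only need IP
reachability to the public-network addresses of the MONs and OSDs.

```
[global]
    # client-facing traffic: MON and OSD front side, routable to clients
    public network  = 192.0.2.0/24
    # OSD-to-OSD replication and recovery traffic
    cluster network = 198.51.100.0/24
```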

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question regarding client-network

2019-01-29 Thread Buchberger, Carsten
Hi,

it might be a dumb question - our Ceph cluster runs with dedicated client and
cluster networks.

I understand it like this: the client network is the network interface where
client connections arrive (from the mon and OSD perspective), regardless of
the source IP address.
So as long as there is IP connectivity between the client and the
client-network IP addresses of our Ceph cluster, everything is fine?
Or is the client network on the Ceph side some kind of ACL that denies access
if the client does not originate from the defined network? The latter would
be bad ;-)

Best regards
Carsten Buchberger


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about 'firstn|indep'

2018-08-23 Thread Gregory Farnum
On Thu, Aug 23, 2018 at 10:21 AM Cody  wrote:

> So, is it okay to say that compared to the 'firstn' mode, the 'indep'
> mode may have the least impact on a cluster in an event of OSD
> failure? Could I use 'indep' for replica pool as well?
>

You could, but shouldn't. Imagine if the primary OSD fails and you're using
indep: then the new primary won't know anything at all about the PG, so
it's just going to have to set a pgtemp mapping that gives it back to one
of the old nodes anyway!

In the EC case that happens too, but it's unavoidable: all nodes have
individual data stored, so on the loss of a primary you're going to need a
few more round-trips anyway (and in fact EC pools regularly have a primary
which isn't the first in the list, unlike replicated ones).
-Greg


>
> Thank you!
>
> Regards,
> Cody
> On Wed, Aug 22, 2018 at 7:12 PM Gregory Farnum  wrote:
> >
> > On Wed, Aug 22, 2018 at 12:56 AM Konstantin Shalygin 
> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > I read an earlier thread [1] that made a good explanation on the 'step
> >> > choose|chooseleaf' option. Could someone further help me to understand
> >> > the 'firstn|indep' part? Also, what is the relationship between 'step
> >> > take' and 'step choose|chooseleaf' when it comes to define a failure
> >> > domain?
> >> >
> >> > Thank you very much.
> >>
> >>
> >> This documented on CRUSH Map Rules [1]
> >>
> >>
> >> [1]
> >>
> http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#crush-map-rules
> >>
> >
> > But that doesn't seem to really discuss it, and I don't see it elsewhere
> in our docs either. So:
> >
> > "indep" and "firstn" are two different strategies for selecting items
> (mostly, OSDs) in a CRUSH hierarchy. If you're storing EC data you want to
> use indep; if you're storing replicated data you want to use firstn.
> >
> > The reason has to do with how they behave when a previously-selected
> devices fails. Let's say you have a PG stored on OSDs 1, 2, 3, 4, 5. Then 3
> goes down.
> > With the "firstn" mode, CRUSH simply adjusts its calculation in a way
> that it selects 1 and 2, then selects 3 but discovers it's down, so it
> retries and selects 4 and 5, and then goes on to select a new OSD 6. So the
> final CRUSH mapping change is
> > 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.
> >
> > But if you're storing an EC pool, that means you just changed the data
> mapped to OSDs 4, 5, and 6! That's terrible! So the "indep" mode attempts
> to not do that. (It still *might* conflict, but the odds are much lower).
> You can instead expect it, when it selects the failed 3, to try again and
> pick out 6, for a final transformation of:
> > 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5
> > -Greg
> >
> >>
> >>
> >>
> >> k
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about 'firstn|indep'

2018-08-23 Thread Cody
So, is it okay to say that compared to the 'firstn' mode, the 'indep'
mode may have the least impact on a cluster in an event of OSD
failure? Could I use 'indep' for replica pool as well?

Thank you!

Regards,
Cody
On Wed, Aug 22, 2018 at 7:12 PM Gregory Farnum  wrote:
>
> On Wed, Aug 22, 2018 at 12:56 AM Konstantin Shalygin  wrote:
>>
>> > Hi everyone,
>> >
>> > I read an earlier thread [1] that made a good explanation on the 'step
>> > choose|chooseleaf' option. Could someone further help me to understand
>> > the 'firstn|indep' part? Also, what is the relationship between 'step
>> > take' and 'step choose|chooseleaf' when it comes to define a failure
>> > domain?
>> >
>> > Thank you very much.
>>
>>
>> This documented on CRUSH Map Rules [1]
>>
>>
>> [1]
>> http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#crush-map-rules
>>
>
> But that doesn't seem to really discuss it, and I don't see it elsewhere in 
> our docs either. So:
>
> "indep" and "firstn" are two different strategies for selecting items 
> (mostly, OSDs) in a CRUSH hierarchy. If you're storing EC data you want to 
> use indep; if you're storing replicated data you want to use firstn.
>
> The reason has to do with how they behave when a previously-selected devices 
> fails. Let's say you have a PG stored on OSDs 1, 2, 3, 4, 5. Then 3 goes down.
> With the "firstn" mode, CRUSH simply adjusts its calculation in a way that it 
> selects 1 and 2, then selects 3 but discovers it's down, so it retries and 
> selects 4 and 5, and then goes on to select a new OSD 6. So the final CRUSH 
> mapping change is
> 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.
>
> But if you're storing an EC pool, that means you just changed the data mapped 
> to OSDs 4, 5, and 6! That's terrible! So the "indep" mode attempts to not do 
> that. (It still *might* conflict, but the odds are much lower). You can 
> instead expect it, when it selects the failed 3, to try again and pick out 6, 
> for a final transformation of:
> 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5
> -Greg
>
>>
>>
>>
>> k
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [question] one-way RBD mirroring doesn't work

2018-08-23 Thread Jason Dillaman
On Thu, Aug 23, 2018 at 10:56 AM sat  wrote:
>
> Hi,
>
>
> I'm trying to make a one-way RBD mirroed cluster between two Ceph clusters. 
> But it
> hasn't worked yet. It seems to sucecss, but after making an RBD image from 
> local cluster,
> it's considered as "unknown".
>
> ```
> $ sudo rbd --cluster local create rbd/local.img --size=1G 
> --image-feature=exclusive-lock,journaling
> $ sudo rbd --cluster local ls rbd
> local.img
> $ sudo rbd --cluster remote ls rbd
> local.img
> $ sudo rbd --cluster local mirror pool status rbd
> health: WARNING
> images: 1 total
> 1 unknown
> $ sudo rbd --cluster remote mirror pool status rbd
> health: OK
> images: 1 total
> 1 replaying
> $
> ```
>
> Could you tell me what is wrong?

Nothing -- with one-directional RBD mirroring, only the receive side
reports status. If you started an rbd-mirror daemon against the
"local" cluster, it would report as healthy w/ that particular image
in the "stopped" state since it's primary.

>
> # detail
>
> There are two clusters, named "local" and "remote". "remote" is the mirror of 
> "local".
> Both two clusters has a pool, named "rbd".
>
> ## system environment
>
> - OS: ubuntu 16.04
> - kernel: 4.4.0-112-generic
> - ceph: luminous 12.2.5
>
> ## system configuration diagram
>
> ==
> +- manager(192.168.33.2): manipulate two clusters,
> |
> +- node0(192.168.33.3): "local"'s MON, MGR, and OSD0
> |
> +- node1(192.168.33.4); "local"'s OSD1
> |
> +- node2(192.168.33.5); "local"'s OSD2
> |
> +- remote-node0(192.168.33.7): "remote"'s MON, MGR, OSD0, and ceph-rbd-mirror
> |
> +- remote-node1(192.168.33.8); "remote"'s OSD1
> |
> +- remote-node2(192.168.33.9); "remote"'s OSD2
> 
>
> # Step to reproduce
>
> 1. Prepare two clusters "local" and "remote"
>
> ```
> $ sudo ceph --cluster local -s
>   cluster:
>   id: 9faca802-745d-43d8-b572-16617e553a5f
>   health: HEALTH_WARN
>   application not enabled on 1 pool(s)
>
>   services:
>   mon: 1 daemons, quorum 0
>   mgr: 0(active)
>   osd: 3 osds: 3 up, 3 in
>
>   data:
>   pools:   1 pools, 128 pgs
>   objects: 16 objects, 12395 kB
>   usage:   3111 MB used, 27596 MB / 30708 MB avail
>   pgs: 128 active+clean
>
>   io:
>   client:   852 B/s rd, 0 op/s rd, 0 op/s wr
>
> $ sudo ceph --cluster remote -s
>   cluster:
>   id: 1ecb0aa6-5a00-4946-bdba-bad78bfa4372
>   health: HEALTH_WARN
>   application not enabled on 1 pool(s)
>
>   services:
>   mon:1 daemons, quorum 0
>   mgr:0(active)
>   osd:3 osds: 3 up, 3 in
>   rbd-mirror: 1 daemon active
>
>   data:
>   pools:   1 pools, 128 pgs
>   objects: 18 objects, 7239 kB
>   usage:   3100 MB used, 27607 MB / 30708 MB avail
>   pgs: 128 active+clean
>
>   io:
>   client:   39403 B/s rd, 0 B/s wr, 4 op/s rd, 0 op/s wr
>
> $
> 
>
>
> Two clusters looks fine.
>
> 2. Setup one-way RBD pool mirroring from "local" to "remote"
>
> Setup an RBD pool mirroring between "local" and "remote" with the following 
> steps.
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/block_device_guide/block_device_mirroring
>
> Both cluster's status look fine as follows.
>
> ```
> $ sudo rbd --cluster local mirror pool info rbd
> Mode: pool
> Peers: none
> $ sudo rbd --cluster local mirror pool status rbd
> health: OK
> images: 0 total
> $ sudo rbd --cluster remote mirror pool info rbd
> Mode: pool
> Peers:
>   UUID NAME  CLIENT
>   53fb3a9a-c451-4552-b409-c08709ebe1a9 local client.local
> $ sudo rbd --cluster remote mirror pool status rbd
> health: OK
> images: 0 total
> $
> ```
> 3. Create an RBD image
>
> ```
> $ sudo rbd --cluster local create rbd/local.img --size=1G 
> --image-feature=exclusive-lock,journaling
> $ sudo rbd --cluster local ls rbd
> local.img
> $ sudo rbd --cluster remote ls rbd
> local.img
> $
> ```
>
> "rbd/local.img" seemd to be created and be mirrored fine.
>
> 4. Check both cluster's status and info
>
> Execute "rbd mirror pool info/status " and "info/status 
> " for
> both clusters.
>
> ## expected result
>
> Both "local" and "remote" report fine state.
>
> ## actual result
>
> Although "remote" works fine, but "local" reports seems not fine.
>
> ```
> $ sudo rbd --cluster local mirror pool info rbd
> Mode: pool
> Peers: none
> $ sudo rbd --cluster local mirror pool status rbd
> health: WARNING
> images: 1 total
> 1 unknown
> $ sudo rbd --cluster local info rbd/local.img
> rbd image 'local.img':
> size 1024 MB in 256 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.10336b8b4567
> format: 2
> features: exclusive-lock, journaling
> flags:
> 

[ceph-users] [question] one-way RBD mirroring doesn't work

2018-08-23 Thread sat
Hi,


I'm trying to set up one-way RBD mirroring between two Ceph clusters, but it
hasn't worked yet. It seems to succeed, but after creating an RBD image on the
local cluster, the image is reported as "unknown".

```
$ sudo rbd --cluster local create rbd/local.img --size=1G 
--image-feature=exclusive-lock,journaling
$ sudo rbd --cluster local ls rbd
local.img
$ sudo rbd --cluster remote ls rbd
local.img
$ sudo rbd --cluster local mirror pool status rbd
health: WARNING
images: 1 total
    1 unknown
$ sudo rbd --cluster remote mirror pool status rbd
health: OK
images: 1 total
    1 replaying
$ 
```

Could you tell me what is wrong?

# detail

There are two clusters, named "local" and "remote"; "remote" is the mirror of
"local". Both clusters have a pool named "rbd".

## system environment

- OS: ubuntu 16.04
- kernel: 4.4.0-112-generic
- ceph: luminous 12.2.5

## system configuration diagram

==
+- manager(192.168.33.2): manipulate two clusters, 
|
+- node0(192.168.33.3): "local"'s MON, MGR, and OSD0
|
+- node1(192.168.33.4); "local"'s OSD1
|
+- node2(192.168.33.5); "local"'s OSD2
|
+- remote-node0(192.168.33.7): "remote"'s MON, MGR, OSD0, and ceph-rbd-mirror
|
+- remote-node1(192.168.33.8); "remote"'s OSD1
|
+- remote-node2(192.168.33.9); "remote"'s OSD2


# Step to reproduce

1. Prepare two clusters "local" and "remote"

```
$ sudo ceph --cluster local -s
  cluster:
    id:     9faca802-745d-43d8-b572-16617e553a5f
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

  services:
    mon: 1 daemons, quorum 0
    mgr: 0(active)
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   1 pools, 128 pgs
    objects: 16 objects, 12395 kB
    usage:   3111 MB used, 27596 MB / 30708 MB avail
    pgs:     128 active+clean

  io:
    client:   852 B/s rd, 0 op/s rd, 0 op/s wr

$ sudo ceph --cluster remote -s
  cluster:
    id:     1ecb0aa6-5a00-4946-bdba-bad78bfa4372
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

  services:
    mon:        1 daemons, quorum 0
    mgr:        0(active)
    osd:        3 osds: 3 up, 3 in
    rbd-mirror: 1 daemon active

  data:
    pools:   1 pools, 128 pgs
    objects: 18 objects, 7239 kB
    usage:   3100 MB used, 27607 MB / 30708 MB avail
    pgs:     128 active+clean

  io:
    client:   39403 B/s rd, 0 B/s wr, 4 op/s rd, 0 op/s wr

$
```


Both clusters look fine.

2. Setup one-way RBD pool mirroring from "local" to "remote"

Setup an RBD pool mirroring between "local" and "remote" with the following 
steps.

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/block_device_guide/block_device_mirroring

Both cluster's status look fine as follows.

```
$ sudo rbd --cluster local mirror pool info rbd
Mode: pool
Peers: none
$ sudo rbd --cluster local mirror pool status rbd
health: OK
images: 0 total
$ sudo rbd --cluster remote mirror pool info rbd
Mode: pool
Peers:
  UUID                                 NAME  CLIENT
  53fb3a9a-c451-4552-b409-c08709ebe1a9 local client.local
$ sudo rbd --cluster remote mirror pool status rbd
health: OK
images: 0 total
$ 
```
3. Create an RBD image

```
$ sudo rbd --cluster local create rbd/local.img --size=1G 
--image-feature=exclusive-lock,journaling
$ sudo rbd --cluster local ls rbd
local.img
$ sudo rbd --cluster remote ls rbd
local.img
$ 
```

"rbd/local.img" seemd to be created and be mirrored fine.

4. Check both cluster's status and info

Execute "rbd mirror pool info/status " and "info/status " 
for
both clusters.

## expected result

Both "local" and "remote" report fine state.

## actual result

Although "remote" works fine, but "local" reports seems not fine.

```
$ sudo rbd --cluster local mirror pool info rbd
Mode: pool
Peers: none
$ sudo rbd --cluster local mirror pool status rbd
health: WARNING
images: 1 total
    1 unknown
$ sudo rbd --cluster local info rbd/local.img
rbd image 'local.img':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.10336b8b4567
        format: 2
        features: exclusive-lock, journaling
        flags:
        create_timestamp: Mon Aug 20 06:01:29 2018
        journal: 10336b8b4567
        mirroring state: enabled
        mirroring global id: 447731a7-73ce-448d-90ac-38d05065f603
        mirroring primary: true
$ sudo rbd --cluster local status rbd/local.img
Watchers: none
$ sudo rbd --cluster remote mirror pool info rbd
Mode: pool
Peers:
  UUID                                 NAME  CLIENT
    53fb3a9a-c451-4552-b409-c08709ebe1a9 local client.local
$ sudo rbd --cluster remote mirror pool status rbd
health: OK
images: 1 total
    1 replaying
$ sudo rbd --cluster remote info rbd/local.img
rbd image 'local.img':
        

Re: [ceph-users] Question about 'firstn|indep'

2018-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2018 at 12:56 AM Konstantin Shalygin  wrote:

> > Hi everyone,
> >
> > I read an earlier thread [1] that made a good explanation on the 'step
> > choose|chooseleaf' option. Could someone further help me to understand
> > the 'firstn|indep' part? Also, what is the relationship between 'step
> > take' and 'step choose|chooseleaf' when it comes to define a failure
> > domain?
> >
> > Thank you very much.
>
>
> This documented on CRUSH Map Rules [1]
>
>
> [1]
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#crush-map-rules
>
>
But that doesn't seem to really discuss it, and I don't see it elsewhere in
our docs either. So:

"indep" and "firstn" are two different strategies for selecting items
(mostly, OSDs) in a CRUSH hierarchy. If you're storing EC data you want to
use indep; if you're storing replicated data you want to use firstn.

The reason has to do with how they behave when a previously-selected
devices fails. Let's say you have a PG stored on OSDs 1, 2, 3, 4, 5. Then 3
goes down.
With the "firstn" mode, CRUSH simply adjusts its calculation in a way that
it selects 1 and 2, then selects 3 but discovers it's down, so it retries
and selects 4 and 5, and then goes on to select a new OSD 6. So the final
CRUSH mapping change is
1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6.

But if you're storing an EC pool, that means you just changed the data
mapped to OSDs 4, 5, and 6! That's terrible! So the "indep" mode attempts
to not do that. (It still *might* conflict, but the odds are much lower).
You can instead expect it, when it selects the failed 3, to try again and
pick out 6, for a final transformation of:
1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5
-Greg
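A minimal sketch of how the two modes appear in CRUSH rule syntax (the rule
names, ids and the host failure domain are illustrative, not taken from this
thread):

```
# replicated pools: firstn
rule replicated_firstn {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# erasure-coded pools: indep keeps surviving shards in their positions
rule ec_indep {
    id 2
    type erasure
    step take default
    step chooseleaf indep 0 type host
    step emit
}
```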


>
>
> k
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about 'firstn|indep'

2018-08-22 Thread Konstantin Shalygin

Hi everyone,

I read an earlier thread [1] that made a good explanation on the 'step
choose|chooseleaf' option. Could someone further help me to understand
the 'firstn|indep' part? Also, what is the relationship between 'step
take' and 'step choose|chooseleaf' when it comes to define a failure
domain?

Thank you very much.



This documented on CRUSH Map Rules [1]


[1] 
http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#crush-map-rules




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about 'firstn|indep'

2018-08-21 Thread Cody
Hi everyone,

I read an earlier thread [1] that made a good explanation on the 'step
choose|chooseleaf' option. Could someone further help me to understand
the 'firstn|indep' part? Also, what is the relationship between 'step
take' and 'step choose|chooseleaf' when it comes to define a failure
domain?

Thank you very much.

Regards,
Cody

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-June/010370.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question on cluster balance and data distribution

2018-06-08 Thread Martin Palma
Hi all,

In our current production cluster we have the following CRUSH
hierarchy, see https://pastebin.com/640Q4XSH or the attached image.

This reflects 1:1 real physical deployment. We currently use also a
replica factor of 3 with the following CRUSH rule on our pools:

rule hdd_replicated {
  id 0
  type replicated
  min_size 1
  max_size 10
  step take hdd
  step chooseleaf firstn 0 type rack
  step emit
}

I can imagine that with such a design it's hard to achieve a good data
distribution, right? Moreover, we are currently getting a lot of "near
full" warnings for several OSDs and had one full OSD, even though the
cluster has more than 400 TB of free space available out of a total
cluster size of 1.2 PB.
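For what it's worth, a hedged sketch of how a candidate rule change can be
evaluated before applying it (the rule id and replica count below are
assumptions; the crushtool/ceph commands are the standard ones):

```
# current per-OSD fill level
ceph osd df tree

# test a modified CRUSH map offline
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt        # edit the rule, then recompile
crushtool -c crush.txt -o crush-new.bin
crushtool -i crush-new.bin --test --rule 0 --num-rep 3 --show-utilization
```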

So what could we consider:

1) Changing the crush rule above in away to achieve better data distribution?

2) Adding a third "dummy" datacenter with 2 "dummy" racks and
distribute the hosts evenly across all datacenters? Then use a crush
rule with "step chooseleaf firstn 0 type datacenter".

3) Adding a third "dummy" datacenter each with on Rack and then
distribute the hosts evenly across? Then use a crush rule with "step
chooseleaf firstn 0 type datacenter".

4) Other suggestions?

We a currently running on Luminous 12.2.4.

Thanks for any feedback...

Best,
Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question to avoid service stop when osd is full

2018-05-17 Thread 渥美 慶彦

Thank you, David.

I found "ceph osd pool set-quota" command.
I think using this command to SSD pool is useful to avoid the problem in 
quotation, isn't it?
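A hedged sketch of what that would look like (the pool name and the limit are
made up for illustration):

```
# cap the SSD pool well below the point where its OSDs would hit the full ratio
ceph osd pool set-quota ssd-volumes max_bytes $((800 * 1024**3))   # 800 GiB
ceph osd pool get-quota ssd-volumes
```

Once the quota is reached, writes to that pool are blocked while the rest of
the cluster keeps serving I/O.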


best regards
On 2018/04/10 5:22, David Turner wrote:

The proper way to prevent this is to set your full ratios safe and monitor
your disk usage.  That will allow you to either clean up old data or add
new storage before you get to 95 full on any OSDs.  What I mean by setting
your full ratios safe is that if your use case can fill 20% of your disk
space within a couple days, then having your warnings start at 75% is too
high because you can easily fill up the rest of your space within a couple
days and then need more storage before you have it ready.

There is no method to allow read-only while OSDs are full.
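A hedged sketch of the kind of monitoring and ratio tuning meant here (the
thresholds are examples only, and the set-*-ratio commands assume a Luminous
or newer cluster):

```
# watch per-OSD utilisation regularly
ceph osd df
ceph health detail

# keep a comfortable margin between the warning and the hard stop
ceph osd set-nearfull-ratio 0.75
ceph osd set-backfillfull-ratio 0.85
ceph osd set-full-ratio 0.95
```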

On Mon, Apr 9, 2018 at 6:58 AM 渥美 慶彦 
wrote:


Hi,

I have 2 questions.

I want to use Ceph as the volume backend for OpenStack by creating 2 Ceph pools.
One pool consists of OSDs on SSD, and the other consists of OSDs on HDD.
The storage capacity of the SSD pool is much smaller than that of the HDD pool,
so I want a configuration that does not stop all IO even if one OSD on
SSD becomes full.
Is this possible?

"osd full ratio" defaults to 0.95, and if one OSD becomes full, then
all IO will stop.
Is there any configuration that allows us read-only access while one or more
OSDs are full?

best regards,

--

Atsumi Yoshihiko
E-mail:atsumi.yoshih...@po.ntt-tx.co.jp



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Yoshihiko Atsumi
E-mail:atsumi.yoshih...@po.ntt-tx.co.jp


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
That's right. I didn't actually use Jewel for very long. I'm glad it worked
for you.

On Fri, May 11, 2018, 4:49 PM Webert de Souza Lima 
wrote:

> Thanks David.
> Although you mentioned this was introduced with Luminous, it's working
> with Jewel.
>
> ~# ceph osd pool stats
>
> Fri May 11 17:41:39 2018
>
> pool rbd id 5
>   client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr
>
> pool rbd_cache id 6
>   client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
>   cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing
>
> pool cephfs_metadata id 7
>   client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr
>
> pool cephfs_data_ssd id 8
>   client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr
>
> pool cephfs_data id 9
>   client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr
>
> pool cephfs_data_cache id 10
>   client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
>   cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Fri, May 11, 2018 at 5:14 PM David Turner 
> wrote:
>
>> `ceph osd pool stats` with the option to specify the pool you are
>> interested in should get you the breakdown of IO per pool.  This was
>> introduced with luminous.
>>
>> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> I think ceph doesn't have IO metrics will filters by pool right? I see
>>> IO metrics from clients only:
>>>
>>> ceph_client_io_ops
>>> ceph_client_io_read_bytes
>>> ceph_client_io_read_ops
>>> ceph_client_io_write_bytes
>>> ceph_client_io_write_ops
>>>
>>> and pool "byte" metrics, but not "io":
>>>
>>> ceph_pool(write/read)_bytes(_total)
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>> *IRC NICK - WebertRLZ*
>>>
>>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>>
 Hey Jon!

 On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

 Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
 sure what I should be looking at).
 My current SSD disks have 2 partitions.
  - One is used for cephfs cache tier pool,
  - The other is used for both:  cephfs meta-data pool and cephfs
 data-ssd (this is an additional cephfs data pool with only ssds with file
 layout for a specific direcotory to use it)

 Because of this, iostat shows me peaks of 12k IOPS in the metadata
 partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

 I have yet to measure it the right way but I'd assume my metadata fits
 in RAM (a few 100s of MB only).

 This is an email hosting cluster with dozens of thousands of users so
 there are a lot of random reads and writes, but not too many small files.
 Email messages are concatenated together in files up to 4MB in size
 (when a rotation happens).
 Most user operations are dovecot's INDEX operations and I will keep
 index directory in a SSD-dedicaded pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

 This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

 Saturarion will only happen in peak workloads, not often. By heavy
 write I mean there are much more writes than reads, yes.
 So I think I can start sharing the OSDs, if I think this is impacting
 performance I can just change the ruleset and move metadata to a SSD-only
 pool, right?


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
Thanks David.
Although you mentioned this was introduced with Luminous, it's working with
Jewel.

~# ceph osd pool stats

Fri May 11 17:41:39 2018

pool rbd id 5
  client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr

pool rbd_cache id 6
  client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
  cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing

pool cephfs_metadata id 7
  client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr

pool cephfs_data_ssd id 8
  client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr

pool cephfs_data id 9
  client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr

pool cephfs_data_cache id 10
  client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
  cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 5:14 PM David Turner  wrote:

> `ceph osd pool stats` with the option to specify the pool you are
> interested in should get you the breakdown of IO per pool.  This was
> introduced with luminous.
>
> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> I think ceph doesn't have IO metrics will filters by pool right? I see IO
>> metrics from clients only:
>>
>> ceph_client_io_ops
>> ceph_client_io_read_bytes
>> ceph_client_io_read_ops
>> ceph_client_io_write_bytes
>> ceph_client_io_write_ops
>>
>> and pool "byte" metrics, but not "io":
>>
>> ceph_pool(write/read)_bytes(_total)
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Hey Jon!
>>>
>>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>>
 It depends on the metadata intensity of your workload.  It might be
 quite interesting to gather some drive stats on how many IOPS are
 currently hitting your metadata pool over a week of normal activity.

>>>
>>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>>> sure what I should be looking at).
>>> My current SSD disks have 2 partitions.
>>>  - One is used for cephfs cache tier pool,
>>>  - The other is used for both:  cephfs meta-data pool and cephfs
>>> data-ssd (this is an additional cephfs data pool with only ssds with file
>>> layout for a specific direcotory to use it)
>>>
>>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>>> partition, but this could definitely be IO for the data-ssd pool.
>>>
>>>
 If you are doing large file workloads, and the metadata mostly fits in
 RAM, then the number of IOPS from the MDS can be very, very low.  On
 the other hand, if you're doing random metadata reads from a small
 file workload where the metadata does not fit in RAM, almost every
 client read could generate a read operation, and each MDS could easily
 generate thousands of ops per second.

>>>
>>> I have yet to measure it the right way but I'd assume my metadata fits
>>> in RAM (a few 100s of MB only).
>>>
>>> This is an email hosting cluster with dozens of thousands of users so
>>> there are a lot of random reads and writes, but not too many small files.
>>> Email messages are concatenated together in files up to 4MB in size
>>> (when a rotation happens).
>>> Most user operations are dovecot's INDEX operations and I will keep
>>> index directory in a SSD-dedicaded pool.
>>>
>>>
>>>
 Isolating metadata OSDs is useful if the data OSDs are going to be
 completely saturated: metadata performance will be protected even if
 clients are hitting the data OSDs hard.

>>>
>>> This seems to be the case.
>>>
>>>
 If "heavy write" means completely saturating the cluster, then sharing
 the OSDs is risky.  If "heavy write" just means that there are more
 writes than reads, then it may be fine if the metadata workload is not
 heavy enough to make good use of SSDs.

>>>
>>> Saturarion will only happen in peak workloads, not often. By heavy write
>>> I mean there are much more writes than reads, yes.
>>> So I think I can start sharing the OSDs, if I think this is impacting
>>> performance I can just change the ruleset and move metadata to a SSD-only
>>> pool, right?
>>>
>>>
 The way I'd summarise this is: in the general case, dedicated SSDs are
 the safe way to go -- they're intrinsically better suited to metadata.
 However, in some quite common special cases, the overall number of
 metadata ops is so low that the device doesn't matter.
>>>
>>>
>>>
>>> Thank you very much John!
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> Belo Horizonte - Brasil
>>> IRC NICK - WebertRLZ
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
`ceph osd pool stats` with the option to specify the pool you are
interested in should get you the breakdown of IO per pool.  This was
introduced with luminous.
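For example (the pool name is assumed):

```
# per-pool client IO breakdown, e.g. for the CephFS metadata pool
ceph osd pool stats cephfs_metadata
```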

On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima 
wrote:

> I think ceph doesn't have IO metrics will filters by pool right? I see IO
> metrics from clients only:
>
> ceph_client_io_ops
> ceph_client_io_read_bytes
> ceph_client_io_read_ops
> ceph_client_io_write_bytes
> ceph_client_io_write_ops
>
> and pool "byte" metrics, but not "io":
>
> ceph_pool(write/read)_bytes(_total)
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima 
> wrote:
>
>> Hey Jon!
>>
>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>
>>> It depends on the metadata intensity of your workload.  It might be
>>> quite interesting to gather some drive stats on how many IOPS are
>>> currently hitting your metadata pool over a week of normal activity.
>>>
>>
>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>> sure what I should be looking at).
>> My current SSD disks have 2 partitions.
>>  - One is used for cephfs cache tier pool,
>>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
>> (this is an additional cephfs data pool with only ssds with file layout for
>> a specific direcotory to use it)
>>
>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>> partition, but this could definitely be IO for the data-ssd pool.
>>
>>
>>> If you are doing large file workloads, and the metadata mostly fits in
>>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>>> the other hand, if you're doing random metadata reads from a small
>>> file workload where the metadata does not fit in RAM, almost every
>>> client read could generate a read operation, and each MDS could easily
>>> generate thousands of ops per second.
>>>
>>
>> I have yet to measure it the right way but I'd assume my metadata fits in
>> RAM (a few 100s of MB only).
>>
>> This is an email hosting cluster with dozens of thousands of users so
>> there are a lot of random reads and writes, but not too many small files.
>> Email messages are concatenated together in files up to 4MB in size (when
>> a rotation happens).
>> Most user operations are dovecot's INDEX operations and I will keep index
>> directory in a SSD-dedicaded pool.
>>
>>
>>
>>> Isolating metadata OSDs is useful if the data OSDs are going to be
>>> completely saturated: metadata performance will be protected even if
>>> clients are hitting the data OSDs hard.
>>>
>>
>> This seems to be the case.
>>
>>
>>> If "heavy write" means completely saturating the cluster, then sharing
>>> the OSDs is risky.  If "heavy write" just means that there are more
>>> writes than reads, then it may be fine if the metadata workload is not
>>> heavy enough to make good use of SSDs.
>>>
>>
>> Saturation will only happen in peak workloads, not often. By heavy write
>> I mean there are many more writes than reads, yes.
>> So I think I can start sharing the OSDs, if I think this is impacting
>> performance I can just change the ruleset and move metadata to a SSD-only
>> pool, right?
>>
>>
>>> The way I'd summarise this is: in the general case, dedicated SSDs are
>>> the safe way to go -- they're intrinsically better suited to metadata.
>>> However, in some quite common special cases, the overall number of
>>> metadata ops is so low that the device doesn't matter.
>>
>>
>>
>> Thank you very much John!
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>> IRC NICK - WebertRLZ
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
I think ceph doesn't have IO metrics with filters by pool, right? I see IO
metrics from clients only:

ceph_client_io_ops
ceph_client_io_read_bytes
ceph_client_io_read_ops
ceph_client_io_write_bytes
ceph_client_io_write_ops

and pool "byte" metrics, but not "io":

ceph_pool(write/read)_bytes(_total)

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima 
wrote:

> Hey Jon!
>
> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>
>> It depends on the metadata intensity of your workload.  It might be
>> quite interesting to gather some drive stats on how many IOPS are
>> currently hitting your metadata pool over a week of normal activity.
>>
>
> Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
> sure what I should be looking at).
> My current SSD disks have 2 partitions.
>  - One is used for cephfs cache tier pool,
>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
> (this is an additional cephfs data pool with only ssds with file layout for
> a specific directory to use it)
>
> Because of this, iostat shows me peaks of 12k IOPS in the metadata
> partition, but this could definitely be IO for the data-ssd pool.
>
>
>> If you are doing large file workloads, and the metadata mostly fits in
>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>> the other hand, if you're doing random metadata reads from a small
>> file workload where the metadata does not fit in RAM, almost every
>> client read could generate a read operation, and each MDS could easily
>> generate thousands of ops per second.
>>
>
> I have yet to measure it the right way but I'd assume my metadata fits in
> RAM (a few 100s of MB only).
>
> This is an email hosting cluster with tens of thousands of users so
> there are a lot of random reads and writes, but not too many small files.
> Email messages are concatenated together in files up to 4MB in size (when
> a rotation happens).
> Most user operations are dovecot's INDEX operations and I will keep index
> directory in an SSD-dedicated pool.
>
>
>
>> Isolating metadata OSDs is useful if the data OSDs are going to be
>> completely saturated: metadata performance will be protected even if
>> clients are hitting the data OSDs hard.
>>
>
> This seems to be the case.
>
>
>> If "heavy write" means completely saturating the cluster, then sharing
>> the OSDs is risky.  If "heavy write" just means that there are more
>> writes than reads, then it may be fine if the metadata workload is not
>> heavy enough to make good use of SSDs.
>>
>
> Saturation will only happen in peak workloads, not often. By heavy write I
> mean there are many more writes than reads, yes.
> So I think I can start sharing the OSDs, if I think this is impacting
> performance I can just change the ruleset and move metadata to a SSD-only
> pool, right?
>
>
>> The way I'd summarise this is: in the general case, dedicated SSDs are
>> the safe way to go -- they're intrinsically better suited to metadata.
>> However, in some quite common special cases, the overall number of
>> metadata ops is so low that the device doesn't matter.
>
>
>
> Thank you very much John!
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hey Jon!

On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
sure what I should be looking at).
My current SSD disks have 2 partitions.
 - One is used for cephfs cache tier pool,
 - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
(this is an additional cephfs data pool with only ssds with file layout for
a specific directory to use it)

Because of this, iostat shows me peaks of 12k IOPS in the metadata
partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

I have yet to measure it the right way but I'd assume my metadata fits in
RAM (a few 100s of MB only).

This is an email hosting cluster with tens of thousands of users so there
are a lot of random reads and writes, but not too many small files.
Email messages are concatenated together in files up to 4MB in size (when a
rotation happens).
Most user operations are dovecot's INDEX operations and I will keep index
directory in an SSD-dedicated pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

Saturation will only happen in peak workloads, not often. By heavy write I
mean there are many more writes than reads, yes.
So I think I can start sharing the OSDs, if I think this is impacting
performance I can just change the ruleset and move metadata to a SSD-only
pool, right?
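
For reference, on Luminous with device classes that move would look roughly like this
(rule and pool names are placeholders, and this assumes the SSD OSDs already report the
'ssd' device class):

    ceph osd crush rule create-replicated metadata-ssd default host ssd
    ceph osd pool set cephfs_metadata crush_rule metadata-ssd

after which the metadata PGs backfill onto the SSD OSDs.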


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.



Thank you very much John!
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread John Spray
On Wed, May 9, 2018 at 3:32 PM, Webert de Souza Lima
 wrote:
> Hello,
>
> Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used for
> cephfs-metadata, and HDD-only pools for cephfs-data. The current
> metadata/data ratio is something like 0,25% (50GB metadata for 20TB data).
>
> Regarding bluestore architecture, assuming I have:
>
>  - SSDs for WAL+DB
>  - Spinning Disks for bluestore data.
>
> would you recommend still store metadata in SSD-Only OSD nodes?

It depends on the metadata intensity of your workload.  It might be
quite interesting to gather some drive stats on how many IOPS are
currently hitting your metadata pool over a week of normal activity.

The primary reason for using SSDs for metadata is the cost-per-IOP.
SSDs are generally cheaper per operation than HDDs, so if you've got
enough IOPS to occupy an SSD then it's a no-brainer cost saving to use
SSDs (performance benefits are just a bonus).

If you are doing large file workloads, and the metadata mostly fits in
RAM, then the number of IOPS from the MDS can be very, very low.  On
the other hand, if you're doing random metadata reads from a small
file workload where the metadata does not fit in RAM, almost every
client read could generate a read operation, and each MDS could easily
generate thousands of ops per second.

> If not, is it recommended to dedicate some OSDs (Spindle+SSD for WAL/DB) for
> cephfs-metadata?

Isolating metadata OSDs is useful if the data OSDs are going to be
completely saturated: metadata performance will be protected even if
clients are hitting the data OSDs hard.

However, if your OSDs outnumber clients such that the clients couldn't
possibly saturate the OSDs, then you don't have this issue.

> If I just have 2 pools (metadata and data) all sharing the same OSDs in the
> cluster, would it be enough for heavy-write cases?

If "heavy write" means completely saturating the cluster, then sharing
the OSDs is risky.  If "heavy write" just means that there are more
writes than reads, then it may be fine if the metadata workload is not
heavy enough to make good use of SSDs.

The way I'd summarise this is: in the general case, dedicated SSDs are
the safe way to go -- they're intrinsically better suited to metadata.
However, in some quite common special cases, the overall number of
metadata ops is so low that the device doesn't matter.

John

> Assuming min_size=2, size=3.
>
> Thanks for your thoughts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question to avoid service stop when osd is full

2018-04-09 Thread David Turner
The proper way to prevent this is to set your full ratios safe and monitor
your disk usage.  That will allow you to either clean up old data or add
new storage before you get to 95 full on any OSDs.  What I mean by setting
your full ratios safe is that if your use case can fill 20% of your disk
space within a couple days, then having your warnings start at 75% is too
high because you can easily fill up the rest of your space within a couple
days and then need more storage before you have it ready.

There is no method to allow read-only while OSDs are full.
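
For example, on Luminous and later the ratios can be adjusted at runtime and per-OSD
usage watched with the following (the 0.75/0.85/0.95 values are only illustrative):

    ceph osd set-nearfull-ratio 0.75
    ceph osd set-backfillfull-ratio 0.85
    ceph osd set-full-ratio 0.95
    ceph osd df      # per-OSD utilisation, so you can act before anything hits nearfull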

On Mon, Apr 9, 2018 at 6:58 AM 渥美 慶彦 
wrote:

> Hi,
>
> I have 2 questions.
>
> I want to use ceph for OpenStack's volume backend by creating 2 ceph pools.
> One pool consists of osds on SSD, and the other consists of osds on HDD.
> The storage capacity of SSD pool is much smaller than that of HDD pool,
> so I want to make configuration not to stop all IO even if one osd on
> SSD becomes full.
> Is this possible?
>
> "osd full ratio" is default to 0.95, and if one osd becomes full, then
> all osd will stop.
> Is there any configuration to allow us to read-only while one or more
> osds are full?
>
> best regards,
>
> --
> 
> Atsumi Yoshihiko
> E-mail:atsumi.yoshih...@po.ntt-tx.co.jp
> 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question to avoid service stop when osd is full

2018-04-09 Thread 渥美 慶彦

Hi,

I have 2 questions.

I want to use ceph for OpenStack's volume backend by creating 2 ceph pools.
One pool consists of osds on SSD, and the other consists of osds on HDD.
The storage capacity of SSD pool is much smaller than that of HDD pool,
so I want to make configuration not to stop all IO even if one osd on 
SSD becomes full.

Is this possible?

"osd full ratio" is default to 0.95, and if one osd becomes full, then 
all osd will stop.
Is there any configuration to allow us to read-only while one or more 
osds are full?


best regards,

--

Atsumi Yoshihiko
E-mail:atsumi.yoshih...@po.ntt-tx.co.jp



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about Erasure-coding clusters and resiliency

2018-02-13 Thread Caspar Smit
Hi Tim,

With the current setup you can only handle 1 host failure without losing
any data, BUT everything will probably freeze until you bring the failed
node (or the OSDs in it) back up.

Your setup indicates k=6, m=2 and all 8 shards are distributed to 4 hosts
(2 shards/osds per host). Be aware that a pool which uses this erasure code
profile will have a min_size of 7! (min_size = k+1)
So this means in case of a node failure there are only 6 shards available
so no writes are then accepted to the pool -> freeze of i/o.

If you change the profile to k=5 and m=3 you can have a node failure
without freezing i/o. (min_size = 6)

If you want to sustain 2 node failures you must increase the m even further:

for instance k=7, m=5

step choose indep 6 type host
step choose indep 2 type osd

this will distribute the 12 (k+m) shards over your 6 hosts (2 shards per
host)

min_size = 8 so you can have 2 node failures without freezing i/o.
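
As a rough sketch, the k=5/m=3 profile itself would be created with something like this
(the profile name is a placeholder, and you would still combine it with your custom
2-shards-per-host crush rule rather than the plain host failure domain):

    ceph osd erasure-code-profile set ec53 k=5 m=3 crush-failure-domain=host
    ceph osd pool create ecpool-53 128 128 erasure ec53

Note that k/m of an existing pool can't be changed in place; you create a new pool with
the new profile and migrate the data.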

Caspar

2018-02-08 21:43 GMT+01:00 Tim Gipson :

> Hey all,
>
> We are trying to get an erasure coding cluster up and running but we are
> having a problem getting the cluster to remain up if we lose an OSD host.
>
> Currently we have 6 OSD hosts with 6 OSDs a piece.  I'm trying to build an
> EC profile and a crush rule that will allow the cluster to continue running
> if we lose a host, but I seem to misunderstand how the configuration of an
> EC pool/cluster is supposed to be implemented.  I would like to be able to
> set this up to allow for 2 host failures before data loss occurs.
>
> Here is my crush rule:
>
> {
> "rule_id": 2,
> "rule_name": "EC_ENA",
> "ruleset": 2,
> "type": 3,
> "min_size": 6,
> "max_size": 8,
> "steps": [
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "choose_indep",
> "num": 4,
> "type": "host"
> },
> {
> "op": "choose_indep",
> "num": 2,
> "type": "osd"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> Here is my EC profile:
>
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=6
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> Any direction or help would be greatly appreciated.
>
> Thanks,
>
> Tim Gipson
> Systems Engineer
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about Erasure-coding clusters and resiliency

2018-02-08 Thread Tim Gipson
Hey all,

We are trying to get an erasure coding cluster up and running but we are having 
a problem getting the cluster to remain up if we lose an OSD host.  

Currently we have 6 OSD hosts with 6 OSDs a piece.  I'm trying to build an EC 
profile and a crush rule that will allow the cluster to continue running if we 
lose a host, but I seem to misunderstand how the configuration of an EC 
pool/cluster is supposed to be implemented.  I would like to be able to set 
this up to allow for 2 host failures before data loss occurs.

Here is my crush rule:

{
"rule_id": 2,
"rule_name": "EC_ENA",
"ruleset": 2,
"type": 3,
"min_size": 6,
"max_size": 8,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "choose_indep",
"num": 4,
"type": "host"
},
{
"op": "choose_indep",
"num": 2,
"type": "osd"
},
{
"op": "emit"
}
]
}

Here is my EC profile:

crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Any direction or help would be greatly appreciated.

Thanks,

Tim Gipson
Systems Engineer

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on rbd resize

2018-01-03 Thread Richard Hesketh
No, most filesystems can be expanded pretty trivially (shrinking is a more 
complex operation but usually also doable). Assuming the likely case of an 
ext2/3/4 filesystem, the command "resize2fs /dev/rbd0" should resize the FS to 
cover the available space in the block device.
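
A minimal end-to-end sketch for the example in this thread (assuming the image is mapped
as /dev/rbd0 and formatted with ext4):

    rbd resize --size 10240 rbd/test    # grow the image to 10 GB (size is in MB)
    resize2fs /dev/rbd0                 # grow the filesystem to fill the device

For XFS the equivalent would be xfs_growfs run against the mount point.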

Rich

On 03/01/18 13:12, 13605702...@163.com wrote:
> hi Jason
> 
> the data won't be lost if i resize the filesystem in the image? 
> 
> thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on rbd resize

2018-01-03 Thread 13605702...@163.com
hi Jason

the data won't be lost if i resize the filesystem in the image? 

thanks



13605702...@163.com
 
From: Jason Dillaman
Date: 2018-01-03 20:57
To: 13605702...@163.com
CC: ceph-users
Subject: Re: [ceph-users] question on rbd resize
You need to resize the filesystem within the RBD block device.
 
On Wed, Jan 3, 2018 at 7:37 AM, 13605702...@163.com <13605702...@163.com> wrote:
> hi
>
> a rbd image is out of space (old size is 1GB), so i resize it to 10GB
>
> # rbd info rbd/test
> rbd image 'test':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.1169238e1f29
> format: 2
> features: layering
> flags:
>
> and then i remap and remount the image on the client, but the size of the
> image is still 1GB, and it is full !
> /dev/rbd0  1014M 1014M   20K 100% /mnt
>
> how can I resize the image without losing the data in the image?
>
> thanks
>
> 
> 13605702...@163.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
 
 
 
-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on rbd resize

2018-01-03 Thread Jason Dillaman
You need to resize the filesystem within the RBD block device.

On Wed, Jan 3, 2018 at 7:37 AM, 13605702...@163.com <13605702...@163.com> wrote:
> hi
>
> a rbd image is out of space (old size is 1GB), so i resize it to 10GB
>
> # rbd info rbd/test
> rbd image 'test':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.1169238e1f29
> format: 2
> features: layering
> flags:
>
> and then i remap and remount the image on the client, but the size of the
> image is still 1GB, and it is full !
> /dev/rbd0  1014M 1014M   20K 100% /mnt
>
> how can I resize the image without losing the data in the image?
>
> thanks
>
> 
> 13605702...@163.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question on rbd resize

2018-01-03 Thread 13605702...@163.com
hi 

a rbd image is out of space (old size is 1GB), so i resize it to 10GB

# rbd info rbd/test
rbd image 'test':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1169238e1f29
format: 2
features: layering
flags: 

and then i remap and remount the image on the client, but the size of the image 
is still 1GB, and it is full !
/dev/rbd0  1014M 1014M   20K 100% /mnt

how can I resize the image without losing the data in the image?

thanks



13605702...@163.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about librbd with qemu-kvm

2018-01-02 Thread Alexandre DERUMIER
It's not possible to use multiple threads per disk in qemu currently. (It's on 
the qemu roadmap.)

But you can create multiple disks/rbd images and use multiple qemu iothreads (one 
per disk).


(BTW, I'm able to reach around 70k iops max with 4k read, with 3,1ghz cpu, 
rbd_cache=none, disabling debug and cephx in ceph.conf)
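
A rough sketch of what that looks like on the qemu command line (image and id names are
just examples):

    -object iothread,id=iothread1 \
    -drive file=rbd:rbd/vm-disk1,format=raw,if=none,id=drive1,cache=none \
    -device virtio-blk-pci,drive=drive1,iothread=iothread1

Repeat with iothread2/drive2 for the second RBD image so that each disk gets its own
iothread.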


- Original Message -
From: "冷镇宇" <lengzhe...@ict.ac.cn>
To: "ceph-users" <ceph-us...@ceph.com>
Sent: Tuesday, 2 January 2018 04:01:39
Subject: [ceph-users] Question about librbd with qemu-kvm



Hi all, 

I am using librbd of Ceph10.2.0 with Qemu-kvm. When the virtual machine booted, 
I found that there is only one tp_librbd thread for one rbd image. Then the 
iops of 4KB reads for one rbd image is only 20,000. I'm wondering if there are 
any configuration options for librbd in qemu which can add librbd threads for one rbd 
image. Can someone help me? Thank you very much. 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about librbd with qemu-kvm

2018-01-01 Thread 冷镇宇
Hi all,

I am using librbd of Ceph10.2.0 with Qemu-kvm. When the virtual machine booted, 
I found that there is only one tp_librbd thread for one rbd image. Then the 
iops of 4KB reads for one rbd image is only 20,000. I'm wondering if there are 
any configuration options for librbd in qemu which can add librbd threads for one rbd 
image. Can someone help me? Thank you very much.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about BUG #11332

2017-12-04 Thread Gregory Farnum
On Thu, Nov 23, 2017 at 1:55 AM 许雪寒  wrote:

> Hi, everyone.
>
>  We also encountered this problem: http://tracker.ceph.com/issues/11332.
> And we found that this seems to be caused by the lack of mutual exclusion
> between applying "trim" and handling subscriptions. Since
> "build_incremental" operations doesn't go through the "PAXOS" procedure,
> and applying "trim" contains two phases, which are modifying "mondbstore"
> and updating "cached_first_committed", there could be a chance for
> "send_incremental" operations to happen between them. What's more,
> "build_incremental" operations also contain two phases, getting
> "cached_first_committed" and getting actual incrementals for MonDBStore.
> So, if "build_incremental" do happens concurrently with applying "trim", it
> could get an out-dated "cached_first_committed" and try to read a full map
> whose already trimmed.
>
> Is this right?
>

I don't think this is right. Keep in mind that the monitors are basically a
single-threaded event-driven machine. Both trimming and building
incrementals happen in direct response to receiving messages, in the main
dispatch loop, and while trimming is happening the PaxosService is not
readable. So it won't be invoking build_incremental() and they won't run
concurrently.
-Greg


>
> If it is, we think maybe all “READ” operations in monitor should be
> synchronized with paxos commit. Right? Should some kind of read-write
> locking mechanism be used here?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question pool usage vs. pool usage raw

2017-11-23 Thread Konstantin Shalygin
Next time, please use the "Reply All" button so that a copy of your 
message goes to the ML.



On 11/24/2017 12:45 AM, bernhard glomm wrote:
and is there ANY way to figure out how much space is being additional 
consumed by the snapshots at the moment (either by pool, preferable or 
by cluster?)


The way is: "rbd help disk-usage"
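
For example (pool and image names are placeholders):

    rbd disk-usage --pool volumes       # provisioned/used size per image and snapshot
    rbd disk-usage volumes/myimage      # a single image together with its snapshots

It is only fast when the fast-diff feature is enabled on the images; otherwise it has to
scan every object.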

--
Best regards,
Konstantin Shalygin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question pool usage vs. pool usage raw

2017-11-23 Thread Konstantin Shalygin

What is the difference between the "usage" and the "raw usage" of a pool?
Usage is your data. Raw is what your data actually uses with all 
copies (the pool 'size' option). I.e. if your data is 1000G and size is 3, your raw is 3000G.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question pool usage vs. pool usage raw

2017-11-23 Thread bernhard.glomm
this question has probably come up before but I couldn't find any clear 
answer so far.


What is the difference between the "usage" and the "raw usage" of a 
pool?


We are using only rbds (no plain rados, no cephfs for now).
At first glance the "sum of all rbd sizes" equals the "usage". 
Right?

(rbd size as given by 'rbd ls -l  | egrep -v ')
Even though the rbds are far from being filled with data, i.e.


Is the "raw usage" the "usage" + "space needed by all snapshots" (plus 
some overhead) in the pool?

(which unfortunately is NOT the sum of all 'rbd ls -l ')

Then the size of the pool would be "raw usage" + "available space"? Right?

In the results of "rados df" or "ceph df" the sum of usage per pool is - 
at the moment - less than 50% of the global usage.
And the global size of the cluster doesn't help me at all, since some 
pools are smaller, more expensive and much slower filling than others,
so I would trigger a notification on a much higher mark than on a huge 
cheap pool.
But I can't find a reliable way to figure out the size of a given pool 
(other than "usage + availabel", or is it "raw usage + available"?)


TIA

Bernhard


Ecologic Institut gemeinnuetzige GmbH
Pfalzburger Str. 43/44, D-10717 Berlin
Geschaeftsfuehrerin / Director: Dr. Camilla Bausch
Sitz der Gesellschaft / Registered Office: Berlin (Germany)
Registergericht / Court of Registration: Amtsgericht Berlin (Charlottenburg), 
HRB 57947
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about BUG #11332

2017-11-23 Thread 许雪寒
Hi, everyone.

 We also encountered this problem: http://tracker.ceph.com/issues/11332. And we 
found that this seems to be caused by the lack of mutual exclusion between 
applying "trim" and handling subscriptions. Since "build_incremental" 
operations doesn't go through the "PAXOS" procedure, and applying "trim" 
contains two phases, which are modifying "mondbstore" and updating 
"cached_first_committed", there could be a chance for "send_incremental" 
operations to happen between them. What's more, "build_incremental" operations 
also contain two phases, getting "cached_first_committed" and getting actual 
incrementals for MonDBStore. So, if "build_incremental" does happen concurrently 
with applying "trim", it could get an out-of-date "cached_first_committed" and 
try to read a full map that has already been trimmed.

Is this right?

If it is, we think maybe all “READ” operations in monitor should be 
synchronized with paxos commit. Right? Should some kind of read-write locking 
mechanism be used here?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question regarding filestore on Luminous

2017-09-25 Thread Alan Johnson
I am trying to compare FileStore performance against Bluestore. With Luminous 
12.2.0, Bluestore is working fine, but if I try to create a Filestore volume 
with a separate journal using Jewel-like syntax - "ceph-deploy osd create 
:sdb:nvme0n1" - device nvme0n1 is ignored and it sets up two partitions 
(similar to BlueStore) as shown below:
Number  Start   End SizeFile system  NameFlags
1  1049kB  106MB   105MB   xfs  ceph data
2  106MB   6001GB  6001GB   ceph block

Is this expected behavior or is FileStore no longer supported with Luminous?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about the Ceph's performance with spdk

2017-09-21 Thread Alejandro Comisario
Bump ! i saw this on the documentation for Bluestore also
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

Does anyone has any experience ?

On Thu, Jun 8, 2017 at 2:27 AM, Li,Datong  wrote:

> Hi all,
>
> I’m new to Ceph, and I wanted to find the exact performance report for
> Ceph’s spdk support, but I couldn’t find it. The main thing I want to know is the
> performance improvement before and after enabling spdk.
>
> Thanks,
> Datong Li
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about rbd-mirror

2017-06-29 Thread Jason Dillaman
On Wed, Jun 28, 2017 at 11:42 PM, YuShengzuo  wrote:
> Hi Jason Dillaman,
>
>
>
> I am using rbd-mirror now (release Jewel).
>
>
>
> 1.
>
> In many websites and other material introducing rbd-mirror, it is noted that the
> two ceph clusters should have the ‘same fsid’.
>
> But nothing bad or wrong happens when I deploy rbd-mirror between any two ceph clusters
>
> and choose one to be my OpenStack cinder backend and the other one to be
> the replication ceph cluster.
>
> It seems OK in simple test cases, such as doing failover.
>
> So what is the reason to recommend using the ‘same fsid’?

There is no requirement to have the same "fsid" between clusters. Did
you see that in any of our documentation? The only similar requirement
is that the pools be named the same between the two clusters.

>
> 2.
>
> Last week, the Luminous released where I read in ceph.com.
>
>
>
> I notice that some kind of HA support is mentioned in that blog.
>
> Is there any more detailed information about it?  Is it multi-rbd-mirror?
>

Yes, for Luminous, you can now run multiple rbd-mirror daemons
concurrently for built-in HA. The daemons will automatically elect one
alive process as the leader and will recover from the failure of the
leader automatically. Previously, bad things would happen if you
attempted to run multiple rbd-mirror daemon processes concurrently.
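
As a rough sketch (the cephx user names are made up), HA is then just a matter of
starting the daemon on more than one host against the same pools and peers:

    # host A
    rbd-mirror --id rbd-mirror.a
    # host B
    rbd-mirror --id rbd-mirror.b

One daemon is elected leader; if it dies, another takes over image replay automatically.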


>
> Hope to get your return, thanks.
>
>
>
>
>
> --
>
>
> Yu Shengzuo
>
> e-mail: yu.sheng...@99cloud.net
>
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about upgrading ceph clusters from Hammer to Jewel

2017-06-19 Thread 许雪寒
Hi, everyone.

I intend to upgrade one of our ceph clusters from Hammer to Jewel, I wonder in 
what order I should upgrade the MON, OSD and LIBRBD? Is there any problem to 
have some of these components running Hammer version while others running Jewel 
version? Do I have to upgrade QEMU as well to adapt to the Jewel version’s 
LIBRBD?

Thank you:)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about the Ceph's performance with spdk

2017-06-08 Thread Li,Datong
Hi all, 

I’m new to Ceph, and I wanted to find the exact performance report for 
Ceph’s spdk support, but I couldn’t find it. The main thing I want to know is the 
performance improvement before and after enabling spdk.

Thanks,
Datong Li

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about PGMonitor::waiting_for_finished_proposal

2017-06-01 Thread Joao Eduardo Luis

On 06/01/2017 05:35 AM, 许雪寒 wrote:

Hi, everyone.

Recently, I’m reading the source code of Monitor. I found that, in

the PGMonitor::prepare_pg_stats() method, a callback C_Stats is put into
PGMonitor::waiting_for_finished_proposal. I wonder, if a previous PGMap
incremental is in PAXOS's propose/accept phase at the moment C_Stats
is put into PGMonitor::waiting_for_finished_proposal, would this C_Stats
be called when that PGMap incremental's PAXOS procedure is complete and
PaxosService::_active() is invoked? If so, there exists the possibility
that an MPGStats request gets responded to before going through the PAXOS
procedure.


Is this right? Thank you:-)


Much like the other PaxosServices, the PGMonitor will only handle 
requests with potential side-effects (i.e., updates) if the service is 
writeable.


A precondition on being writeable is not having a PGMonitor proposal 
currently in progress. Other proposals, from other PaxosServices, may be 
happening, but not from PGMonitor.


When your request reaches PGMonitor::prepare_pg_stats(), it is 
guaranteed (except in case of unexpected behavior) that the service is 
not currently undergoing a proposal.


This means that when we queue C_Stats waiting for a finished proposal, 
it will be called back upon once the next proposal finishes.


We may bundle other update requests to PGMonitor (much like what happens 
on other PaxosServices) into the same proposal. In which case, all the 
callbacks that were waiting for a finished proposal will be woken up 
once the proposal is finished.


So, to answer your question, no.

  -Joao

P.S.: If you are curious to know how the writeable decision is made, 
check out PaxosServices::dispatch().


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about PGMonitor::waiting_for_finished_proposal

2017-05-31 Thread 许雪寒
Hi, everyone. 

Recently, I’m reading the source code of Monitor. I found that, in 
the PGMonitor::prepare_pg_stats() method, a callback C_Stats is put into 
PGMonitor::waiting_for_finished_proposal. I wonder, if a previous PGMap 
incremental is in PAXOS's propose/accept phase at the moment C_Stats is put 
into PGMonitor::waiting_for_finished_proposal, would this C_Stats be called 
when that PGMap incremental's PAXOS procedure is complete and 
PaxosService::_active() is invoked? If so, there exists the possibility that a 
MPGStats request gets responded to before going through the PAXOS procedure.

Is this right? Thank you:-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Question] RBD Striping

2017-04-28 Thread Jason Dillaman
Here is a background on Ceph striping [1]. By default, RBD will stripe
data with a stripe unit of 4MB and a stripe count of 1. Decreasing the
default RBD image object size will balloon the number of objects in
your backing Ceph cluster but will also result in less data to copy
during snapshot and clone CoW operations. Using "fancy" stripe
settings can improve performance under small, sequential IO operations
since the ops can be executing in parallel by multiple OSDs.


[1] http://docs.ceph.com/docs/master/architecture/#data-striping
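
As a quick sketch, an image with "fancy" striping can be created like this (the sizes
and image name are just examples):

    rbd create --size 10G --object-size 4M --stripe-unit 65536 --stripe-count 8 rbd/striped-img

With these settings each 4 MB object is filled in 64 KB stripe units, and consecutive
units are spread round-robin over 8 objects, so small sequential writes can hit up to 8
OSDs in parallel.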

On Thu, Apr 27, 2017 at 10:13 AM, Timofey Titovets  wrote:
> Hi, I found that the RBD striping documentation is not detailed enough.
> Can someone explain how RBD stripes its data over more objects, and
> why it's better to use striping instead of a small rbd object size?
>
> Also, if RBD uses an object size of 4MB by default, does that mean that every
> time an object is modified the OSD reads 4MB of data and replicates it,
> instead of only the changes?
> If yes, can striping help with that?
>
> Thanks for any answer
>
> --
> Have a nice day,
> Timofey.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Question] RBD Striping

2017-04-27 Thread Timofey Titovets
Hi, I found that the RBD striping documentation is not detailed enough.
Can someone explain how RBD stripes its data over more objects, and
why it's better to use striping instead of a small rbd object size?

Also, if RBD uses an object size of 4MB by default, does that mean that every
time an object is modified the OSD reads 4MB of data and replicates it,
instead of only the changes?
If yes, can striping help with that?

Thanks for any answer

-- 
Have a nice day,
Timofey.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about the OSD host option

2017-04-26 Thread Gregory Farnum
On Fri, Apr 21, 2017 at 12:07 PM, Fabian  wrote:
> Hi Everyone,
>
> I play a bit around with ceph on a test cluster with 3 servers (each MON
> and OSD at the same time).
> I use some self-written ansible rules to deploy the config and create
> the OSDs with ceph-disk. Because ceph-disk uses the next free OSD-ID, my
> ansible script is not aware which ID belongs to which OSD and host. So I
> don't create any [OSD.ID] section in my config and my cluster runs
> fine.
>
> Now I have read in [1] "the Ceph configuration file MUST specify the
> host for each daemon". As I consider each OSD as daemon, I'm a bit
> confused that it worked without the host specified.
>
> Why do the OSD daemon need the host option? What happened if it doesn't
> exist?

I think this statement is just an inaccurate leftover from the days of
ceph-init when you used a monolithic ceph.conf file and the startup
scripts parsed through that instead of looking for
appropriately-tagged directories/disks. :/
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about the OSD host option

2017-04-22 Thread Henrik Korkuc
mon.* and osd.* sections are not mandatory in config. So unless you want 
to set something per daemon, you can skip them completely.
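
As an illustration, a minimal ceph.conf along these lines is enough (all values here are
placeholders from a generic test setup):

    [global]
    fsid = <your cluster fsid>
    mon_initial_members = node1, node2, node3
    mon_host = 192.0.2.1, 192.0.2.2, 192.0.2.3
    public_network = 192.0.2.0/24

OSDs prepared with ceph-disk are found via their partition metadata at activation time,
so no per-daemon [osd.N] sections are required.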


On 17-04-21 19:07, Fabian wrote:

Hi Everyone,

I play a bit around with ceph on a test cluster with 3 servers (each MON
and OSD at the same time).
I use some self-written ansible rules to deploy the config and create
the OSDs with ceph-disk. Because ceph-disk uses the next free OSD-ID, my
ansible script is not aware which ID belongs to which OSD and host. So I
don't create any [OSD.ID] section in my config and my cluster runs
fine.

Now I have read in [1] "the Ceph configuration file MUST specify the
host for each daemon". As I consider each OSD as daemon, I'm a bit
confused that it worked without the host specified.

Why do the OSD daemon need the host option? What happened if it doesn't
exist?

Is there any best practice about naming the OSDs? Or a trick to avoid
the [OSD.ID] for each daemon?

[1]http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/#ceph-daemons

Thank you,

Fabian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about the OSD host option

2017-04-21 Thread Fabian
Hi Everyone,

I play a bit around with ceph on a test cluster with 3 servers (each MON
and OSD at the same time). 
I use some self-written ansible rules to deploy the config and create
the OSDs with ceph-disk. Because ceph-disk uses the next free OSD-ID, my
ansible script is not aware which ID belongs to which OSD and host. So I
don't create any [OSD.ID] section in my config and my cluster runs
fine. 

Now I have read in [1] "the Ceph configuration file MUST specify the
host for each daemon". As I consider each OSD as daemon, I'm a bit
confused that it worked without the host specified. 

Why do the OSD daemon need the host option? What happened if it doesn't
exist? 

Is there any best practice about naming the OSDs? Or a trick to avoid
the [OSD.ID] for each daemon?

[1]http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/#ceph-daemons

Thank you,

Fabian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about RadosGW subusers

2017-04-13 Thread Ben Hines
Based on past LTS release dates I would predict Luminous much sooner than
that, possibly even in May...  http://docs.ceph.com/docs/master/releases/

The docs also say "Spring" http://docs.ceph.com/docs/master/release-notes/

-Ben

On Thu, Apr 13, 2017 at 12:11 PM, <ceph.nov...@habmalnefrage.de> wrote:

> Thanks a lot, Trey.
>
> I'll try that stuff next week, once back from Easter holidays.
> And some "multi site" and "metasearch" is also still on my to-be-tested
> list. Need badly to free up some time for all the interesting "future of
> storage" things.
>
> BTW., we are on Kraken and I'd hope to see more of the new and shiny stuff
> here soon (something like 11.2.X) instead of waiting for Luminous late
> 2017. Not sure how the CEPH release policy is usually?!
>
> Anyhow, thanks and happy Easter everyone!
> Anton
>
>
> Sent: Thursday, 13 April 2017 at 20:15
> From: "Trey Palmer" <t...@mailchimp.com>
> To: ceph.nov...@habmalnefrage.de
> Cc: "Trey Palmer" <t...@mailchimp.com>, ceph-us...@ceph.com
> Subject: Re: [ceph-users] Question about RadosGW subusers
>
> Anton,
>
> It turns out that Adam Emerson is trying to get bucket policies and roles
> merged in time for Luminous:
>
> https://github.com/ceph/ceph/pull/14307
>
> Given this, I think we will only be using subusers temporarily as a method
> to track which human or service did what in which bucket.  This seems to us
> much easier than trying to deal with ACL's without any concept of groups,
> roles, or policies, in buckets that can often have millions of objects.
>
> Here is the general idea:
>
>
> 1.  Each bucket has a user ("master user"), but we don't use or issue that
> set of keys at all.
>
>
> radosgw-admin user create --uid=mybucket --display-name="My Bucket"
>
> You can of course have multiple buckets per user but so far for us it has
> been simple to have one user per bucket, with the username the same as the
> bucket name.   If a human needs access to more than one bucket, we will
> create multiple subusers for them.   That's not convenient, but it's
> temporary.
>
> So what we're doing is effectively making the user into the group, with
> the subusers being the users, and each user only capable of being in one
> group.   Very suboptimal, but better than the total chaos that would result
> from giving everyone the same set of keys for a given bucket.
>
>
> 2.  For each human user or service/machine user of that bucket, we create
> subusers.You can do this via:
>
> ## full-control ops user
> radosgw-admin subuser create --uid=mybucket --subuser=mybucket:alice
> --access=full --gen-access-key --gen-secret --key-type=s3
>
> ## write-only server user
> radosgw-admin subuser create --uid=mybucket --subuser=mybucket:daemon
> --access=write --gen-access-key --gen-secret-key --key-type=s3
>
> If you then do a "radosgw-admin metadata get user:mybucket", the JSON
> output contains the subusers and their keys.
>
>
> 3.  Raise the RGW log level in ceph.conf to make an "access key id" line
> available for each request, which you can then map to a subuser if/when you
> need to track who did what after the fact.  In ceph.conf:
>
> debug_rgw = 10/10
>
> This will cause the logs to be VERY verbose, an order of magnitude and
> some change more verbose than default.   We plan to discard most of the
> logs while feeding them into ElasticSearch.
>
> We might not need this much log verbosity once we have policies and are
> using unique users rather than subusers.
>
> Nevertheless, I hope we can eventually reduce the log level of the "access
> key id" line, as we have a pretty mainstream use case and I'm certain that
> tracking S3 request users will be required for many organizations for
> accounting and forensic purposes just as it is for us.
>
> -- Trey
>
> On Thu, Apr 13, 2017 at 1:29 PM, <ceph.nov...@habmalnefrage.de> wrote:
> Hey Trey.
>
> Sounds great, we were discussing the same kind of requirements and
> couldn't agree on/find something "useful"... so THANK YOU for sharing!!!
>
> It would be great if you could provide some more details or an example how
> you configure the "bucket user" and sub-users and all that stuff.
> Even more interesting for me, how do the "different ppl or services"
> access that buckets/objects afterwards?! I mean via which tools (s3cmd,
> boto, cyberduck, mix of some, ...) and are there any ACLs set/in use as
> well?!
>
> (sorry if this all sounds somehow dumb but I'm a just a novice ;) )
>
> best
> 

Re: [ceph-users] Question about RadosGW subusers

2017-04-13 Thread ceph . novice
Thanks a lot, Trey.

I'll try that stuff next week, once back from Easter holidays.
And some "multi site" and "metasearch" is also still on my to-be-tested list. 
Need badly to free up some time for all the interesting "future of storage" 
things.

BTW., we are on Kraken and I'd hope to see more of the new and shiny stuff here 
soon (something like 11.2.X) instead of waiting for Luminous late 2017. Not 
sure how the CEPH release policy is usually?!

Anyhow, thanks and happy Easter everyone!
Anton
 

Sent: Thursday, 13 April 2017 at 20:15
From: "Trey Palmer" <t...@mailchimp.com>
To: ceph.nov...@habmalnefrage.de
Cc: "Trey Palmer" <t...@mailchimp.com>, ceph-us...@ceph.com
Subject: Re: [ceph-users] Question about RadosGW subusers

Anton, 
 
It turns out that Adam Emerson is trying to get bucket policies and roles 
merged in time for Luminous:
 
https://github.com/ceph/ceph/pull/14307
 
Given this, I think we will only be using subusers temporarily as a method to 
track which human or service did what in which bucket.  This seems to us much 
easier than trying to deal with ACL's without any concept of groups, roles, or 
policies, in buckets that can often have millions of objects.
 
Here is the general idea:
 
 
1.  Each bucket has a user ("master user"), but we don't use or issue that set 
of keys at all.   

 
radosgw-admin user create --uid=mybucket --display-name="My Bucket"
 
You can of course have multiple buckets per user but so far for us it has been 
simple to have one user per bucket, with the username the same as the bucket 
name.   If a human needs access to more than one bucket, we will create 
multiple subusers for them.   That's not convenient, but it's temporary.  
 
So what we're doing is effectively making the user into the group, with the 
subusers being the users, and each user only capable of being in one group.   
Very suboptimal, but better than the total chaos that would result from giving 
everyone the same set of keys for a given bucket.
 
 
2.  For each human user or service/machine user of that bucket, we create 
subusers.    You can do this via: 
 
## full-control ops user
radosgw-admin subuser create --uid=mybucket --subuser=mybucket:alice 
--access=full --gen-access-key --gen-secret --key-type=s3
 
## write-only server user
radosgw-admin subuser create --uid=mybucket --subuser=mybucket:daemon 
--access=write --gen-access-key --gen-secret-key --key-type=s3
 
If you then do a "radosgw-admin metadata get user:mybucket", the JSON output 
contains the subusers and their keys.
 
 
3.  Raise the RGW log level in ceph.conf to make an "access key id" line 
available for each request, which you can then map to a subuser if/when you 
need to track who did what after the fact.  In ceph.conf:
 
debug_rgw = 10/10
 
This will cause the logs to be VERY verbose, an order of magnitude and some 
change more verbose than default.   We plan to discard most of the logs while 
feeding them into ElasticSearch.
 
We might not need this much log verbosity once we have policies and are using 
unique users rather than subusers. 
 
Nevertheless, I hope we can eventually reduce the log level of the "access key 
id" line, as we have a pretty mainstream use case and I'm certain that tracking 
S3 request users will be required for many organizations for accounting and 
forensic purposes just as it is for us.
 
    -- Trey
 
On Thu, Apr 13, 2017 at 1:29 PM, <ceph.nov...@habmalnefrage.de> wrote:
Hey Trey.

Sounds great, we were discussing the same kind of requirements and couldn't 
agree on/find something "useful"... so THANK YOU for sharing!!!

It would be great if you could provide some more details or an example how you 
configure the "bucket user" and sub-users and all that stuff.
Even more interesting for me, how do the "different ppl or services" access 
that buckets/objects afterwards?! I mean via which tools (s3cmd, boto, 
cyberduck, mix of some, ...) and are there any ACLs set/in use as well?!
 
(sorry if this all sounds somehow dumb but I'm a just a novice ;) )
 
best
 Anton
 

Sent: Tuesday, 11 April 2017 at 00:17
From: "Trey Palmer" <t...@mailchimp.com>
To: ceph-us...@ceph.com
Subject: [ceph-users] Question about RadosGW subusers

Probably a question for @yehuda :
 

We have fairly strict user accountability requirements.  The best way we have 
found to meet them with S3 object storage on Ceph is by using RadosGW subusers.
 
If we set up one user per bucket, then set up subusers to provide separate 
individual S3 keys and access rights for different people or services using 
that bucket, then we can track who did what via access key in the RadosGW logs 
(at debug_rgw = 10/10).
 
Of course, this is not a documented use case for subu

Re: [ceph-users] Question about RadosGW subusers

2017-04-13 Thread Trey Palmer
Anton,

It turns out that Adam Emerson is trying to get bucket policies and roles
merged in time for Luminous:

https://github.com/ceph/ceph/pull/14307

Given this, I think we will only be using subusers temporarily as a method
to track which human or service did what in which bucket.  This seems to us
much easier than trying to deal with ACL's without any concept of groups,
roles, or policies, in buckets that can often have millions of objects.

Here is the general idea:


1.  Each bucket has a user ("master user"), but we don't use or issue that
set of keys at all.

radosgw-admin user create --uid=mybucket --display-name="My Bucket"

You can of course have multiple buckets per user but so far for us it has
been simple to have one user per bucket, with the username the same as the
bucket name.   If a human needs access to more than one bucket, we will
create multiple subusers for them.   That's not convenient, but it's
temporary.

So what we're doing is effectively making the user into the group, with the
subusers being the users, and each user only capable of being in one group.
  Very suboptimal, but better than the total chaos that would result from
giving everyone the same set of keys for a given bucket.


2.  For each human user or service/machine user of that bucket, we create
subusers.You can do this via:

## full-control ops user
radosgw-admin subuser create --uid=mybucket --subuser=mybucket:alice
--access=full --gen-access-key --gen-secret --key-type=s3

## write-only server user
radosgw-admin subuser create --uid=mybucket --subuser=mybucket:daemon
--access=write --gen-access-key --gen-secret-key --key-type=s3

If you then do a "radosgw-admin metadata get user:mybucket", the JSON
output contains the subusers and their keys.


3.  Raise the RGW log level in ceph.conf to make an "access key id" line
available for each request, which you can then map to a subuser if/when you
need to track who did what after the fact.  In ceph.conf:

debug_rgw = 10/10

This will cause the logs to be VERY verbose, an order of magnitude and some
change more verbose than default.   We plan to discard most of the logs
while feeding them into ElasticSearch.

We might not need this much log verbosity once we have policies and are
using unique users rather than subusers.

Nevertheless, I hope we can eventually reduce the log level of the "access
key id" line, as we have a pretty mainstream use case and I'm certain that
tracking S3 request users will be required for many organizations for
accounting and forensic purposes just as it is for us.

-- Trey

On Thu, Apr 13, 2017 at 1:29 PM, <ceph.nov...@habmalnefrage.de> wrote:

> Hey Trey.
>
> Sounds great, we were discussing the same kind of requirements and
> couldn't agree on/find something "useful"... so THANK YOU for sharing!!!
>
> It would be great if you could provide some more details or an example how
> you configure the "bucket user" and sub-users and all that stuff.
> Even more interesting for me, how do the "different ppl or services"
> access that buckets/objects afterwards?! I mean via which tools (s3cmd,
> boto, cyberduck, mix of some, ...) and are there any ACLs set/in use as
> well?!
>
> (sorry if this all sounds somehow dumb but I'm a just a novice ;) )
>
> best
>  Anton
>
>
> Gesendet: Dienstag, 11. April 2017 um 00:17 Uhr
> Von: "Trey Palmer" <t...@mailchimp.com>
> An: ceph-us...@ceph.com
> Betreff: [ceph-users] Question about RadosGW subusers
>
> Probably a question for @yehuda :
>
>
> We have fairly strict user accountability requirements.  The best way we
> have found to meet them with S3 object storage on Ceph is by using RadosGW
> subusers.
>
> If we set up one user per bucket, then set up subusers to provide separate
> individual S3 keys and access rights for different people or services using
> that bucket, then we can track who did what via access key in the RadosGW
> logs (at debug_rgw = 10/10).
>
> Of course, this is not a documented use case for subusers.  I'm wondering
> if Yehuda or anyone else could estimate our risk of future incompatibility
> if we implement user/key management around subusers in this manner?
>
> Thanks,
>
> Trey
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about RadosGW subusers

2017-04-13 Thread ceph . novice
Hey Trey.

Sounds great, we were discussing the same kind of requirements and couldn't 
agree on/find something "useful"... so THANK YOU for sharing!!!

It would be great if you could provide some more details or an example how you 
configure the "bucket user" and sub-users and all that stuff.
Even more interesting for me, how do the "different ppl or services" access 
that buckets/objects afterwards?! I mean via which tools (s3cmd, boto, 
cyberduck, mix of some, ...) and are there any ACLs set/in use as well?!
 
(sorry if this all sounds somehow dumb but I'm a just a novice ;) )
 
best
 Anton
 

Sent: Tuesday, 11 April 2017 at 00:17
From: "Trey Palmer" <t...@mailchimp.com>
To: ceph-us...@ceph.com
Subject: [ceph-users] Question about RadosGW subusers

Probably a question for @yehuda :
 

We have fairly strict user accountability requirements.  The best way we have 
found to meet them with S3 object storage on Ceph is by using RadosGW subusers.
 
If we set up one user per bucket, then set up subusers to provide separate 
individual S3 keys and access rights for different people or services using 
that bucket, then we can track who did what via access key in the RadosGW logs 
(at debug_rgw = 10/10).
 
Of course, this is not a documented use case for subusers.  I'm wondering if 
Yehuda or anyone else could estimate our risk of future incompatibility if we 
implement user/key management around subusers in this manner?
 
Thanks,
 
Trey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about RadosGW subusers

2017-04-10 Thread Trey Palmer
Probably a question for @yehuda :

We have fairly strict user accountability requirements.  The best way we
have found to meet them with S3 object storage on Ceph is by using RadosGW
subusers.

If we set up one user per bucket, then set up subusers to provide separate
individual S3 keys and access rights for different people or services using
that bucket, then we can track who did what via access key in the RadosGW
logs (at debug_rgw = 10/10).

Of course, this is not a documented use case for subusers.  I'm wondering
if Yehuda or anyone else could estimate our risk of future incompatibility
if we implement user/key management around subusers in this manner?

Thanks,

Trey
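
For readers who want to reproduce this pattern, a minimal sketch of the setup described above might look like the following. The uid, subuser and display names are placeholders, and exact flags can vary by release, so check radosgw-admin help on your version:

  # one "account" user that owns the bucket
  radosgw-admin user create --uid=projectx --display-name="Project X bucket owner"

  # one subuser per person or service, each with its own S3 key pair
  radosgw-admin subuser create --uid=projectx --subuser=projectx:alice --access=full
  radosgw-admin key create --subuser=projectx:alice --key-type=s3 \
      --gen-access-key --gen-secret

  # lists the keys and which subuser each belongs to; the access key is what
  # you later grep for in the rgw log at debug_rgw = 10
  radosgw-admin user info --uid=projectx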
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
That’s interesting, the only time I have experienced unfound objects has also 
been related to snapshots and highly likely snap trimming. I had a number of 
OSDs start flapping under the load of snap trimming, and 2 of them on the same 
host died with an assert.

 

From memory the unfound objects were related to objects that were trimmed, so 
I could just delete them. I assume that when the PG is remapped/recovered, as 
the objects have already been removed on the other OSDs, it tries to roll back 
the transaction and fails, hence it wants the now-down OSDs to try and roll 
back???

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com] 
Sent: 30 March 2017 20:07
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question about unfound objects

 

One other thing to note with this experience is that we do a LOT of RBD snap 
trimming, like hundreds of millions of objects per day added to our snap_trimqs 
globally. All of the unfound objects in these cases were found on other OSDs in 
the cluster with identical contents, but associated with different snapshots. 
In other words, the file contents matched exactly, but the xattrs differed and 
the filenames indicated that the objects belonged to different snapshots.

 

Some of the unfound objects belonged to head, so I don't necessarily believe 
that they were in the process of being trimmed, but I imagine there is some 
possibility that this issue is related to snap trimming or deleting snapshots. 
Just more information...

 

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

 

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

 

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mounted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.

 

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

 

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:

Hi Steve,

 

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

 

Nick

 


Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
One other thing to note with this experience is that we do a LOT of RBD snap 
trimming, like hundreds of millions of objects per day added to our snap_trimqs 
globally. All of the unfound objects in these cases were found on other OSDs in 
the cluster with identical contents, but associated with different snapshots. 
In other words, the file contents matched exactly, but the xattrs differed and 
the filenames indicated that the objects belonged to different snapshots.

Some of the unfound objects belonged to head, so I don't necessarily believe 
that they were in the process of being trimmed, but I imagine there is some 
possibility that this issue is related to snap trimming or deleting snapshots. 
Just more information...

On Thu, 2017-03-30 at 17:13 +, Steve Taylor wrote:

Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mounted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.
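
A rough sketch of that salvage procedure, with made-up device names, image size and OSD id (and assuming a scratch cluster reachable via its own conf file; on older releases rbd may want the size given in MB), would be:

  # create and map a destination image in the scratch cluster
  rbd -c /etc/ceph/scratch.conf create recovery/osd-123-disk --size 8T
  rbd -c /etc/ceph/scratch.conf map recovery/osd-123-disk      # e.g. /dev/rbd0

  # copy the failing disk, skipping unreadable sectors instead of aborting
  dd if=/dev/sdX of=/dev/rbd0 bs=4M conv=noerror,sync

  # repair the copied filesystem and mount it where the dead OSD lived
  xfs_repair /dev/rbd0
  mount /dev/rbd0 /var/lib/ceph/osd/ceph-123

  # bring the OSD back with zero weight so it only serves up the unfound objects
  ceph osd crush reweight osd.123 0
  systemctl start ceph-osd@123     # or "service ceph start osd.123" on older releases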

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

Nick




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Ham

Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
Good suggestion, Nick. I actually did that at the time. The "ceph osd map" 
wasn't all that interesting because the OSDs had been outed and their PGs had 
been mapped to new OSDs. Everything appeared to be in order with the PGs being 
mapped to the right number of new OSDs. The PG mappings looked fine, but the 
objects just didn't exist anywhere except on the OSDs that had been marked out.

The PG queries were a little more useful, but still didn't really help in the 
end. In all cases (unfound objects from 2 OSDs in each of 2 occurrences), the 
PGs showed 5 or so OSDs where they thought the unfound objects might be, one of 
which was an OSD that had been marked out. In both cases we even waited until 
backfilling completed to see if perhaps the missing objects would turn up 
somewhere else, but none ever did.

In the first instance we were simply able to reattach the 2 OSDs to the cluster 
with 0 weight and recover the unfound objects. The second instance involved 
drive problems and was a little bit trickier. The drives had experienced errors 
and the XFS filesystems had both become corrupt and wouldn't even mount. We 
didn't have any spare drives large enough, so I ended up using dd, ignoring 
errors, to copy the disks to RBDs in a different Ceph cluster. I then kernel 
mapped the RBDs on the host with the failed drives, ran XFS repairs on them, 
mounted them to the OSD directories, started the OSDs, and put them back in the 
cluster with 0 weight. I was lucky enough that those objects were available and 
they were recovered. Of course I immediately removed those OSDs once the 
unfound objects cleared up.

That's the other interesting aspect of this problem. This cluster had 4TB HGST 
drives for its OSDs, but we had to expand it fairly urgently and didn't have 
enough drives. We added two new servers, each with 16 4TB drives and 16 8TB 
HGST He8 drives. In both instances the problems we encountered were with the 
8TB drives. We have since acquired more 4TB drives and have replaced all of the 
8TB drives in the cluster. We have a total of 8 production clusters globally 
and have been running Ceph in production for 2 years. These two occurrences 
recently are the only times we've seen these types of issues, and it was 
exclusive to the 8TB OSDs. I'm not sure how that would cause such a problem, 
but it's an interesting data point.

On Thu, 2017-03-30 at 17:33 +0100, Nick Fisk wrote:
Hi Steve,

If you can recreate or if you can remember the object name, it might be worth 
trying to run “ceph osd map” on the objects and see where it thinks they map 
to. And/or maybe pg query might show something?

Nick




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

Re: [ceph-users] Question about unfound objects

2017-03-30 Thread Nick Fisk
Hi Steve,

 

If you can recreate or if you can remember the object name, it might be worth 
trying to run "ceph osd map" on the objects and see
where it thinks they map to. And/or maybe pg query might show something?

 

Nick
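
Concretely, those checks look something like this; the pool, object and PG names below are examples only, and the exact output format varies a bit between releases:

  # which PG and OSDs does this object map to under the current osdmap?
  ceph osd map rbd rbd_data.1f2a3b4c5d6e.0000000000000042
  # -> osdmap eNNNN pool 'rbd' (0) object '...' -> pg 0.d4 -> up [12,87,301] acting [12,87,301]

  # ask the PG itself about its state and where the unfound objects might be
  ceph pg 0.d4 query | less        # look at recovery_state / might_have_unfound
  ceph pg 0.d4 list_missing        # lists missing/unfound objects and candidate OSDs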

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 30 March 2017 16:24
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question about unfound objects

 

We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about unfound objects

2017-03-30 Thread Steve Taylor
We've had a couple of puzzling experiences recently with unfound
objects, and I wonder if anyone can shed some light.

This happened with Hammer 0.94.7 on a cluster with 1,309 OSDs. Our use
case is exclusively RBD in this cluster, so it's naturally replicated.
The rbd pool size is 3, min_size is 2. The crush map is flat, so each
host is a failure domain. The OSD hosts are 4U Supermicro chassis with
32 OSDs each. Drive failures have caused the OSD count to be 1,309
instead of 1,312.

Twice in the last few weeks we've experienced issues where the cluster
was HEALTH_OK but was frequently getting some blocked requests. In each
of the two occurrences we investigated and discovered that the blocked
requests resulted from two drives in the same host that were
misbehaving (different set of 2 drives in each occurrence). We decided
to remove the misbehaving OSDs and let things backfill to see if that
would address the issue. Removing the drives resulted in a small number
of unfound objects, which was surprising. We were able to add the OSDs
back with 0 weight and recover the unfound objects in both cases, but
removing two OSDs from a single failure domain shouldn't have resulted
in unfound objects in an otherwise healthy cluster, correct?



Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Jason Dillaman
On Mon, Mar 20, 2017 at 6:49 PM, Alejandro Comisario
 wrote:
> Jason, thanks for the reply, you really got my question right.
> So, some doubts that might show that i lack of some general knowledge.
>
> When i read that someone is testing a ceph cluster with secuential 4k
> block writes, does that could happen inside a vm that is using an RBD
> backed OS ?

You can use some benchmarks directly against librbd (e.g. see fio's
rbd engine), some within a VM against an RBD-backed block device, and
some within a VM against a filesystem backed by an RBD-backed block
device.
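
For example, a 4k sequential-write run straight against librbd (assuming fio was built with rbd support; the pool, image and client names here are placeholders) might look like:

  fio --name=rbd-4k-write --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimg --rw=write --bs=4k --iodepth=32 --numjobs=1 --direct=1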

> In that case, should the vm's FS should be formated to allow 4K writes
>  so that the block level of the vm writes 4K down to the hypervisor ?
>
> In that case, asuming that i have a 9K mtu between the compute node
> and the ceph cluster.
> What is the default rados block size in whitch the objects are divided
> against the amount of information ?

MTU size (network maximum packet size) and the RBD block object size
are not interrelated.

>
> On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman  wrote:
>> It's a very broad question -- are you trying to determine something
>> more specific?
>>
>> Notionally, your DB engine will safely journal the changes to disk,
>> commit the changes to the backing table structures, and prune the
>> journal. Your mileage may vary depending on the specific DB engine and
>> its configuration settings.
>>
>> The VM's OS will send write requests addressed by block offset and
>> block counts (e.g. 512 blocks) through the block device hardware
>> (either a slower emulated block device or a faster paravirtualized
>> block device like virtio-blk/virtio-scsi). Within the internals of
>> QEMU, these block-addressed write requests will be delivered to librbd
>> in byte-addressed format (the blocks are converted to absolute byte
>> ranges).
>>
>> librbd will take the provided byte offset and length and quickly
>> calculate which backing RADOS objects are associated with the provided
>> range [1]. If the extent intersects multiple backing objects, the
>> sub-operation is sent to each affected object in parallel. These
>> operations will be sent to the OSDs responsible for handling the
>> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
>> maximum size of each IP packet -- larger MTUs allow you to send more
>> data within a single packet [2].
>>
>> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
>> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
>>
>>
>>
>> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>>  wrote:
>>> anyone ?
>>>
>>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>>  wrote:
 Hi, it's been a while since im using Ceph, and still im a little
 ashamed that when certain situation happens, i dont have the knowledge
 to explain or plan things.

 Basically what i dont know is, and i will do an exercise.

 EXCERCISE:
 a virtual machine running on KVM has an extra block device where the
 datafiles of a database runs (this block device is exposed to the vm
 using libvirt)

 facts.
 * the db writes to disk in 8K blocks
 * the connection between the phisical compute node and Ceph has an MTU of 
 1500
 * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
 * everything else is default

 So conceptually, if someone can explain me, what happens from the
 momment the DB contained on the VM commits to disk a query of
 20MBytes, what happens on the compute node, what happens on the
 client's file striping, what happens on the network (regarding
 packages, if other than creating 1500 bytes packages), what happens
 with rados objects, block sizes, etc.

 I would love to read this from the bests, mainly because as i said i
 dont understand all the workflow of blocks, objects, etc.

 thanks to everyone !

 --
 Alejandrito
>>>
>>>
>>>
>>> --
>>> Alejandro Comisario
>>> CTO | NUBELIU
>>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>>> _
>>> www.nubeliu.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> _
> www.nubeliu.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
Jason, thanks for the reply, you really got my question right.
So, some doubts that might show that i lack of some general knowledge.

When I read that someone is testing a ceph cluster with sequential 4k
block writes, could that happen inside a VM that is using an RBD-backed
OS?
In that case, should the VM's FS be formatted to allow 4K writes,
so that the block level of the VM writes 4K down to the hypervisor?

In that case, assuming that I have a 9K MTU between the compute node
and the ceph cluster:
what is the default size of the RADOS objects into which the data is
divided?


On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman  wrote:
> It's a very broad question -- are you trying to determine something
> more specific?
>
> Notionally, your DB engine will safely journal the changes to disk,
> commit the changes to the backing table structures, and prune the
> journal. Your mileage may vary depending on the specific DB engine and
> its configuration settings.
>
> The VM's OS will send write requests addressed by block offset and
> block counts (e.g. 512 blocks) through the block device hardware
> (either a slower emulated block device or a faster paravirtualized
> block device like virtio-blk/virtio-scsi). Within the internals of
> QEMU, these block-addressed write requests will be delivered to librbd
> in byte-addressed format (the blocks are converted to absolute byte
> ranges).
>
> librbd will take the provided byte offset and length and quickly
> calculate which backing RADOS objects are associated with the provided
> range [1]. If the extent intersects multiple backing objects, the
> sub-operation is sent to each affected object in parallel. These
> operations will be sent to the OSDs responsible for handling the
> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
> maximum size of each IP packet -- larger MTUs allow you to send more
> data within a single packet [2].
>
> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
>
>
>
> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>  wrote:
>> anyone ?
>>
>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>  wrote:
>>> Hi, it's been a while since im using Ceph, and still im a little
>>> ashamed that when certain situation happens, i dont have the knowledge
>>> to explain or plan things.
>>>
>>> Basically what i dont know is, and i will do an exercise.
>>>
>>> EXCERCISE:
>>> a virtual machine running on KVM has an extra block device where the
>>> datafiles of a database runs (this block device is exposed to the vm
>>> using libvirt)
>>>
>>> facts.
>>> * the db writes to disk in 8K blocks
>>> * the connection between the phisical compute node and Ceph has an MTU of 
>>> 1500
>>> * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
>>> * everything else is default
>>>
>>> So conceptually, if someone can explain me, what happens from the
>>> momment the DB contained on the VM commits to disk a query of
>>> 20MBytes, what happens on the compute node, what happens on the
>>> client's file striping, what happens on the network (regarding
>>> packages, if other than creating 1500 bytes packages), what happens
>>> with rados objects, block sizes, etc.
>>>
>>> I would love to read this from the bests, mainly because as i said i
>>> dont understand all the workflow of blocks, objects, etc.
>>>
>>> thanks to everyone !
>>>
>>> --
>>> Alejandrito
>>
>>
>>
>> --
>> Alejandro Comisario
>> CTO | NUBELIU
>> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com | Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Jason Dillaman
It's a very broad question -- are you trying to determine something
more specific?

Notionally, your DB engine will safely journal the changes to disk,
commit the changes to the backing table structures, and prune the
journal. Your mileage may vary depending on the specific DB engine and 
its configuration settings.

The VM's OS will send write requests addressed by block offset and
block counts (e.g. 512 blocks) through the block device hardware
(either a slower emulated block device or a faster paravirtualized
block device like virtio-blk/virtio-scsi). Within the internals of
QEMU, these block-addressed write requests will be delivered to librbd
in byte-addressed format (the blocks are converted to absolute byte
ranges).

librbd will take the provided byte offset and length and quickly
calculate which backing RADOS objects are associated with the provided
range [1]. If the extent intersects multiple backing objects, the
sub-operation is sent to each affected object in parallel. These
operations will be sent to the OSDs responsible for handling the
object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
maximum size of each IP packet -- larger MTUs allow you to send more
data within a single packet [2].

[1] http://docs.ceph.com/docs/master/architecture/#data-striping
[2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
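
To experiment with the striping parameters from the exercise, an image can be created with explicit stripe settings and then inspected. Pool and image names are placeholders, and very old rbd clients may expect --stripe-unit in bytes (2097152) rather than with a suffix:

  # default layout: the image is cut into 4 MiB RADOS objects
  rbd create rbd/dbvol --size 100G

  # "fancy" striping as in the exercise: 2048 kB stripe unit, stripe count of 4
  rbd create rbd/dbvol-striped --size 100G --stripe-unit 2048K --stripe-count 4

  rbd info rbd/dbvol-striped     # shows object size, stripe_unit and stripe_count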



On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
 wrote:
> anyone ?
>
> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>  wrote:
>> Hi, it's been a while since im using Ceph, and still im a little
>> ashamed that when certain situation happens, i dont have the knowledge
>> to explain or plan things.
>>
>> Basically what i dont know is, and i will do an exercise.
>>
>> EXCERCISE:
>> a virtual machine running on KVM has an extra block device where the
>> datafiles of a database runs (this block device is exposed to the vm
>> using libvirt)
>>
>> facts.
>> * the db writes to disk in 8K blocks
>> * the connection between the phisical compute node and Ceph has an MTU of 
>> 1500
>> * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
>> * everything else is default
>>
>> So conceptually, if someone can explain me, what happens from the
>> momment the DB contained on the VM commits to disk a query of
>> 20MBytes, what happens on the compute node, what happens on the
>> client's file striping, what happens on the network (regarding
>> packages, if other than creating 1500 bytes packages), what happens
>> with rados objects, block sizes, etc.
>>
>> I would love to read this from the bests, mainly because as i said i
>> dont understand all the workflow of blocks, objects, etc.
>>
>> thanks to everyone !
>>
>> --
>> Alejandrito
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> _
> www.nubeliu.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
anyone ?

On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
 wrote:
> Hi, it's been a while since im using Ceph, and still im a little
> ashamed that when certain situation happens, i dont have the knowledge
> to explain or plan things.
>
> Basically what i dont know is, and i will do an exercise.
>
> EXCERCISE:
> a virtual machine running on KVM has an extra block device where the
> datafiles of a database runs (this block device is exposed to the vm
> using libvirt)
>
> facts.
> * the db writes to disk in 8K blocks
> * the connection between the phisical compute node and Ceph has an MTU of 1500
> * the QEMU RBD driver uses a stipe unit of 2048 kB and a stripe count of 4.
> * everything else is default
>
> So conceptually, if someone can explain me, what happens from the
> momment the DB contained on the VM commits to disk a query of
> 20MBytes, what happens on the compute node, what happens on the
> client's file striping, what happens on the network (regarding
> packages, if other than creating 1500 bytes packages), what happens
> with rados objects, block sizes, etc.
>
> I would love to read this from the bests, mainly because as i said i
> dont understand all the workflow of blocks, objects, etc.
>
> thanks to everyone !
>
> --
> Alejandrito



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com | Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-17 Thread Alejandro Comisario
Hi, it's been a while since I've been using Ceph, and still I'm a little
ashamed that when a certain situation happens, I don't have the knowledge
to explain or plan things.

Basically what I don't know is the following, and I will do an exercise.

EXERCISE:
a virtual machine running on KVM has an extra block device where the
datafiles of a database live (this block device is exposed to the VM
using libvirt)

Facts:
* the db writes to disk in 8K blocks
* the connection between the physical compute node and Ceph has an MTU of 1500
* the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4.
* everything else is default

So conceptually, if someone can explain to me: what happens from the
moment the DB contained in the VM commits to disk a query of
20 MBytes? What happens on the compute node, what happens with the
client's file striping, what happens on the network (regarding
packets, if anything other than creating 1500-byte packets), what happens
with RADOS objects, block sizes, etc.

I would love to read this from the best, mainly because, as I said, I
don't understand the whole workflow of blocks, objects, etc.

Thanks to everyone!

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question regarding CRUSH algorithm

2017-02-17 Thread Richard Hesketh
On 16/02/17 20:44, girish kenkere wrote:
> Thanks David,
> 
> Its not quiet what i was looking for. Let me explain my question in more 
> detail -
> 
> This is excerpt from Crush paper, this explains how crush algo running on 
> each client/osd maps pg to an osd during the write operation[lets assume].
> 
> /"Tree buckets are structured as a weighted binary search tree with items at 
> the leaves. Each interior node knows the total weight of its left and right 
> subtrees and is labeled according to a fixed strategy (described below). In 
> order to select an item within a bucket, CRUSH starts at the root of the tree 
> and calculates the hash of the input key x, replica number r, the bucket 
> identifier, and the label at the current tree node (initially the root). The 
> result is compared to the weight ratio of the left and right subtrees to 
> decide which child node to visit next. This process is repeated until a leaf 
> node is reached, at which point the associated item in the bucket is chosen. 
> Only logn hashes and node comparisons are needed to locate an item.:"/
> 
>  My question is along the way the tree structure changes, weights of the 
> nodes change and some nodes even go away. In that case, how are future reads 
> lead to pg to same osd mapping? Its not cached anywhere, same algo runs for 
> every future read - what i am missing is how it picks the same osd[where data 
> resides] every time. With a modified crush map, won't we end up with 
> different leaf node if we apply same algo? 
> 
> Thanks
> Girish
> 
> On Thu, Feb 16, 2017 at 12:05 PM, David Turner <david.tur...@storagecraft.com 
> <mailto:david.tur...@storagecraft.com>> wrote:
> 
> As a piece to the puzzle, the client always has an up to date osd map 
> (which includes the crush map).  If it's out of date, then it has to get a 
> new one before it can request to read or write to the cluster.  That way the 
> client will never have old information and if you add or remove storage, the 
> client will always have the most up to date map to know where the current 
> copies of the files are.
> 
> This can cause slow downs in your cluster performance if you are updating 
> your osdmap frequently, which can be caused by deleting a lot of snapshots as 
> an example.

> 
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com 
> <mailto:ceph-users-boun...@lists.ceph.com>] on behalf of girish kenkere 
> [kngen...@gmail.com <mailto:kngen...@gmail.com>]
> *Sent:* Thursday, February 16, 2017 12:43 PM
> *To:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> *Subject:* [ceph-users] Question regarding CRUSH algorithm
> 
> Hi, I have a question regarding CRUSH algorithm - please let me know how 
> this works. CRUSH paper talks about how given an object we select OSD via two 
> mapping - first one is obj to PG and then PG to OSD. 
> 
> This PG to OSD mapping is something i dont understand. It uses pg#, 
> cluster map, and placement rules. How is it guaranteed to return correct OSD 
> for future reads after the cluster map/placement rules has changed due to 
> nodes coming and out?
> 
> Thanks
> Girish

I think there is confusion over when the CRUSH algorithm is being run. It's my 
understanding that the object->PG mapping is always dynamically computed, and 
that's pretty simple (hash the object ID, take it modulo [num_pgs in pool], 
prepend pool ID, 8.0b's your uncle), but the PG->OSD mapping is only computed 
when new PGs are created or the CRUSH map changes. The result of that 
computation is stored in the cluster map and then locating a particular PG is a 
matter of looking it up in the map, not recalculating its location - PG 
placement is pseudorandom and nondeterministic anyway
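
For anyone who wants to see what mapping the current osdmap produces, the standard tools recompute it on demand; the pool name, object name and PG id below are placeholders:

  ceph osd map rbd myobject       # object -> PG -> up/acting OSDs under the current map
  ceph pg map 0.47                # the same PG -> OSD mapping, asked for directly

  # replay the CRUSH calculation offline against a saved copy of the map
  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --test-map-pgs --pool 0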

Re: [ceph-users] Question regarding CRUSH algorithm

2017-02-16 Thread girish kenkere
Thanks David,

It's not quite what I was looking for. Let me explain my question in more
detail -

This is excerpt from Crush paper, this explains how crush algo running on
each client/osd maps pg to an osd during the write operation[lets assume].

*"Tree buckets are structured as a weighted binary search tree with items
at the leaves. Each interior node knows the total weight of its left and
right subtrees and is labeled according to a fixed strategy (described
below). In order to select an item within a bucket, CRUSH starts at the
root of the tree and calculates the hash of the input key x, replica number
r, the bucket identifier, and the label at the current tree node (initially
the root). The result is compared to the weight ratio of the left and right
subtrees to decide which child node to visit next. This process is repeated
until a leaf node is reached, at which point the associated item in the
bucket is chosen. Only logn hashes and node comparisons are needed to
locate an item.:"*

 My question is: along the way, the tree structure changes, weights of the
nodes change and some nodes even go away. In that case, how do future
reads lead to the same pg-to-osd mapping? It's not cached anywhere, the same algo
runs for every future read - what I am missing is how it picks the same
osd [where the data resides] every time. With a modified crush map, won't we end
up with a different leaf node if we apply the same algo?

Thanks
Girish

On Thu, Feb 16, 2017 at 12:05 PM, David Turner <
david.tur...@storagecraft.com> wrote:

> As a piece to the puzzle, the client always has an up to date osd map
> (which includes the crush map).  If it's out of date, then it has to get a
> new one before it can request to read or write to the cluster.  That way
> the client will never have old information and if you add or remove
> storage, the client will always have the most up to date map to know where
> the current copies of the files are.
>
> This can cause slow downs in your cluster performance if you are updating
> your osdmap frequently, which can be caused by deleting a lot of snapshots
> as an example.
>
> --
> David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> girish kenkere [kngen...@gmail.com]
> *Sent:* Thursday, February 16, 2017 12:43 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] Question regarding CRUSH algorithm
>
> Hi, I have a question regarding CRUSH algorithm - please let me know how
> this works. CRUSH paper talks about how given an object we select OSD via
> two mapping - first one is obj to PG and then PG to OSD.
>
> This PG to OSD mapping is something i dont understand. It uses pg#,
> cluster map, and placement rules. How is it guaranteed to return correct
> OSD for future reads after the cluster map/placement rules has changed due
> to nodes coming and out?
>
> Thanks
> Girish
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question regarding CRUSH algorithm

2017-02-16 Thread David Turner
As a piece to the puzzle, the client always has an up to date osd map (which 
includes the crush map).  If it's out of date, then it has to get a new one 
before it can request to read or write to the cluster.  That way the client 
will never have old information and if you add or remove storage, the client 
will always have the most up to date map to know where the current copies of 
the files are.

This can cause slow downs in your cluster performance if you are updating your 
osdmap frequently, which can be caused by deleting a lot of snapshots as an 
example.
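
A quick way to watch this in practice is to look at the osdmap epoch the cluster is currently handing out; these are standard ceph CLI commands and the epoch numbers are just examples:

  ceph osd stat            # one-line summary including the current osdmap epoch, e.g. "osdmap e48211"
  ceph osd dump | head -3  # epoch plus created/modified timestamps
  ceph -w                  # a stream of frequent "osdmap eNNNN" lines indicates heavy map churn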



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of girish 
kenkere [kngen...@gmail.com]
Sent: Thursday, February 16, 2017 12:43 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Question regarding CRUSH algorithm

Hi, I have a question regarding CRUSH algorithm - please let me know how this 
works. CRUSH paper talks about how given an object we select OSD via two 
mapping - first one is obj to PG and then PG to OSD.

This PG to OSD mapping is something i dont understand. It uses pg#, cluster 
map, and placement rules. How is it guaranteed to return correct OSD for future 
reads after the cluster map/placement rules has changed due to nodes coming and 
out?

Thanks
Girish
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question regarding CRUSH algorithm

2017-02-16 Thread girish kenkere
Hi, I have a question regarding CRUSH algorithm - please let me know how
this works. CRUSH paper talks about how given an object we select OSD via
two mapping - first one is obj to PG and then PG to OSD.

This PG to OSD mapping is something I don't understand. It uses the pg#, cluster
map, and placement rules. How is it guaranteed to return the correct OSD for
future reads after the cluster map/placement rules have changed due to nodes
coming in and out?

Thanks
Girish
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about user's key

2017-01-20 Thread Joao Eduardo Luis

On 01/20/2017 03:52 AM, Chen, Wei D wrote:

Hi,

I have read through some documents about authentication and user management 
about ceph, everything works fine with me, I can create
a user and play with the keys and caps of that user. But I cannot find where 
those keys or capabilities stored, obviously, I can
export those info to a file but where are they if I don't export them out?

Looks like these information (keys and caps) of the user is stored in memory? 
but I still can list them out after rebooting my
machine. Or these info are persisted in some type of DB I didn't aware?

Can anyone help me out?


Authentication keys and caps are kept by the monitor in its store, 
either a leveldb or a rocksdb, in its data directory.


The monitor's data directory is, by default, in 
/var/lib/ceph/mon/ceph-X, with X being the monitor's id. The store is 
within that directory, named `store.db`.


The store is not in human-readable format, but you can use 
ceph-kvstore-tool to walk the keys if you want. Please note that, should 
you want to do this, the monitor must be shut down first.


  -Joao
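
For reference, the usual online way to inspect keys and caps, plus a sketch of walking the raw monitor store (the mon id is a placeholder, and on some releases ceph-kvstore-tool does not take the leading backend argument):

  ceph auth list                  # every entity with its key and caps
  ceph auth get client.admin      # a single entity

  systemctl stop ceph-mon@mon01
  ceph-kvstore-tool leveldb /var/lib/ceph/mon/ceph-mon01/store.db list auth | head
  systemctl start ceph-mon@mon01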

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about user's key

2017-01-20 Thread Martin Palma
I don't know exactly where but I'm guessing in the database of the
monitor server which should be located at
"/var/lib/ceph/mon/".

Best,
Martin

On Fri, Jan 20, 2017 at 8:55 AM, Chen, Wei D  wrote:
> Hi Martin,
>
> Thanks for your response!
> Could you pls tell me where it is on the monitor nodes? only in the memory or 
> persisted in any files or DBs? Looks like it’s not just in memory but I 
> cannot find where those value saved, thanks!
>
> Best Regards,
> Dave Chen
>
> From: Martin Palma [mailto:mar...@palma.bz]
> Sent: Friday, January 20, 2017 3:36 PM
> To: Ceph-User; Chen, Wei D; ceph-de...@vger.kernel.org
> Subject: Re: Question about user's key
>
> Hi,
>
> They are stored on the monitore nodes.
>
> Best,
> Martin
>
> On Fri, 20 Jan 2017 at 04:53, Chen, Wei D  wrote:
> Hi,
>
>
>
> I have read through some documents about authentication and user management 
> about ceph, everything works fine with me, I can create
>
> a user and play with the keys and caps of that user. But I cannot find where 
> those keys or capabilities stored, obviously, I can
>
> export those info to a file but where are they if I don't export them out?
>
>
>
> Looks like these information (keys and caps) of the user is stored in memory? 
> but I still can list them out after rebooting my
>
> machine. Or these info are persisted in some type of DB I didn't aware?
>
>
>
> Can anyone help me out?
>
>
>
>
>
> Best Regards,
>
> Dave Chen
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about user's key

2017-01-19 Thread Martin Palma
Hi,

They are stored on the monitor nodes.

Best,
Martin

On Fri, 20 Jan 2017 at 04:53, Chen, Wei D  wrote:

> Hi,
>
>
>
> I have read through some documents about authentication and user
> management about ceph, everything works fine with me, I can create
>
> a user and play with the keys and caps of that user. But I cannot find
> where those keys or capabilities stored, obviously, I can
>
> export those info to a file but where are they if I don't export them out?
>
>
>
> Looks like these information (keys and caps) of the user is stored in
> memory? but I still can list them out after rebooting my
>
> machine. Or these info are persisted in some type of DB I didn't aware?
>
>
>
> Can anyone help me out?
>
>
>
>
>
> Best Regards,
>
> Dave Chen
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question: can I use rbd 0.80.7 with ceph cluster version 10.2.5?

2016-12-21 Thread Stéphane Klein
Hi,

I have this issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/015216.html

Question: can I use rbd 0.80.7 with ceph cluster version 10.2.5?

Why I use this old version? Because I use Atomic Project
http://www.projectatomic.io/

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about last_backfill

2016-11-06 Thread xxhdx1985126
Hi, everyone.


In the PGLog::merge_log method, pg log entries in "olog" are inserted into 
the current PGLog's "missing" structure only when they have a "version" larger than 
the current PGLog's head and their target object has an "soid" less than the current pg 
info's last_backfill. What does "last_backfill" in pg_info_t mean? Is it the 
max object id that the pg possessed after the last recovery_backfill process? 
If so, why are only objects with "soid" less than "last_backfill" considered 
missing? What if new objects were created or modified by other osds 
while the current osd was "out" or "down"?


I'm really confused about this, please help me, thank you:-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about PG class

2016-11-02 Thread xxhdx1985126
Hi, everyone.


What are the meanings of the fields  actingbackfill, want_acting and 
backfill_targets  of the PG class?
Thank you:-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Wes Dillingham
You might want to have a look at this:
https://github.com/camptocamp/ceph-rbd-backup/blob/master/ceph-rbd-backup.py

I have a bash implementation of this, but it basically boils down to
wrapping what Peter said: an export-diff to stdout piped to an
import-diff on a different cluster. The "transfer" node is a client of
both clusters and simply iterates over all rbd devices, snapshotting
them daily, and exporting the diff between today's snap and yesterday's
snap and layering that diff onto a sister rbd on the remote side.
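
A minimal sketch of such a daily transfer loop, assuming the transfer node has a conf file per cluster and that each destination image was seeded once beforehand (the conf paths, pool and snapshot naming are placeholders, not the actual script):

  #!/bin/bash
  SRC=/etc/ceph/site-a.conf
  DST=/etc/ceph/site-b.conf
  POOL=rbd
  TODAY=$(date +%F)
  YESTERDAY=$(date -d yesterday +%F)

  for IMG in $(rbd -c "$SRC" -p "$POOL" ls); do
      rbd -c "$SRC" snap create "$POOL/$IMG@$TODAY"
      if rbd -c "$SRC" snap ls "$POOL/$IMG" | grep -qw "$YESTERDAY"; then
          # incremental: only the delta between yesterday's and today's snap
          rbd -c "$SRC" export-diff --from-snap "$YESTERDAY" "$POOL/$IMG@$TODAY" - \
              | rbd -c "$DST" import-diff - "$POOL/$IMG"
      else
          # first run for this image: send everything up to today's snap
          rbd -c "$SRC" export-diff "$POOL/$IMG@$TODAY" - \
              | rbd -c "$DST" import-diff - "$POOL/$IMG"
      fi
  done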


On Tue, Nov 1, 2016 at 5:23 AM, Peter Maloney
 wrote:
> On 11/01/16 10:22, Peter Maloney wrote:
>
> On 11/01/16 06:57, xxhdx1985126 wrote:
>
> Hi, everyone.
>
> I'm trying to write a program based on the librbd API that transfers
> snapshot diffs between ceph clusters without the need for a temporary
> storage which is required if I use the "rbd export-diff" and "rbd
> import-diff" pair.
>
>
> You don't need a temp file for this... eg.
>
>
> oops forgot the "-" in the commands corrected:
>
> ssh node1 rbd export-diff rbd/blah@snap1 - | rbd import-diff - rbd/blah
> ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 - | rbd
> import-diff - rbd/blah
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Peter Maloney
On 11/01/16 10:22, Peter Maloney wrote:
> On 11/01/16 06:57, xxhdx1985126 wrote:
>> Hi, everyone.
>>
>> I'm trying to write a program based on the librbd API that transfers
>> snapshot diffs between ceph clusters without the need for a temporary
>> storage which is required if I use the "rbd export-diff" and "rbd
>> import-diff" pair.
>
> You don't need a temp file for this... eg.
>
>
oops forgot the "-" in the commands corrected:
> ssh node1 rbd export-diff rbd/blah@snap1 - | rbd import-diff - rbd/blah
> ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 - | rbd
> import-diff - rbd/blah
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Peter Maloney
On 11/01/16 06:57, xxhdx1985126 wrote:
> Hi, everyone.
>
> I'm trying to write a program based on the librbd API that transfers
> snapshot diffs between ceph clusters without the need for a temporary
> storage which is required if I use the "rbd export-diff" and "rbd
> import-diff" pair.

You don't need a temp file for this... eg.


ssh node1 rbd export-diff rbd/blah@snap1 | rbd import-diff rbd/blah
ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 | rbd
import-diff rbd/blah
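
(As corrected in the follow-up, the "-" for stdout/stdin is needed.) When the two ends really are separate clusters, each rbd invocation can also be pointed at its own cluster with -c/--conf (or --cluster), so nothing has to share a global configuration; the conf paths here are placeholders:

  rbd -c /etc/ceph/site-a.conf export-diff --from-snap snap1 rbd/blah@snap2 - \
      | rbd -c /etc/ceph/site-b.conf import-diff - rbd/blah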

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-10-31 Thread xxhdx1985126
Hi, everyone.


I'm trying to write a program based on the librbd API that transfers snapshot 
diffs between ceph clusters without the need for a temporary storage which is 
required if I use the "rbd export-diff" and "rbd import-diff" pair. I found 
that the configuration object "g_conf" and ceph context object "g_ceph_context" 
are global variables which are used almost everywhere in the source code, while 
what I need ot do in the first place is to construct two or more configuration 
objects, each corresponding to a ceph cluster, and make those operations 
intended to a ceph cluster use the corresponding configuration object. 


How can I accomplish this task? Or, is it just not viable? Thank you:-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about OSDSuperblock

2016-10-22 Thread xxhdx1985126
Title: 缃戞槗閭
Sorry, sir. I don't quite follow you. I agree that the osds must get the current map to know who to contact so it can catch up. But it looks to me that the osd is getting the current map through get_map(superblock.current_epoch) in which the content of the variable superblock.current_epoch is read from the disk by OSD::read_superblock and has't been updated by a monitor at boot time, which means it is not the real curent epoch but an old one. How can OSD get the current map using an old epoch?



On xxhdx1985126 , Oct 23, 2016 3:13 AM wrote:



Sorry, sir. I don't quite follow you. I agree that the osds must get the current map to know who to contact so it can catch up. But it looks to me that the osd is getting the current map through get_map(superblock.current_epoch) in which the variable superblock.current_epoch is read from the disk by OSD::read_superblock at boot time and has't been updated by a monitor, which means it is not the real curent epoch. How can OSD get the current map using an old epoch?



Sent from my Mi phoneOn David Turner , Oct 23, 2016 12:34 AM wrote:







The osd needs to know where it thought data was, in particular so it knows what it has. Then it gets the current map so it knows who to talk to so it can catch back up.

Sent from my iPhone

On Oct 22, 2016, at 7:12 AM, xxhdx1985126  wrote:





Hi, everyone.


I'm trying to read the source code that boots an OSD instance, and I find something really overwhelms me.
In the OSD::init() method, it read the OSDSuperblock by calling OSD::read_superblock(), and the it tried to get the "current" map : "osdmap = get_map(superblock.current_epoch)". Then OSD uses this osdmap to calculate the acting and up set of each pg.聽
I really don't understand this! Since the OSDSuperblock is read from the disk, the superblock.current_epoch should be an old epoch which is recorded by the last OSD instance that run on this directory. Why use an old "current_epoch" to calculate the acting
 and up set of each pg?


Please help me, thank you:-)




David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this
message is prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about OSDSuperblock

2016-10-22 Thread David Turner
The osd needs to know where it thought data was, in particular so it knows what 
it has. Then it gets the current map so it knows who to talk to so it can catch 
back up.
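
To picture the sequence, here is a toy model in C; it is not the OSD code,
every function and number in it is invented, and it exists only to show that
the on-disk epoch is just the starting point for the catch-up:
---
/* Toy model of the boot/catch-up sequence being discussed. Not Ceph source. */
#include <stdio.h>

typedef unsigned long epoch_t;

static epoch_t read_superblock_epoch(void) { return 42; } /* last epoch seen  */
static epoch_t monitor_current_epoch(void) { return 47; } /* real current one */

int main(void)
{
    /* 1. Boot with the old map: it tells the OSD what it thought it stored. */
    epoch_t have = read_superblock_epoch();
    printf("booting with last-known map epoch %lu\n", have);

    /* 2. Ask the monitors for every newer map and replay them in order;
     *    the acting/up sets are recomputed as each map is applied.          */
    for (epoch_t e = have + 1; e <= monitor_current_epoch(); e++) {
        printf("applying incremental map %lu\n", e);
        have = e;
    }
    printf("caught up at epoch %lu\n", have);
    return 0;
}
---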

Sent from my iPhone

On Oct 22, 2016, at 7:12 AM, xxhdx1985126 wrote:

Hi, everyone.

I'm trying to read the source code that boots an OSD instance, and I found 
something that really confuses me.
In the OSD::init() method, it reads the OSDSuperblock by calling 
OSD::read_superblock(), and then it tries to get the "current" map: "osdmap = 
get_map(superblock.current_epoch)". Then the OSD uses this osdmap to calculate 
the acting and up set of each pg.
I really don't understand this! Since the OSDSuperblock is read from the disk, 
the superblock.current_epoch should be an old epoch which was recorded by the 
last OSD instance that ran on this directory. Why use an old "current_epoch" to 
calculate the acting and up set of each pg?

Please help me, thank you:-)







David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about OSDSuperblock

2016-10-22 Thread xxhdx1985126
Hi, everyone.


I'm trying to read the source code that boots an OSD instance, and I found 
something that really confuses me.
In the OSD::init() method, it reads the OSDSuperblock by calling 
OSD::read_superblock(), and then it tries to get the "current" map: "osdmap = 
get_map(superblock.current_epoch)". Then the OSD uses this osdmap to calculate 
the acting and up set of each pg.
I really don't understand this! Since the OSDSuperblock is read from the disk, 
the superblock.current_epoch should be an old epoch which was recorded by the 
last OSD instance that ran on this directory. Why use an old "current_epoch" to 
calculate the acting and up set of each pg?


Please help me, thank you :-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on RGW MULTISITE and librados

2016-09-23 Thread Paul Nimbley
That’s what we inferred from reading but wanted to be sure the replication was 
occurring at the RGW layer and not the RADOS layer.  We haven't yet had a 
chance to test out multisite since we only have a single test cluster set up at 
the moment.  On the topic of rgw multisite, may I ask a few more questions?  
Is there information available anywhere as to how the functionality handles WAN 
latency and throughput?  Is there a way to configure throttling on the data 
replication?  Also is there any way to tell when objects/pools are in sync 
between two clusters/zones?

Thanks,
Paul

-Original Message-
From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com] 
Sent: Friday, September 23, 2016 10:44 AM
To: Paul Nimbley <paul.nimb...@infinite.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question on RGW MULTISITE and librados

On Thu, Sep 22, 2016 at 1:52 PM, Paul Nimbley <paul.nimb...@infinite.com> wrote:
> Fairly new to ceph so please excuse any misused terminology.  We’re 
> currently exploring the use of ceph as a replacement storage backend 
> for an existing application.  The existing application has 2 
> requirements which seemingly can be met individually by using librados 
> and the Ceph Object Gateway multisite support, but it seems cannot be met 
> together.  These are:
>
>
>
> 1.   The ability to append to an existing object and read from any
> offset/length within the object, (the librados API allows this, the S3 
> and Swift APIs do not appear to support this).
>
> 2.   The ability to replicate data between 2 geographically separate
> locations.  I.e. 2 separate ceph clusters using the multisite support 
> of the Ceph Object Gateway to replicate between them.
>
>
>
> Specifically we’re testing using librados to write directly to the 
> object store because we need the ability to append to objects, which 
> using the librados API allows.  However if one writes to the object 
> store directly using the librados API is it correct to assume that 
> those objects will not be replicated to the other zone by the Ceph 
> Object Gateway since its being taken out of the data path?
>

The rgw multisite feature is an rgw only feature, and as such it doesn't apply 
to raw rados object operations. The rados gateway only handles its own data's 
replication, and it depends on its internal data structures and its different 
mechanics, so for raw rados replication there needs to be a different system in 
place.

Yehuda

>
>
> Thanks,
>
> Paul
>
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended 
> solely for the use of the addressee(s).
> If you are not the intended recipient, please notify so to the sender 
> by e-mail and delete the original message.
> In such cases, please notify us immediately at i...@infinite.com . 
> Further, you are not to copy, disclose, or distribute this e-mail or 
> its contents to any unauthorized
> person(s) .Any such actions are
> considered unlawful. This e-mail may contain viruses. Infinite has 
> taken every reasonable precaution to minimize this risk, but is not 
> liable for any damage you may sustain as a result of any virus in this 
> e-mail. You should carry out your own virus checks before opening the 
> e-mail or attachments.
> Infinite reserves the right to monitor and review the content of all 
> messages sent to or from this e-mail address.
> Messages sent to or from this e-mail
> address may be stored on the Infinite e-mail system.
> ***INFINITE End of DisclaimerINFINITE
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on RGW MULTISITE and librados

2016-09-23 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 22, 2016 at 1:52 PM, Paul Nimbley  wrote:
> Fairly new to ceph so please excuse any misused terminology.  We’re
> currently exploring the use of ceph as a replacement storage backend for an
> existing application.  The existing application has 2 requirements which
> seemingly can be met individually by using librados and the Ceph Object
> Gateway multisite support, but it seems cannot be met together.  These are:
>
>
>
> 1.   The ability to append to an existing object and read from any
> offset/length within the object, (the librados API allows this, the S3 and
> Swift APIs do not appear to support this).
>
> 2.   The ability to replicate data between 2 geographically separate
> locations.  I.e. 2 separate ceph clusters using the multisite support of the
> Ceph Object Gateway to replicate between them.
>
>
>
> Specifically we’re testing using librados to write directly to the object
> store because we need the ability to append to objects, which using the
> librados API allows.  However if one writes to the object store directly
> using the librados API is it correct to assume that those objects will not
> be replicated to the other zone by the Ceph Object Gateway since its being
> taken out of the data path?
>

The rgw multisite feature is an rgw only feature, and as such it
doesn't apply to raw rados object operations. The rados gateway only
handles its own data's replication, and it depends on its internal
data structures and its different mechanics, so for raw rados
replication there needs to be a different system in place.

Yehuda

>
>
> Thanks,
>
> Paul
>
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely
> for the use of the addressee(s).
> If you are not the intended recipient, please notify so to the sender by
> e-mail and delete the original message.
> In such cases, please notify us immediately at i...@infinite.com . Further,
> you are not to copy,
> disclose, or distribute this e-mail or its contents to any unauthorized
> person(s) .Any such actions are
> considered unlawful. This e-mail may contain viruses. Infinite has taken
> every reasonable precaution to minimize
> this risk, but is not liable for any damage you may sustain as a result of
> any virus in this e-mail. You should
> carry out your own virus checks before opening the e-mail or attachments.
> Infinite reserves the right to monitor
> and review the content of all messages sent to or from this e-mail address.
> Messages sent to or from this e-mail
> address may be stored on the Infinite e-mail system.
> ***INFINITE End of DisclaimerINFINITE
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question on RGW MULTISITE and librados

2016-09-22 Thread Paul Nimbley
Fairly new to ceph so please excuse any misused terminology.  We're currently 
exploring the use of ceph as a replacement storage backend for an existing 
application.  The existing application has 2 requirements which seemingly can 
be met individually by using librados and the Ceph Object Gateway multisite 
support, but it seems cannot be met together.  These are:


1.   The ability to append to an existing object and read from any 
offset/length within the object, (the librados API allows this, the S3 and 
Swift APIs do not appear to support this).

2.   The ability to replicate data between 2 geographically separate 
locations.  I.e. 2 separate ceph clusters using the multisite support of the 
Ceph Object Gateway to replicate between them.

Specifically we're testing using librados to write directly to the object store 
because we need the ability to append to objects, which using the librados API 
allows.  However if one writes to the object store directly using the librados 
API is it correct to assume that those objects will not be replicated to the 
other zone by the Ceph Object Gateway since its being taken out of the data 
path?
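
For reference, requirement 1 above boils down to a couple of librados calls; a
minimal sketch (the pool, object name and config path are placeholders, and
error checking is omitted) might look like this:
---
/* Minimal sketch: append to a RADOS object and read back from an offset --
 * the capability discussed above that the S3/Swift front ends do not expose. */
#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    char buf[16] = {0};

    rados_create(&cluster, NULL);                      /* default client.admin */
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);
    rados_ioctx_create(cluster, "mypool", &io);

    /* Appends simply grow the object. */
    rados_append(io, "myobject", "hello ", 6);
    rados_append(io, "myobject", "world", 5);

    /* Read 5 bytes starting at offset 6 -> "world". */
    int r = rados_read(io, "myobject", buf, 5, 6);
    printf("read %d bytes: %s\n", r, buf);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}
---
Writes made this way go straight to RADOS and bypass radosgw entirely, which is
exactly why rgw multisite (an rgw-level feature, as noted in the replies) will
never see or replicate them.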

Thanks,
Paul
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
for the use of the addressee(s). If you are not the intended recipient, please 
notify so to the sender by e-mail and delete the original message. In such 
cases, please notify us immediately at i...@infinite.com . Further, you are not 
to copy, disclose, or distribute this e-mail or its contents to any 
unauthorized person(s). Any such actions are considered unlawful. This e-mail 
may contain viruses. Infinite has taken every reasonable precaution to minimize 
this risk, but is not liable for any damage you may sustain as a result of any 
virus in this e-mail. You should carry out your own virus checks before opening 
the e-mail or attachments. Infinite reserves the right to monitor and review 
the content of all messages sent to or from this e-mail address. Messages sent 
to or from this e-mail address may be stored on the Infinite e-mail system. 
***INFINITE End of DisclaimerINFINITE 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Guillaume Comte
OK, I will try without creating them by myself.

Nevertheless, thanks a lot Christian for your patience; I will try more
clever questions when I'm ready for them.

On 5 Aug 2016 at 02:44, "Christian Balzer" wrote:

Hello,

On Fri, 5 Aug 2016 02:41:47 +0200 Guillaume Comte wrote:

> Maybe you are mispelling, but in the docs they dont use white space but :
> this is quite misleading if it works
>
I'm quoting/showing "ceph-disk", which is called by ceph-deploy, which
indeed uses a ":".

Christian
> On 5 Aug 2016 at 02:30, "Christian Balzer" wrote:
>
> >
> > Hello,
> >
> > On Fri, 5 Aug 2016 02:11:31 +0200 Guillaume Comte wrote:
> >
> > > I am reading half your answer
> > >
> > > Do you mean that ceph will create by itself the partitions for the
> > journal?
> > >
> > Yes, "man ceph-disk".
> >
> > > If so its cool and weird...
> > >
> > It can be very weird indeed.
> > If sdc is your data (OSD) disk and sdb your journal device then:
> >
> > "ceph-disk prepare /dev/sdc /dev/sdb1"
> > will not work, but:
> >
> > "ceph-disk prepare /dev/sdc /dev/sdb"
> > will and create a journal partition on sdb.
> > However you have no control over numbering or positioning this way.
> >
> > Christian
> >
> > > On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > you need to work on your google skills. ^_-
> > > >
> > > > I wrote about his just yesterday and if you search for "ceph-deploy
> > wrong
> > > > permission" the second link is the issue description:
> > > > http://tracker.ceph.com/issues/13833
> > > >
> > > > So I assume your journal partitions are either pre-made or non-GPT.
> > > >
> > > > Christian
> > > >
> > > > On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > With ceph jewel,
> > > > >
> > > > > I'm pretty stuck with
> > > > >
> > > > >
> > > > > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> > > > >
> > > > > Because when i specify a journal path like this:
> > > > > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > > > > And then:
> > > > > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > > > > I end up with "wrong permission" on the osd when activating,
> > complaining
> > > > > about "tmp" directory where the files are owned by root, and it
> > seems it
> > > > > tryes to do stuff as ceph user.
> > > > >
> > > > > It works when i don't specify a separate journal
> > > > >
> > > > > Any idea of what i'm doing wrong ?
> > > > >
> > > > > thks
> > > >
> > > >
> > > > --
> > > > Christian BalzerNetwork/Systems Engineer
> > > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > >
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Christian Balzer
Hello,

On Fri, 5 Aug 2016 02:41:47 +0200 Guillaume Comte wrote:

> Maybe you are mispelling, but in the docs they dont use white space but :
> this is quite misleading if it works
>
I'm quoting/showing "ceph-disk", which is called by ceph-deploy, which
indeed uses a ":".
 
Christian
> On 5 Aug 2016 at 02:30, "Christian Balzer" wrote:
> 
> >
> > Hello,
> >
> > On Fri, 5 Aug 2016 02:11:31 +0200 Guillaume Comte wrote:
> >
> > > I am reading half your answer
> > >
> > > Do you mean that ceph will create by itself the partitions for the
> > journal?
> > >
> > Yes, "man ceph-disk".
> >
> > > If so its cool and weird...
> > >
> > It can be very weird indeed.
> > If sdc is your data (OSD) disk and sdb your journal device then:
> >
> > "ceph-disk prepare /dev/sdc /dev/sdb1"
> > will not work, but:
> >
> > "ceph-disk prepare /dev/sdc /dev/sdb"
> > will and create a journal partition on sdb.
> > However you have no control over numbering or positioning this way.
> >
> > Christian
> >
> > > On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > you need to work on your google skills. ^_-
> > > >
> > > > I wrote about his just yesterday and if you search for "ceph-deploy
> > wrong
> > > > permission" the second link is the issue description:
> > > > http://tracker.ceph.com/issues/13833
> > > >
> > > > So I assume your journal partitions are either pre-made or non-GPT.
> > > >
> > > > Christian
> > > >
> > > > On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > With ceph jewel,
> > > > >
> > > > > I'm pretty stuck with
> > > > >
> > > > >
> > > > > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> > > > >
> > > > > Because when i specify a journal path like this:
> > > > > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > > > > And then:
> > > > > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > > > > I end up with "wrong permission" on the osd when activating,
> > complaining
> > > > > about "tmp" directory where the files are owned by root, and it
> > seems it
> > > > > tryes to do stuff as ceph user.
> > > > >
> > > > > It works when i don't specify a separate journal
> > > > >
> > > > > Any idea of what i'm doing wrong ?
> > > > >
> > > > > thks
> > > >
> > > >
> > > > --
> > > > Christian BalzerNetwork/Systems Engineer
> > > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > >
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Guillaume Comte
Maybe you are misspelling it, but in the docs they don't use white space but ":";
this is quite misleading if it works.

On 5 Aug 2016 at 02:30, "Christian Balzer" wrote:

>
> Hello,
>
> On Fri, 5 Aug 2016 02:11:31 +0200 Guillaume Comte wrote:
>
> > I am reading half your answer
> >
> > Do you mean that ceph will create by itself the partitions for the
> journal?
> >
> Yes, "man ceph-disk".
>
> > If so its cool and weird...
> >
> It can be very weird indeed.
> If sdc is your data (OSD) disk and sdb your journal device then:
>
> "ceph-disk prepare /dev/sdc /dev/sdb1"
> will not work, but:
>
> "ceph-disk prepare /dev/sdc /dev/sdb"
> will and create a journal partition on sdb.
> However you have no control over numbering or positioning this way.
>
> Christian
>
> > On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:
> >
> > >
> > > Hello,
> > >
> > > you need to work on your google skills. ^_-
> > >
> > > I wrote about his just yesterday and if you search for "ceph-deploy
> wrong
> > > permission" the second link is the issue description:
> > > http://tracker.ceph.com/issues/13833
> > >
> > > So I assume your journal partitions are either pre-made or non-GPT.
> > >
> > > Christian
> > >
> > > On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
> > >
> > > > Hi All,
> > > >
> > > > With ceph jewel,
> > > >
> > > > I'm pretty stuck with
> > > >
> > > >
> > > > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> > > >
> > > > Because when i specify a journal path like this:
> > > > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > > > And then:
> > > > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > > > I end up with "wrong permission" on the osd when activating,
> complaining
> > > > about "tmp" directory where the files are owned by root, and it
> seems it
> > > > tryes to do stuff as ceph user.
> > > >
> > > > It works when i don't specify a separate journal
> > > >
> > > > Any idea of what i'm doing wrong ?
> > > >
> > > > thks
> > >
> > >
> > > --
> > > Christian BalzerNetwork/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Christian Balzer

Hello,

On Fri, 5 Aug 2016 02:11:31 +0200 Guillaume Comte wrote:

> I am reading half your answer
> 
> Do you mean that ceph will create by itself the partitions for the journal?
>
Yes, "man ceph-disk".
 
> If so its cool and weird...
>
It can be very weird indeed.
If sdc is your data (OSD) disk and sdb your journal device then:

"ceph-disk prepare /dev/sdc /dev/sdb1" 
will not work, but:

"ceph-disk prepare /dev/sdc /dev/sdb"
will and create a journal partition on sdb. 
However you have no control over numbering or positioning this way.
 
Christian

> On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:
> 
> >
> > Hello,
> >
> > you need to work on your google skills. ^_-
> >
> > I wrote about his just yesterday and if you search for "ceph-deploy wrong
> > permission" the second link is the issue description:
> > http://tracker.ceph.com/issues/13833
> >
> > So I assume your journal partitions are either pre-made or non-GPT.
> >
> > Christian
> >
> > On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
> >
> > > Hi All,
> > >
> > > With ceph jewel,
> > >
> > > I'm pretty stuck with
> > >
> > >
> > > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> > >
> > > Because when i specify a journal path like this:
> > > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > > And then:
> > > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > > I end up with "wrong permission" on the osd when activating, complaining
> > > about "tmp" directory where the files are owned by root, and it seems it
> > > tryes to do stuff as ceph user.
> > >
> > > It works when i don't specify a separate journal
> > >
> > > Any idea of what i'm doing wrong ?
> > >
> > > thks
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Guillaume Comte
I am reading half of your answer.

Do you mean that ceph will create the partitions for the journal by itself?

If so, it's cool and weird...

On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:

>
> Hello,
>
> you need to work on your google skills. ^_-
>
> I wrote about his just yesterday and if you search for "ceph-deploy wrong
> permission" the second link is the issue description:
> http://tracker.ceph.com/issues/13833
>
> So I assume your journal partitions are either pre-made or non-GPT.
>
> Christian
>
> On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
>
> > Hi All,
> >
> > With ceph jewel,
> >
> > I'm pretty stuck with
> >
> >
> > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> >
> > Because when i specify a journal path like this:
> > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > And then:
> > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > I end up with "wrong permission" on the osd when activating, complaining
> > about "tmp" directory where the files are owned by root, and it seems it
> > tryes to do stuff as ceph user.
> >
> > It works when i don't specify a separate journal
> >
> > Any idea of what i'm doing wrong ?
> >
> > thks
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Guillaume Comte
Yeah, you are right.

From what I understand, using a ceph user is a good idea.

But the fact is that it doesn't work.

So I circumvented that by configuring ceph-deploy to use root.

Was that the main goal? I don't think so.

Thanks for your answer

On 5 Aug 2016 at 02:01, "Christian Balzer" wrote:

>
> Hello,
>
> you need to work on your google skills. ^_-
>
> I wrote about his just yesterday and if you search for "ceph-deploy wrong
> permission" the second link is the issue description:
> http://tracker.ceph.com/issues/13833
>
> So I assume your journal partitions are either pre-made or non-GPT.
>
> Christian
>
> On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:
>
> > Hi All,
> >
> > With ceph jewel,
> >
> > I'm pretty stuck with
> >
> >
> > ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> >
> > Because when i specify a journal path like this:
> > ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> > And then:
> > ceph-deploy osd activate ceph-osd1:sdd:sdf7
> > I end up with "wrong permission" on the osd when activating, complaining
> > about "tmp" directory where the files are owned by root, and it seems it
> > tryes to do stuff as ceph user.
> >
> > It works when i don't specify a separate journal
> >
> > Any idea of what i'm doing wrong ?
> >
> > thks
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Christian Balzer

Hello,

you need to work on your google skills. ^_-

I wrote about this just yesterday, and if you search for "ceph-deploy wrong
permission" the second link is the issue description:
http://tracker.ceph.com/issues/13833

So I assume your journal partitions are either pre-made or non-GPT.

Christian

On Thu, 4 Aug 2016 15:34:44 +0200 Guillaume Comte wrote:

> Hi All,
> 
> With ceph jewel,
> 
> I'm pretty stuck with
> 
> 
> ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
> 
> Because when i specify a journal path like this:
> ceph-deploy osd prepare ceph-osd1:sdd:sdf7
> And then:
> ceph-deploy osd activate ceph-osd1:sdd:sdf7
> I end up with "wrong permission" on the osd when activating, complaining
> about "tmp" directory where the files are owned by root, and it seems it
> tryes to do stuff as ceph user.
> 
> It works when i don't specify a separate journal
> 
> Any idea of what i'm doing wrong ?
> 
> thks


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about ceph-deploy osd create

2016-08-04 Thread Guillaume Comte
Hi All,

With ceph jewel,

I'm pretty stuck with


ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]

Because when I specify a journal path like this:
ceph-deploy osd prepare ceph-osd1:sdd:sdf7
And then:
ceph-deploy osd activate ceph-osd1:sdd:sdf7
I end up with "wrong permission" on the osd when activating, complaining
about a "tmp" directory where the files are owned by root, and it seems it
tries to do stuff as the ceph user.

It works when I don't specify a separate journal.

Any idea of what I'm doing wrong?

thks
-- 
Guillaume Comte
06 25 85 02 02  | guillaume.co...@blade-group.com

90 avenue des Ternes, 75 017 Paris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on Sequential Write performance at 4K blocksize

2016-07-13 Thread Christian Balzer

Hello,

On Wed, 13 Jul 2016 18:15:10 + EP Komarla wrote:

> Hi All,
> 
> Have a question on the performance of sequential write @ 4K block sizes.
> 
Which version of Ceph?
Any significant ceph.conf modifications?

> Here is my configuration:
> 
> Ceph Cluster: 6 Nodes. Each node with :-
> 20x HDDs (OSDs) - 10K RPM 1.2 TB SAS disks
> SSDs - 4x - Intel S3710, 400GB; for OSD journals shared across 20 HDDs (i.e., 
> SSD journal ratio 1:5)
> 
> Network:
> - Client network - 10Gbps
> - Cluster network - 10Gbps
> - Each node with dual NIC - Intel 82599 ES - driver version 4.0.1
> 
> Traffic generators:
> 2 client servers - running on dual Intel sockets with 16 physical cores (32 
> cores with hyper-threading enabled)
> 
Are you mounting a RBD image on those servers via the kernel interface and
if so which kernel version?
Are you running the test inside a VM on those servers, or are you using
the RBD ioengine with fio?

> Test program:
> FIO - sequential read/write; random read/write

Exact fio command line please.


> Blocksizes - 4k, 32k, 256k...
> FIO - Number of jobs = 32; IO depth = 64
> Runtime = 10 minutes; Ramptime = 5 minutes
> Filesize = 4096g (5TB)
> 
> I observe that my sequential write performance at 4K block size is very low - 
> I am getting around 6MB/sec bandwidth.  The performance improves 
> significantly at larger block sizes (shown below)
> 
This is to some extent expected and normal.
You can see this behavior on local storage as well, just not as
pronounced.

Your main enemy here is latency, each write potentially needs to be sent
to the storage server(s, replication!) and then ACK'ed back to the client.

If your fio command line has sync writes (aka direct=1) things will be the
worst.
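
(Rough arithmetic with assumed numbers, just to put a scale on this: if every
direct, serialized 4KB write costs on the order of 1ms for the network round
trip plus replication ACK, a single queue tops out around 1,000 IOPS, i.e.
roughly 4MB/s, which is the order of magnitude reported below. Larger block
sizes move far more data per round trip, and buffered or deeply queued writes
overlap that latency, which is why those numbers look so much better.)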

Small IOPs also stress your CPUs; look at atop on your storage nodes
during a 4KB fio run. 
That might also show other issues (as in overloaded HDDs/SSDs).

RBD caching (is it enabled on your clients?) can help with non-direct
writes.

That all being said, if I run this fio inside a VM (with RBD caching
enabled) against a cluster here with 4 nodes connected by QDR (40Gb/s)
Infiniband, 4x100GB DC S3700 and 8x plain SATA HDDs, I get:
---
# fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=write --name=fiojob --blocksize=4K --iodepth=32 

  write: io=4096.0MB, bw=134274KB/s, iops=33568 , runt= 31237msec

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=134273KB/s, minb=134273KB/s, maxb=134273KB/s, 
mint=31237msec, maxt=31237msec
---

And with buffered I/O (direct=0) I get:
---
  write: io=4096.0MB, bw=359194KB/s, iops=89798 , runt= 11677msec


Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=359193KB/s, minb=359193KB/s, maxb=359193KB/s, 
mint=11677msec, maxt=11677msec
---

Increasing numjobs of course reduces the performance per job, so numjob=2
will give half the speed per individual job.

So something is fishy with your setup, unless the 5.6MB/s below is the result
PER JOB, which would make it 180MB/s with 32 jobs or even 360MB/s with 64 jobs,
a pretty decent and expected result.

Christian

> FIO - Sequential Write test
> 
> Block Size   Sequential Write Bandwidth (KB/sec)
> 4K           5694
> 32K          141020
> 256K         747421
> 1024K        602236
> 4096K        683029
> 
> 
> Here are my questions:
> - Why is the sequential write performance at 4K block size so low? Is this 
> in-line what others see?
> - Is it because of less number of clients, i.e., traffic generators? I am 
> planning to increase the number of clients to 4 servers.
> - There is a later version on NIC driver from Intel, v4.3.15 - do you think 
> upgrading to later version (v4.3.15) will improve performance?
> 
> Any thoughts or pointers will be helpful.
> 
> Thanks,
> 
> - epk
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential. 
> It is intended to be read only by the individual or entity to whom it is 
> addressed or by their designee. If the reader of this message is not the 
> intended recipient, you are on notice that any distribution of this message, 
> in any form, is strictly prohibited. If you have received this message in 
> error, please immediately notify the sender and delete or destroy any copy of 
> this message!


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question on Sequential Write performance at 4K blocksize

2016-07-13 Thread EP Komarla
Hi All,

Have a question on the performance of sequential write @ 4K block sizes.

Here is my configuration:

Ceph Cluster: 6 Nodes. Each node with :-
20x HDDs (OSDs) - 10K RPM 1.2 TB SAS disks
SSDs - 4x - Intel S3710, 400GB; for OSD journals shared across 20 HDDs (i.e., 
SSD journal ratio 1:5)

Network:
- Client network - 10Gbps
- Cluster network - 10Gbps
- Each node with dual NIC - Intel 82599 ES - driver version 4.0.1

Traffic generators:
2 client servers - running on dual Intel sockets with 16 physical cores (32 
cores with hyper-threading enabled)

Test program:
FIO - sequential read/write; random read/write
Blocksizes - 4k, 32k, 256k...
FIO - Number of jobs = 32; IO depth = 64
Runtime = 10 minutes; Ramptime = 5 minutes
Filesize = 4096g (5TB)

I observe that my sequential write performance at 4K block size is very low - I 
am getting around 6MB/sec bandwidth.  The performance improves significantly at 
larger block sizes (shown below)

FIO - Sequential Write test

Block Size   Sequential Write Bandwidth (KB/sec)
4K           5694
32K          141020
256K         747421
1024K        602236
4096K        683029


Here are my questions:
- Why is the sequential write performance at 4K block size so low? Is this 
in-line what others see?
- Is it because of less number of clients, i.e., traffic generators? I am 
planning to increase the number of clients to 4 servers.
- There is a later version on NIC driver from Intel, v4.3.15 - do you think 
upgrading to later version (v4.3.15) will improve performance?

Any thoughts or pointers will be helpful.

Thanks,

- epk

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about how to start ceph OSDs with systemd

2016-07-11 Thread Ernst Pijper
Hi Manuel,

This is a well known issue. You are definitely not the first one to hit this 
problem. Before Jewel I (and others as well) added the line

ceph-disk activate all

to /etc/rc.local to get the OSDs running at boot. In Jewel, however, this 
doesn't work anymore. Now I add these lines to /etc/rc.local:

for dev in $(ceph-disk list | grep "ceph journal" | awk '{print $1}')
do
   ceph-disk trigger $dev
done

The OSDs are now automatically mounted and started. After this, the systemctl 
commands also work.

Ernst


> On 8 jul. 2016, at 17:59, Manuel Lausch  wrote:
> 
> hi,
> 
> In the last days I do play around with ceph jewel on debian Jessie and CentOS 
> 7. Now I have a question about systemd on this Systems.
> 
> I installed ceph jewel (ceph version 10.2.2 
> (45107e21c568dd033c2f0a3107dec8f0b0e58374)) on debian Jessie and prepared 
> some OSDs. While playing around I decided to reinstall my operating system 
> (of course without deleting the OSD devices ). After the reinstallation of 
> ceph and put in the old ceph.conf I thought the previously prepared OSDs do 
> easily start and all will be fine after that.
> 
> With debian Wheezy and ceph firefly this worked well, but with the new 
> versions and systemd this doesn't work at all. Now what have I to do to get 
> the OSDs running again?
> 
> The following command didn't work and I didn't get any output from it.
>  systemctl start ceph-osd.target
> 
> And this is the output from systemctl status ceph-osd.target
> ● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service 
> instances at once
>   Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled; vendor 
> preset: enabled)
>   Active: active since Fri 2016-07-08 17:19:29 CEST; 36min ago
> 
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Reached target ceph 
> target allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Starting ceph target 
> allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:31:15 cs-dellbrick01.server.lan systemd[1]: Reached target ceph 
> target allowing to start/stop all ceph-osd@.service instances at once.
> 
> 
> 
> thanks,
> Manuel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question about how to start ceph OSDs with systemd

2016-07-08 Thread Tom Barron


On 07/08/2016 11:59 AM, Manuel Lausch wrote:
> hi,
> 
> In the last days I do play around with ceph jewel on debian Jessie and
> CentOS 7. Now I have a question about systemd on this Systems.
> 
> I installed ceph jewel (ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374)) on debian Jessie and
> prepared some OSDs. While playing around I decided to reinstall my
> operating system (of course without deleting the OSD devices ). After
> the reinstallation of ceph and put in the old ceph.conf I thought the
> previously prepared OSDs do easily start and all will be fine after that.
> 
> With debian Wheezy and ceph firefly this worked well, but with the new
> versions and systemd this doesn't work at all. Now what have I to do to
> get the OSDs running again?
> 
> The following command didn't work and I didn't get any output from it.
>   systemctl start ceph-osd.target
> 
> And this is the output from systemctl status ceph-osd.target
> ● ceph-osd.target - ceph target allowing to start/stop all
> ceph-osd@.service instances at once
>Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled;
> vendor preset: enabled)
>Active: active since Fri 2016-07-08 17:19:29 CEST; 36min ago
> 
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Reached target
> ceph target allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Starting ceph
> target allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:31:15 cs-dellbrick01.server.lan systemd[1]: Reached target
> ceph target allowing to start/stop all ceph-osd@.service instances at once.
> 
> 

Manuel,

You may find our changes to the devstack ceph plugin here [1] for
systemd vs upstart vs sysvinit control of ceph services helpful.  We
tested against xenial and fedora24 for the systemd paths.

Cheers,

-- Tom

[1] https://review.openstack.org/#/c/332484/

> 
> thanks,
> Manuel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about how to start ceph OSDs with systemd

2016-07-08 Thread Manuel Lausch

hi,

In the last few days I have been playing around with ceph jewel on Debian 
Jessie and CentOS 7. Now I have a question about systemd on these systems.

I installed ceph jewel (ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374)) on Debian Jessie and prepared 
some OSDs. While playing around I decided to reinstall my operating system 
(of course without deleting the OSD devices). After reinstalling ceph and 
putting the old ceph.conf back in place, I thought the previously prepared 
OSDs would easily start and all would be fine after that.

With Debian Wheezy and ceph firefly this worked well, but with the new 
versions and systemd this doesn't work at all. What do I have to do to 
get the OSDs running again?

The following command didn't work and I didn't get any output from it:
  systemctl start ceph-osd.target

And this is the output from systemctl status ceph-osd.target
● ceph-osd.target - ceph target allowing to start/stop all 
ceph-osd@.service instances at once
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled; 
vendor preset: enabled)

   Active: active since Fri 2016-07-08 17:19:29 CEST; 36min ago

Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Reached target 
ceph target allowing to start/stop all ceph-osd@.service instances at once.
Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Starting ceph 
target allowing to start/stop all ceph-osd@.service instances at once.
Jul 08 17:31:15 cs-dellbrick01.server.lan systemd[1]: Reached target 
ceph target allowing to start/stop all ceph-osd@.service instances at once.




thanks,
Manuel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

